fix: retry on REPLICATE_VIOLATION for global cluster region switch #3285
Open
huanghaoyuanhhy wants to merge 1 commit into milvus-io:master
When a Global Cluster switches its primary region, write operations to the old primary fail with STREAMING_CODE_REPLICATE_VIOLATION. Previously this error was not handled in the retry decorator, causing writes to fail for up to 5 minutes until the background topology refresher ran.

Add _handle_global_routing_error() to detect REPLICATE_VIOLATION and trigger an immediate topology refresh with retry, enabling automatic recovery in seconds instead of minutes.

Signed-off-by: huanghaoyuanhhy <haoyuan.huang@zilliz.com>
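The recovery flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `FakeMilvusError`, `FakeHandler`, `FakeClient`, `refresh_topology`, `retry_on_routing_error`, and the string match on the error message are all assumptions made for the example. Only the names `_handle_global_routing_error` and `STREAMING_CODE_REPLICATE_VIOLATION` come from the description.

```python
import functools

REPLICATE_VIOLATION = "STREAMING_CODE_REPLICATE_VIOLATION"


class FakeMilvusError(Exception):
    """Hypothetical stand-in for pymilvus' MilvusException."""


class FakeHandler:
    """Hypothetical stand-in for GrpcHandler; tracks which region is primary."""

    def __init__(self):
        self.known_primary = "region-a"   # what the client believes
        self.actual_primary = "region-b"  # what the cluster switched to
        self.refresh_count = 0

    def refresh_topology(self):
        # Assumed helper: re-reads the cluster topology immediately.
        self.known_primary = self.actual_primary
        self.refresh_count += 1

    def _handle_global_routing_error(self, exc):
        # Returns True when the error was a routing error and a refresh ran,
        # which tells the decorator that a retry is worthwhile.
        if REPLICATE_VIOLATION in str(exc):
            self.refresh_topology()
            return True
        return False


def retry_on_routing_error(max_retries=3):
    """Simplified sketch of a retry decorator's MilvusException branch."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(self, *args, **kwargs)
                except FakeMilvusError as exc:
                    out_of_retries = attempt == max_retries
                    # Re-raise unless the handler refreshed the topology.
                    if out_of_retries or not self._handle_global_routing_error(exc):
                        raise
        return wrapper
    return decorator


class FakeClient(FakeHandler):
    @retry_on_routing_error(max_retries=3)
    def insert(self, row):
        # Writes are rejected while the client still targets the old primary.
        if self.known_primary != self.actual_primary:
            raise FakeMilvusError(f"{REPLICATE_VIOLATION}: not primary")
        return f"wrote {row} to {self.known_primary}"
```

In this sketch the first `insert` call fails against the stale primary, the handler refreshes the topology once, and the retry succeeds. Errors that do not mention `REPLICATE_VIOLATION` are re-raised unchanged.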
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: huanghaoyuanhhy
Summary
- When a Global Cluster switches its primary region, write operations to the old primary fail with `STREAMING_CODE_REPLICATE_VIOLATION`
- This `MilvusException` was not handled in the retry decorator, so writes failed for up to 5 minutes until the background `TopologyRefresher` (300s interval) detected the change
- Add `_handle_global_routing_error()` in `GrpcHandler` to detect `REPLICATE_VIOLATION` and trigger an immediate topology refresh
- Call it from the `retry_on_rpc_failure` decorator's `MilvusException` branch (both sync and async) so the operation retries automatically after refresh

Test plan
- Without the fix, writes fail with `REPLICATE_VIOLATION` for ~5 minutes until background refresh
- Tests for `_handle_global_routing_error` (4 new tests, all passing)
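For readers who want to see what tests of such a handler might look like, here is a hedged sketch using `unittest` and `unittest.mock`. It does not reproduce the four tests in this PR. The free function `handle_global_routing_error`, its `refresher` argument, and the `refresh()` method name are hypothetical and exist only for this illustration.

```python
import unittest
from unittest.mock import MagicMock

REPLICATE_VIOLATION = "STREAMING_CODE_REPLICATE_VIOLATION"


def handle_global_routing_error(exc, refresher):
    """Hypothetical handler: refresh topology on REPLICATE_VIOLATION.

    Returns True if the caller should retry, False otherwise.
    """
    if refresher is None or REPLICATE_VIOLATION not in str(exc):
        return False
    refresher.refresh()  # assumed method name, not taken from pymilvus
    return True


class HandleGlobalRoutingErrorTest(unittest.TestCase):
    def test_refreshes_on_replicate_violation(self):
        refresher = MagicMock()
        exc = RuntimeError(f"{REPLICATE_VIOLATION}: not primary")
        self.assertTrue(handle_global_routing_error(exc, refresher))
        refresher.refresh.assert_called_once()

    def test_ignores_unrelated_errors(self):
        refresher = MagicMock()
        exc = RuntimeError("timeout")
        self.assertFalse(handle_global_routing_error(exc, refresher))
        refresher.refresh.assert_not_called()

    def test_no_refresher_means_no_retry(self):
        # Without a topology refresher there is nothing to refresh,
        # so the error should propagate instead of being retried.
        exc = RuntimeError(REPLICATE_VIOLATION)
        self.assertFalse(handle_global_routing_error(exc, None))
```

The three cases cover the behaviour the summary implies: a refresh on the routing error, no refresh on unrelated errors, and no retry when no refresher is configured.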