fix: retry on REPLICATE_VIOLATION for global cluster region switch #3285

Open
huanghaoyuanhhy wants to merge 1 commit into milvus-io:master from huanghaoyuanhhy:fix/global-cluster-replicate-violation-retry

@huanghaoyuanhhy

Summary

  • When a Global Cluster switches its primary region, write operations to the old primary (now secondary) fail with STREAMING_CODE_REPLICATE_VIOLATION
  • Previously this MilvusException was not handled in the retry decorator, so writes failed for up to 5 minutes until the background TopologyRefresher (300s interval) detected the change
  • Add _handle_global_routing_error() in GrpcHandler to detect REPLICATE_VIOLATION and trigger immediate topology refresh
  • Hook it into the retry_on_rpc_failure decorator's MilvusException branch (both sync and async) so the operation retries automatically after refresh
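
The flow described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `FakeMilvusException`, `FakeHandler`, `refresh_topology`, and `retry_on_failure` are hypothetical stand-ins, and matching the error by substring is an assumption (the real handler may inspect an error code instead). Only the name `_handle_global_routing_error` and the error string come from the summary.

```python
import time
from functools import wraps

# Error marker taken from the PR summary; matching it by substring is an assumption.
REPLICATE_VIOLATION = "STREAMING_CODE_REPLICATE_VIOLATION"


class FakeMilvusException(Exception):
    """Hypothetical stand-in for the client's exception type."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


def retry_on_failure(max_retries=3, backoff=0.0):
    """Hypothetical, simplified retry decorator (sync only)."""

    def decorator(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(self, *args, **kwargs)
                except FakeMilvusException as exc:
                    attempt += 1
                    # Give up when retries are exhausted or when the error
                    # is not one that a topology refresh can fix.
                    if attempt > max_retries or not self._handle_global_routing_error(exc):
                        raise
                    time.sleep(backoff)

        return wrapper

    return decorator


class FakeHandler:
    """Toy handler: a write succeeds only when routed to the current primary."""

    def __init__(self):
        self.primary = "region-b"    # actual primary after the region switch
        self.routed_to = "region-a"  # stale view still held by the client
        self.refresh_count = 0

    def refresh_topology(self):
        # Hypothetical: re-read the topology and re-route to the primary.
        self.refresh_count += 1
        self.routed_to = self.primary

    def _handle_global_routing_error(self, exc):
        """Return True if the error was a replicate violation and a refresh ran."""
        if REPLICATE_VIOLATION in exc.message:
            self.refresh_topology()
            return True
        return False

    @retry_on_failure(max_retries=3, backoff=0.0)
    def insert(self, row):
        if self.routed_to != self.primary:
            raise FakeMilvusException(f"{REPLICATE_VIOLATION}: write sent to a secondary")
        return f"inserted into {self.routed_to}"
```

In this sketch, the first `insert` call hits the stale region, the handler refreshes the topology once, and the retried call succeeds. Any exception that does not carry the replicate-violation marker is re-raised unchanged.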

Test plan

  • Deploy a Global Cluster on Zilliz Cloud (UAT)
  • Run continuous insert + search loop
  • Switch primary region in console
  • Before fix: INSERT fails with REPLICATE_VIOLATION for ~5 minutes until background refresh
  • After fix: INSERT auto-recovers in ~10 seconds (topology refresh + retry backoff)
  • Unit tests for _handle_global_routing_error (4 new tests, all passing)
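
For readers who want a sense of what unit tests for such a handler might cover, here is a hedged sketch. It does not reproduce the four tests in this PR; `handle_global_routing_error` is a hypothetical free-function version of the handler, `trigger_refresh` is an invented method name on a mocked refresher, and the substring match is an assumption.

```python
from unittest.mock import MagicMock

# Error marker from the PR summary; substring matching is an assumption.
REPLICATE_VIOLATION = "STREAMING_CODE_REPLICATE_VIOLATION"


def handle_global_routing_error(exc, refresher):
    """Hypothetical handler: refresh the topology on a replicate violation.

    Returns True when a refresh was triggered, so the caller knows a retry
    is worthwhile; returns False for all other errors.
    """
    if refresher is None:
        # Not connected to a global cluster: nothing to refresh.
        return False
    if REPLICATE_VIOLATION not in str(exc):
        return False
    refresher.trigger_refresh()
    return True


def test_violation_triggers_refresh():
    refresher = MagicMock()
    exc = Exception(f"code={REPLICATE_VIOLATION}")
    assert handle_global_routing_error(exc, refresher) is True
    refresher.trigger_refresh.assert_called_once()


def test_other_error_is_ignored():
    refresher = MagicMock()
    assert handle_global_routing_error(Exception("collection not found"), refresher) is False
    refresher.trigger_refresh.assert_not_called()


def test_no_refresher_configured():
    exc = Exception(f"code={REPLICATE_VIOLATION}")
    assert handle_global_routing_error(exc, None) is False
```

The three cases mirror the behaviour the summary describes: refresh on a replicate violation, leave unrelated errors alone, and do nothing when no global-cluster refresher exists.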

When a Global Cluster switches its primary region, write operations to
the old primary fail with STREAMING_CODE_REPLICATE_VIOLATION. Previously
this error was not handled in the retry decorator, causing writes to
fail for up to 5 minutes until the background topology refresher ran.

Add _handle_global_routing_error() to detect REPLICATE_VIOLATION and
trigger an immediate topology refresh with retry, enabling automatic
recovery in seconds instead of minutes.

Signed-off-by: huanghaoyuanhhy <haoyuan.huang@zilliz.com>
sre-ci-robot requested a review from czs007 on February 17, 2026, 05:11
@sre-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huanghaoyuanhhy
To complete the pull request process, please assign longjiquan after the PR has been reviewed.
You can assign the PR to them by writing /assign @longjiquan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot

Welcome @huanghaoyuanhhy! It looks like this is your first PR to milvus-io/pymilvus 🎉
