Skip to content

fix: force promote RPC hangs forever due to missing doAckCallback invocation#48535

Open
bigsheeper wants to merge 2 commits intomilvus-io:masterfrom
bigsheeper:worktree-fix-force-promote-rpc-hang
Open

fix: force promote RPC hangs forever due to missing doAckCallback invocation#48535
bigsheeper wants to merge 2 commits intomilvus-io:masterfrom
bigsheeper:worktree-fix-force-promote-rpc-hang

Conversation

@bigsheeper
Copy link
Contributor

What does this PR do?

Fixes a hang in the UpdateReplicateConfiguration RPC when force_promote=True is set.

Root Cause

In ack_callback_scheduler.go, triggerAckCallback handles force promote messages by launching a goroutine to call doForcePromoteFixIncompleteBroadcasts. The goroutine had its own defer g.Unlock() but never called doAckCallback.

doAckCallback is the only function that invokes MarkAckCallbackDone(), which closes the broadcast task's done channel. broadcastScheduler.AddTask calls task.BlockUntilDone() on this channel, so without doAckCallback being called, BlockUntilDone() blocks forever — leaving the force promote RPC permanently hung.

Fix

Chain doAckCallback(task, g) after doForcePromoteFixIncompleteBroadcasts(task) in the same goroutine. doAckCallback already handles g.Unlock() via its internal defer, so the separate unlock defer in the goroutine is also removed.

// Before (buggy):
go func() {
    defer func() {
        s.rkLockerMu.Lock()
        g.Unlock()
        s.rkLockerMu.Unlock()
    }()
    s.doForcePromoteFixIncompleteBroadcasts(task)
}()

// After (fixed):
go func() {
    s.doForcePromoteFixIncompleteBroadcasts(task)
    s.doAckCallback(task, g)  // closes done channel, unblocks BlockUntilDone()
}()

Tests

Verified with E2E tests covering the full force promote lifecycle:

  • Force promote on primary cluster rejected
  • Force promote with non-empty clusters/topology rejected
  • Force promote succeeds on secondary with data integrity (1000 rows replicated + 500 new rows post-promote)
  • Force promote on already-promoted cluster rejected

issue: #47352

…ocation

The force promote path in triggerAckCallback launched a goroutine for
doForcePromoteFixIncompleteBroadcasts but never called doAckCallback
afterward. doAckCallback is the only function that closes the broadcast
task's done channel (via MarkAckCallbackDone), so broadcastScheduler's
BlockUntilDone() blocked forever, leaving the UpdateReplicateConfiguration
RPC permanently hung.

Fix: chain doAckCallback after doForcePromoteFixIncompleteBroadcasts in
the same goroutine. doAckCallback already handles g.Unlock() via its
defer, so the separate unlock defer is also removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@sre-ci-robot sre-ci-robot requested review from chyezh and zwd1208 March 26, 2026 03:45
@sre-ci-robot sre-ci-robot added the size/S Denotes a PR that changes 10-29 lines. label Mar 26, 2026
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bigsheeper
To complete the pull request process, please assign yanliang567 after the PR has been reviewed.
You can assign the PR to them by writing /assign @yanliang567 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Mar 26, 2026
@sre-ci-robot
Copy link
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-ciloop // for ci-v2/ciloop (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, AMD)
  • /ci-rerun-gosdk-arm // for ci-v2/go-sdk-arm (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

handleForcePromote was ignoring the configuration provided by the caller
entirely. The package-level validateForcePromoteConfiguration existed to
validate it but was never called, making it dead code.

Wire it in: after the secondary-role check, validate that the caller
supplied exactly the current cluster with no cross-cluster topology.
Empty clusters or extra clusters/topology are now rejected with a clear
error, matching the documented API contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
@sre-ci-robot sre-ci-robot added size/M Denotes a PR that changes 30-99 lines. low-code-coverage add test-label from zhikun, diff coverage > 80% and removed size/S Denotes a PR that changes 10-29 lines. labels Mar 26, 2026
@codecov
Copy link

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.98%. Comparing base (8c8062f) to head (e595e25).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #48535       +/-   ##
===========================================
+ Coverage   74.87%   83.98%    +9.11%     
===========================================
  Files        1484      627      -857     
  Lines      246812   103896   -142916     
===========================================
- Hits       184797    87259    -97538     
+ Misses      53665    16637    -37028     
+ Partials     8350        0     -8350     
Components Coverage Δ
Client ∅ <ø> (∅)
Core 83.98% <ø> (∅)
Go ∅ <ø> (∅)
see 2111 files with indirect coverage changes
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot sre-ci-robot removed the low-code-coverage add test-label from zhikun, diff coverage > 80% label Mar 26, 2026
@zhikunyao
Copy link
Collaborator

/ci-rerun-gosdk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-passed DCO check passed. kind/bug Issues or changes related a bug size/M Denotes a PR that changes 30-99 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants