fix: force promote RPC hangs forever due to missing doAckCallback invocation #48535
bigsheeper wants to merge 2 commits into milvus-io:master
…ocation

The force promote path in triggerAckCallback launched a goroutine for doForcePromoteFixIncompleteBroadcasts but never called doAckCallback afterward. doAckCallback is the only function that closes the broadcast task's done channel (via MarkAckCallbackDone), so broadcastScheduler's BlockUntilDone() blocked forever, leaving the UpdateReplicateConfiguration RPC permanently hung.

Fix: chain doAckCallback after doForcePromoteFixIncompleteBroadcasts in the same goroutine. doAckCallback already handles g.Unlock() via its defer, so the separate unlock defer is also removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
handleForcePromote was ignoring the configuration provided by the caller entirely. The package-level validateForcePromoteConfiguration existed to validate it but was never called, making it dead code.

Wire it in: after the secondary-role check, validate that the caller supplied exactly the current cluster with no cross-cluster topology. Empty clusters or extra clusters/topology are now rejected with a clear error, matching the documented API contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
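The validation rule in the second commit (exactly the current cluster, no cross-cluster topology) can be illustrated with a minimal sketch. Everything here is hypothetical: the `replicateConfig` type, its fields, and `validateForcePromote` are simplified stand-ins invented for illustration, not the actual Milvus types or the real validateForcePromoteConfiguration signature.

```go
package main

import (
	"errors"
	"fmt"
)

// replicateConfig is a hypothetical, simplified configuration shape.
// The real Milvus configuration type is not shown in this PR page.
type replicateConfig struct {
	ClusterIDs []string    // clusters listed by the caller
	Topology   [][2]string // cross-cluster edges as (source, target) pairs
}

// validateForcePromote is an illustrative stand-in for the check described
// above: the caller must supply exactly the current cluster and no topology.
func validateForcePromote(cfg replicateConfig, currentClusterID string) error {
	if len(cfg.ClusterIDs) == 0 {
		return errors.New("force promote: configuration contains no clusters")
	}
	if len(cfg.ClusterIDs) != 1 || cfg.ClusterIDs[0] != currentClusterID {
		return fmt.Errorf("force promote: configuration must contain only cluster %q", currentClusterID)
	}
	if len(cfg.Topology) != 0 {
		return errors.New("force promote: cross-cluster topology is not allowed")
	}
	return nil
}

func main() {
	ok := replicateConfig{ClusterIDs: []string{"cluster-b"}}
	fmt.Println(validateForcePromote(ok, "cluster-b"))

	extra := replicateConfig{ClusterIDs: []string{"cluster-a", "cluster-b"}}
	fmt.Println(validateForcePromote(extra, "cluster-b"))
}
```

Rejecting the empty case separately from the "wrong or extra cluster" case keeps the error messages specific, which is in the spirit of the "clear error" the commit message mentions.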
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #48535 +/- ##
===========================================
+ Coverage 74.87% 83.98% +9.11%
===========================================
Files 1484 627 -857
Lines 246812 103896 -142916
===========================================
- Hits 184797 87259 -97538
+ Misses 53665 16637 -37028
+ Partials 8350 0 -8350
/ci-rerun-gosdk
What does this PR do?

Fixes a hang in the UpdateReplicateConfiguration RPC when force_promote=True is set.

Root Cause

In ack_callback_scheduler.go, triggerAckCallback handles force promote messages by launching a goroutine to call doForcePromoteFixIncompleteBroadcasts. The goroutine had its own defer g.Unlock() but never called doAckCallback. doAckCallback is the only function that invokes MarkAckCallbackDone(), which closes the broadcast task's done channel. broadcastScheduler.AddTask calls task.BlockUntilDone() on this channel, so without doAckCallback being called, BlockUntilDone() blocks forever, leaving the force promote RPC permanently hung.

Fix

Chain doAckCallback(task, g) after doForcePromoteFixIncompleteBroadcasts(task) in the same goroutine. doAckCallback already handles g.Unlock() via its internal defer, so the separate unlock defer in the goroutine is also removed.

Tests

Verified with E2E tests covering the full force promote lifecycle:

issue: #47352