feat(shard-manager): Add support for watching drains #7697
gazi-yestemirova wants to merge 5 commits into cadence-workflow:master
Conversation
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
jakobht left a comment
As far as I understand this will only allow us to drain the leader. If we later want to undrain we have to restart the service.
Did I miss something?
//go:generate mockgen -package $GOPACKAGE -source $GOFILE -destination drain_observer_mock.go . DrainSignalObserver

// DrainSignalObserver observes infra drain signals
// When a drain is detected, ShouldStop() returns a closed channel.
This comment is confusing to me.
I guess ShouldStop() returns a channel that will be closed in case we need to drain. Maybe we could call it DrainChan()?
That is correct, yes; currently it is one-way behaviour. Do we also want to support auto-recovery? Then I think we would need Active and Draining states within the interface.
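For context, a minimal sketch of the state-based alternative mentioned above, assuming hypothetical names such as DrainState, StatefulDrainObserver, and Updates(); this is illustrative only, not the PR's interface:

package drain

// DrainState is an illustrative enum for the explicit states discussed above.
type DrainState int

const (
    StateActive DrainState = iota
    StateDraining
)

// StatefulDrainObserver is a hypothetical variant that notifies subscribers
// on every state change, so a drained host can later recover to Active.
type StatefulDrainObserver interface {
    // State returns the current drain state.
    State() DrainState
    // Updates returns a channel that receives the new state on each change.
    Updates() <-chan DrainState
}

An explicit state enum makes auto-recovery (Draining back to Active) representable, whereas a single closed channel can only signal one way.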
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
🔍 CI failure analysis for 9c03778: Integration test failed due to a replication connectivity timeout during test cleanup, completely unrelated to the PR changes in the shard-distributor namespace manager.

Issue
The "Golang integration test with sqlite" job failed after running for ~680 seconds with a replication error during test cleanup.

Root Cause
This is unrelated to the PR changes. The failure is a replication/infrastructure issue, not caused by the DrainSignalObserver implementation. Evidence:
Details
The integration test suite ran for ~11 minutes (680s) and was shutting down when the replication processor failed to connect to a standby cluster. This is a network connectivity/test infrastructure issue during cleanup, completely unrelated to the shard-distributor drain observer functionality. The drain observer changes affect leadership election behavior in the shard-distributor service, which has no connection to cross-cluster replication task processing.

Code Review 👍 Approved with suggestions (1 resolved / 2 findings)
Clean state machine refactor with a well-designed drain/undrain observer pattern. The previous finding about a comment/code mismatch is resolved; the minor concern about no backoff on the retry loop remains unresolved.

💡 Edge Case: No backoff on retry could cause a tight spin loop
📄 service/sharddistributor/leader/namespace/manager.go:185-189
When …, consider adding a small backoff (e.g., …).

✅ 1 resolved
✅ Bug: Comment says retry on error but code stops permanently
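To illustrate the unresolved finding, here is a hypothetical sketch of retrying with a small, capped backoff instead of spinning in a tight loop; names such as runWithBackoff and attempt are made up, and this is not the manager's actual code:

package namespace

import (
    "context"
    "time"
)

// runWithBackoff retries a failing operation with a capped exponential
// backoff, so persistent errors do not turn into a busy loop.
func runWithBackoff(ctx context.Context, attempt func(context.Context) error) {
    backoff := 100 * time.Millisecond
    const maxBackoff = 5 * time.Second

    for {
        if err := attempt(ctx); err != nil {
            select {
            case <-ctx.Done():
                return // shutting down; stop retrying
            case <-time.After(backoff):
            }
            // Double the delay, but cap it so waits stay bounded.
            if backoff *= 2; backoff > maxBackoff {
                backoff = maxBackoff
            }
            continue
        }
        return // attempt succeeded
    }
}

A capped delay keeps a persistent failure from hammering etcd while still retrying promptly after transient errors.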
Rules: ✅ All requirements met
Repository Rules
Tip: Comment Options
Auto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
What changed?
This PR introduces the DrainSignalObserver interface in clientcommon to allow shard-distributor components to react to infrastructure drain signals.
DrainSignalObserver is a simple interface that allows deployment-specific implementations to signal when this instance has been removed from, or added back to, service discovery. The leader namespace manager subscribes to the drain signal to proactively resign from etcd elections, and listens to the undrain signal to resume leadership operations and campaign again for the namespace.
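A rough sketch of what such a contract could look like; the method names (Drained, Undrained) and the no-op implementation are assumptions for illustration, not the PR's exact API:

package clientcommon

// DrainSignalObserver (sketch): deployment-specific implementations signal
// when this instance is removed from (drained) or added back to (undrained)
// service discovery. Method names here are illustrative assumptions.
type DrainSignalObserver interface {
    // Drained returns a channel that receives a signal when the host
    // is removed from service discovery.
    Drained() <-chan struct{}
    // Undrained returns a channel that receives a signal when the host
    // is added back to service discovery.
    Undrained() <-chan struct{}
}

// noopObserver is a trivial implementation for deployments that never drain.
type noopObserver struct{}

func (noopObserver) Drained() <-chan struct{}   { return nil }
func (noopObserver) Undrained() <-chan struct{} { return nil }

Because receiving from a nil channel blocks forever, a no-op observer like this can sit in the manager's select loop without ever triggering drain handling.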
Why?
The shard-distributor leader holds an etcd lease to coordinate shard assignments across all executors. In production environments, infrastructure operations (e.g. host drains) can remove a service instance from service discovery while the process continues running. Without active detection, the leader in a drained zone continues holding its etcd lease and operating normally - unaware that it is no longer reachable by other components.
How did you test it?
Added unit tests and checked with:
go test -v ./service/sharddistributor/leader/namespace
Potential risks
NA
Release notes
NA
Documentation Changes
NA
Reviewer Validation
PR Description Quality (check these before reviewing code):
go test invocation)