
feat(shard-manager): Add support for watching drains#7697

Open
gazi-yestemirova wants to merge 5 commits into cadence-workflow:master from gazi-yestemirova:udg_drain

Conversation

Contributor

@gazi-yestemirova commented Feb 11, 2026

What changed?
This PR introduces a DrainSignalObserver interface in clientcommon so that shard-distributor components can react to infrastructure drain signals.
DrainSignalObserver is a simple interface that lets deployment-specific implementations signal when this instance has been removed from, or added back to, service discovery. The leader namespace manager subscribes to the drain signal to proactively resign from etcd elections, and it listens for the undrain signal to resume leadership operations and campaign again for the namespace.
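
For orientation, the shape of such an observer might look like the sketch below. This is an illustrative sketch only, not the code added by this PR: the ShouldStop() name comes from the interface comment quoted later in the review, the exact signatures in clientcommon/drain_observer.go may differ, the undrain side is omitted for brevity, and watchDrain with its resign callback is purely hypothetical.

```go
// Illustrative sketch only; the real interface lives in
// service/sharddistributor/client/clientcommon/drain_observer.go and may differ.
package clientcommon

// DrainSignalObserver lets deployment-specific code report when this instance
// has been removed from (or added back to) service discovery.
type DrainSignalObserver interface {
	// ShouldStop returns a channel that is closed once a drain is detected.
	ShouldStop() <-chan struct{}
}

// watchDrain is a hypothetical consumer: it blocks on the drain channel and
// invokes resign so the leader can proactively give up its etcd election.
func watchDrain(obs DrainSignalObserver, resign func()) {
	go func() {
		<-obs.ShouldStop() // unblocks when the drain channel is closed
		resign()
	}()
}
```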

Why?
The shard-distributor leader holds an etcd lease to coordinate shard assignments across all executors. In production environments, infrastructure operations (e.g., host drains) can remove a service instance from service discovery while the process continues running. Without active detection, the leader in a drained zone keeps holding its etcd lease and operating normally, unaware that it is no longer reachable by other components.

How did you test it?
Added unit tests and checked with go test -v ./service/sharddistributor/leader/namespace

Potential risks
NA

Release notes
NA

Documentation Changes
NA


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
  • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@gazi-yestemirova changed the title from "Support zone drain" to "feat(shard-manager): Add support for watching drains" on Feb 12, 2026
Member

@jakobht left a comment

As far as I understand this will only allow us to drain the leader. If we later want to undrain we have to restart the service.

Did I miss something?

//go:generate mockgen -package $GOPACKAGE -source $GOFILE -destination drain_observer_mock.go . DrainSignalObserver

// DrainSignalObserver observes infra drain signals
// When a drain is detected, ShouldStop() returns a closed channel.
Member

This comment is confusing to me.

I guess ShouldStop() returns a channel that will be closed in case we need to drain. Maybe we could call it

DrainChan()

@gazi-yestemirova
Contributor Author

As far as I understand this will only allow us to drain the leader. If we later want to undrain we have to restart the service.

Did I miss something?

That is correct, yes; currently it is one-way behaviour. Do we also want to support auto-recovery? Then I think we need to have Active and Draining states within the interface.
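
A hypothetical shape for that kind of two-way observer, purely to illustrate the idea under discussion; the DrainState type, its constants, and Subscribe() are invented names, not the PR's API:

```go
// Hypothetical sketch of an observer with explicit Active/Draining states;
// none of these names come from the PR itself.
package clientcommon

// DrainState models the two states discussed above.
type DrainState int

const (
	DrainStateActive DrainState = iota
	DrainStateDraining
)

// DrainStateObserver reports every Active <-> Draining transition, so a
// drained leader can later resume campaigning without a process restart.
type DrainStateObserver interface {
	// Subscribe returns a channel that delivers the new state on each transition.
	Subscribe() <-chan DrainState
}
```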

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@cadence-workflow deleted a comment from gitar-bot bot on Feb 16, 2026

gitar-bot bot commented Feb 16, 2026

🔍 CI failure analysis for 9c03778: Integration test failed due to replication connectivity timeout during test cleanup - completely unrelated to PR changes in shard-distributor namespace manager

Issue

The "Golang integration test with sqlite" job failed after running for ~680 seconds with a replication error during test cleanup.

Root Cause

This is unrelated to the PR changes. The failure is a replication/infrastructure issue, not caused by the DrainSignalObserver implementation.

Evidence:

  1. Changed files are unrelated: PR only modifies shard-distributor namespace manager files:

    • service/sharddistributor/client/clientcommon/drain_observer.go
    • service/sharddistributor/client/clientcommon/drain_observer_mock.go
    • service/sharddistributor/leader/namespace/manager.go
    • service/sharddistributor/leader/namespace/manager_test.go
  2. Error is in replication layer: The failure occurs in domain_replication_processor.go - the replicator component trying to fetch replication tasks from a standby cluster

  3. Error message indicates infrastructure issue:

    Failed to get replication tasks: "round-robin" peer list has 1 peer but it is not responsive, 
    timed out waiting for a connection to open: context canceled
    
  4. Timing suggests test cleanup: Error occurs at 2026-02-16T11:45:57 during test teardown when contexts are being canceled and task lists are stopping

Details

The integration test suite ran for ~11 minutes (680s) and was shutting down when the replication processor failed to connect to a standby cluster. This is a network connectivity/test infrastructure issue during cleanup, completely unrelated to the shard-distributor drain observer functionality.

The drain observer changes affect leadership election behavior in the shard-distributor service, which has no connection to cross-cluster replication task processing.

Code Review 👍 Approved with suggestions (1 resolved / 2 findings)

Clean state machine refactor with well-designed drain/undrain observer pattern. Previous finding about comment/code mismatch is resolved; the minor concern about no backoff on retry loop remains unresolved.

💡 Edge Case: No backoff on retry could cause tight spin loop

📄 service/sharddistributor/leader/namespace/manager.go:185-189

When campaign() returns stateActive to retry (e.g., line 188 on unexpected channel close), the state machine immediately re-enters campaign() with no backoff delay. If the elector creation or the election channel consistently fails, this creates a hot spin loop that will flood logs and consume CPU.

Consider adding a small backoff (e.g., time.Sleep with exponential backoff or a fixed delay like 1-5 seconds) before retrying, or use a time.After in the state machine loop. This is especially relevant if the CreateElector error (line 173) is changed to stateActive per the other finding.
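
A minimal sketch of the suggested backoff, assuming a state-machine loop shaped like the one this finding describes; the state type, its constants, and runWithBackoff are illustrative stand-ins rather than the actual manager.go code:

```go
// Illustrative sketch only; state names mirror the finding's description,
// not the actual types in service/sharddistributor/leader/namespace/manager.go.
package namespace

import (
	"context"
	"time"
)

type state int

const (
	stateIdle state = iota
	stateActive
	stateStop
)

// runWithBackoff re-enters campaign() on stateActive, but sleeps with a capped
// exponential backoff between attempts so repeated failures cannot hot-spin.
func runWithBackoff(ctx context.Context, campaign func(context.Context) state) {
	delay := time.Second
	const maxDelay = 30 * time.Second
	for {
		switch campaign(ctx) {
		case stateStop, stateIdle:
			return // stopped or drained; the caller decides what happens next
		case stateActive:
			select {
			case <-ctx.Done():
				return
			case <-time.After(delay): // back off before re-campaigning
			}
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
}
```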

✅ 1 resolved
Bug: Comment says retry on error but code stops permanently

📄 service/sharddistributor/leader/namespace/manager.go:170-174
The doc comment on campaign() at line 148 states "Returns stateIdle if drained, stateActive to retry on error, or stateStop." However, line 173 returns stateStop when CreateElector fails, which permanently kills the handler goroutine.

This is likely a bug, not just a stale comment: in a production zone-drain scenario, transient issues (e.g., temporary etcd connectivity problems) during elector creation would permanently stop the namespace handler with no recovery path. The handler would need a manager restart to resume.

By contrast, when the leaderCh closes unexpectedly (line 187-188), the code correctly returns stateActive to retry — confirming the intent to be resilient to transient errors.

Consider returning stateActive (with a backoff/sleep to avoid tight-looping) to match the documented and intended retry behavior, or add a bounded retry count.
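
For comparison, a bounded-retry variant of the same idea, again illustrative only; it reuses the hypothetical state constants and imports from the sketch above, and createElector stands in for the real CreateElector call:

```go
// Illustrative sketch only; relies on the hypothetical state constants above,
// and createElector is a stand-in for the real CreateElector call.
func campaignWithRetry(ctx context.Context, createElector func(context.Context) error) state {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := createElector(ctx); err == nil {
			return stateActive // elector is up; continue with the election
		}
		select {
		case <-ctx.Done():
			return stateStop
		case <-time.After(time.Duration(attempt) * time.Second): // linear backoff between attempts
		}
	}
	return stateStop // exhausted bounded retries; stop this handler
}
```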

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: All required sections properly filled out with substantive content explaining the DrainSignalObserver interface (including both drain and undrain functionality), the etcd lease problem during zone drains, test commands, and appropriate N/A markings for risks/release notes

