
feat(shard-manager): Add support for watching drains#7697

Open
gazi-yestemirova wants to merge 5 commits into cadence-workflow:master from gazi-yestemirova:udg_drain

Conversation

Contributor

@gazi-yestemirova commented Feb 11, 2026

What changed?
This PR introduces a DrainSignalObserver interface in clientcommon so that shard-distributor components can react to infrastructure drain signals.
DrainSignalObserver is a simple interface that lets deployment-specific implementations signal when this instance has been removed from, or added back to, service discovery. The leader namespace manager subscribes to the drain signal to proactively resign from etcd elections, and it listens for the undrain signal to resume leadership operations and campaign again for the namespace.
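
For orientation, the shape of such an observer might look like the sketch below. This is an illustrative sketch only, not the code added by this PR: the ShouldStop() name comes from the interface comment quoted later in the review, the exact signatures in clientcommon/drain_observer.go may differ, the undrain side is omitted for brevity, and watchDrain with its resign callback is purely hypothetical.

```go
// Illustrative sketch only; the real interface lives in
// service/sharddistributor/client/clientcommon/drain_observer.go and may differ.
package clientcommon

// DrainSignalObserver lets deployment-specific code report when this instance
// has been removed from (or added back to) service discovery.
type DrainSignalObserver interface {
	// ShouldStop returns a channel that is closed once a drain is detected.
	ShouldStop() <-chan struct{}
}

// watchDrain is a hypothetical consumer: it blocks on the drain channel and
// invokes resign so the leader can proactively give up its etcd election.
func watchDrain(obs DrainSignalObserver, resign func()) {
	go func() {
		<-obs.ShouldStop() // unblocks when the drain channel is closed
		resign()
	}()
}
```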

Why?
The shard-distributor leader holds an etcd lease to coordinate shard assignments across all executors. In production environments, infrastructure operations (e.g., host drains) can remove a service instance from service discovery while the process continues running. Without active detection, the leader in a drained zone keeps holding its etcd lease and operating normally, unaware that it is no longer reachable by other components.

How did you test it?
Added unit tests and checked with go test -v ./service/sharddistributor/leader/namespace

Potential risks
NA

Release notes
NA

Documentation Changes
NA


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
  • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@gazi-yestemirova changed the title from "Support zone drain" to "feat(shard-manager): Add support for watching drains" on Feb 12, 2026
Member

@jakobht left a comment

As far as I understand this will only allow us to drain the leader. If we later want to undrain we have to restart the service.

Did I miss something?

//go:generate mockgen -package $GOPACKAGE -source $GOFILE -destination drain_observer_mock.go . DrainSignalObserver

// DrainSignalObserver observes infra drain signals
// When a drain is detected, ShouldStop() returns a closed channel.
Member

This comment is confusing to me.

I guess ShouldStop() returns a channel that will be closed in case we need to drain. Maybe we could call it

DrainChan()

@gazi-yestemirova
Contributor Author

As far as I understand this will only allow us to drain the leader. If we later want to undrain we have to restart the service.

Did I miss something?

That is correct, yes; currently it is one-way behaviour. Do we also want to support auto-recovery? Then I think we need to have Active and Draining states within the interface.
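
A hypothetical shape for that kind of two-way observer, purely to illustrate the idea under discussion; the DrainState type, its constants, and Subscribe() are invented names, not the PR's API:

```go
// Hypothetical sketch of an observer with explicit Active/Draining states;
// none of these names come from the PR itself.
package clientcommon

// DrainState models the two states discussed above.
type DrainState int

const (
	DrainStateActive DrainState = iota
	DrainStateDraining
)

// DrainStateObserver reports every Active <-> Draining transition, so a
// drained leader can later resume campaigning without a process restart.
type DrainStateObserver interface {
	// Subscribe returns a channel that delivers the new state on each transition.
	Subscribe() <-chan DrainState
}
```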

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@cadence-workflow deleted a comment from gitar-bot bot on Feb 16, 2026

gitar-bot bot commented Feb 16, 2026

🔍 CI failure analysis for 9c03778: Integration test failed due to replication connectivity timeout during test cleanup - completely unrelated to PR changes in shard-distributor namespace manager

Issue

The "Golang integration test with sqlite" job failed after running for ~680 seconds with a replication error during test cleanup.

Root Cause

This is unrelated to the PR changes. The failure is a replication/infrastructure issue, not caused by the DrainSignalObserver implementation.

Evidence:

  1. Changed files are unrelated: PR only modifies shard-distributor namespace manager files:

    • service/sharddistributor/client/clientcommon/drain_observer.go
    • service/sharddistributor/client/clientcommon/drain_observer_mock.go
    • service/sharddistributor/leader/namespace/manager.go
    • service/sharddistributor/leader/namespace/manager_test.go
  2. Error is in replication layer: The failure occurs in domain_replication_processor.go - the replicator component trying to fetch replication tasks from a standby cluster

  3. Error message indicates infrastructure issue:

    Failed to get replication tasks: "round-robin" peer list has 1 peer but it is not responsive, 
    timed out waiting for a connection to open: context canceled
    
  4. Timing suggests test cleanup: Error occurs at 2026-02-16T11:45:57 during test teardown when contexts are being canceled and task lists are stopping

Details

The integration test suite ran for ~11 minutes (680s) and was shutting down when the replication processor failed to connect to a standby cluster. This is a network connectivity/test infrastructure issue during cleanup, completely unrelated to the shard-distributor drain observer functionality.

The drain observer changes affect leadership election behavior in the shard-distributor service, which has no connection to cross-cluster replication task processing.

Code Review 👍 Approved with suggestions (1 resolved / 2 findings)

Clean state machine refactor with well-designed drain/undrain observer pattern. Previous finding about comment/code mismatch is resolved; the minor concern about no backoff on retry loop remains unresolved.

💡 Edge Case: No backoff on retry could cause tight spin loop

📄 service/sharddistributor/leader/namespace/manager.go:185-189

When campaign() returns stateActive to retry (e.g., line 188 on unexpected channel close), the state machine immediately re-enters campaign() with no backoff delay. If the elector creation or the election channel consistently fails, this creates a hot spin loop that will flood logs and consume CPU.

Consider adding a small backoff (e.g., time.Sleep with exponential backoff or a fixed delay like 1-5 seconds) before retrying, or use a time.After in the state machine loop. This is especially relevant if the CreateElector error (line 173) is changed to stateActive per the other finding.
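
A minimal sketch of the suggested backoff, assuming a state-machine loop shaped like the one this finding describes; the state type, its constants, and runWithBackoff are illustrative stand-ins rather than the actual manager.go code:

```go
// Illustrative sketch only; state names mirror the finding's description,
// not the actual types in service/sharddistributor/leader/namespace/manager.go.
package namespace

import (
	"context"
	"time"
)

type state int

const (
	stateIdle state = iota
	stateActive
	stateStop
)

// runWithBackoff re-enters campaign() on stateActive, but sleeps with a capped
// exponential backoff between attempts so repeated failures cannot hot-spin.
func runWithBackoff(ctx context.Context, campaign func(context.Context) state) {
	delay := time.Second
	const maxDelay = 30 * time.Second
	for {
		switch campaign(ctx) {
		case stateStop, stateIdle:
			return // stopped or drained; the caller decides what happens next
		case stateActive:
			select {
			case <-ctx.Done():
				return
			case <-time.After(delay): // back off before re-campaigning
			}
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
}
```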

✅ 1 resolved
Bug: Comment says retry on error but code stops permanently

📄 service/sharddistributor/leader/namespace/manager.go:170-174
The doc comment on campaign() at line 148 states "Returns stateIdle if drained, stateActive to retry on error, or stateStop." However, line 173 returns stateStop when CreateElector fails, which permanently kills the handler goroutine.

This is likely a bug, not just a stale comment: in a production zone-drain scenario, transient issues (e.g., temporary etcd connectivity problems) during elector creation would permanently stop the namespace handler with no recovery path. The handler would need a manager restart to resume.

By contrast, when the leaderCh closes unexpectedly (line 187-188), the code correctly returns stateActive to retry — confirming the intent to be resilient to transient errors.

Consider returning stateActive (with a backoff/sleep to avoid tight-looping) to match the documented and intended retry behavior, or add a bounded retry count.
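
For comparison, a bounded-retry variant of the same idea, again illustrative only; it reuses the hypothetical state constants and imports from the sketch above, and createElector stands in for the real CreateElector call:

```go
// Illustrative sketch only; relies on the hypothetical state constants above,
// and createElector is a stand-in for the real CreateElector call.
func campaignWithRetry(ctx context.Context, createElector func(context.Context) error) state {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := createElector(ctx); err == nil {
			return stateActive // elector is up; continue with the election
		}
		select {
		case <-ctx.Done():
			return stateStop
		case <-time.After(time.Duration(attempt) * time.Second): // linear backoff between attempts
		}
	}
	return stateStop // exhausted bounded retries; stop this handler
}
```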

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: All required sections properly filled out with substantive content explaining the DrainSignalObserver interface (including both drain and undrain functionality), the etcd lease problem during zone drains, test commands, and appropriate N/A markings for risks/release notes

