Skip to content

feat(shard-distributor): add dynamic enable/disable support for spectator client#7699

Merged
jakobht merged 7 commits intocadence-workflow:masterfrom
jakobht:enableSMspectator
Feb 19, 2026
Merged

feat(shard-distributor): add dynamic enable/disable support for spectator client#7699
jakobht merged 7 commits intocadence-workflow:masterfrom
jakobht:enableSMspectator

Conversation

@jakobht
Copy link
Member

@jakobht jakobht commented Feb 12, 2026

What changed?

  • Added EnabledFunc parameter to spectator client to support dynamic enable/disable
  • Replaced channel-based firstStateCh with ResettableSignal abstraction
  • Fixed test flakiness and goroutine leaks in spectator client tests

Why?
To support migration scenarios where the spectator needs to be disabled dynamically without stopping the service. The resettable signal provides better control for managing state transitions when toggling between enabled/disabled states.

How did you test it?

  • Added unit tests for ResettableSignal
  • Fixed existing tests to properly clean up gRPC transport goroutines
  • Verified tests pass consistently with --count 10 flag

Potential risks
Low risk - adds new optional functionality without changing existing behavior when EnabledFunc is not provided.

Release notes
N/A

Documentation Changes
N/A

return
}

if !s.enabled() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: GetShardOwner blocks forever after disable/enable cycle

After a disable→enable transition, firstStateSignal.Reset() puts the signal back to "waiting" state, but shardToOwner is never cleared to nil. When the spectator re-enables and starts receiving responses, handleResponse() checks s.shardToOwner == nil to determine isFirstState (line 175). Since shardToOwner was already populated before the disable, isFirstState is always false, so firstStateSignal.Done() is never called after the re-enable.

This means any GetShardOwner call after a disable/enable cycle will block indefinitely on firstStateSignal.Wait(ctx) (line 196), only returning when the caller's context is cancelled.

Reproduction scenario:

  1. Spectator starts, receives first state → shardToOwner populated, firstStateSignal.Done() called
  2. EnabledFunc returns falsewatchLoop calls firstStateSignal.Reset()
  3. EnabledFunc returns true → stream reconnects, responses arrive
  4. handleResponse sees shardToOwner != nil → skips firstStateSignal.Done()
  5. GetShardOwner blocks forever on firstStateSignal.Wait()

Suggested fix: Clear shardToOwner when resetting the signal so the first-state detection works correctly after re-enable:

if !s.enabled() {
    s.firstStateSignal.Reset()
    s.stateMu.Lock()
    s.shardToOwner = nil
    s.stateMu.Unlock()
    // ... sleep and continue
}

Suggested fix:

		if !s.enabled() {
			// If spectator is disabled, sleep for a second and continue
			s.firstStateSignal.Reset()
			s.stateMu.Lock()
			s.shardToOwner = nil
			s.stateMu.Unlock()

			err := s.timeSource.SleepWithContext(s.ctx, backoff.JitDuration(streamRetryInterval, streamRetryJitterCoeff))
			if err != nil {
				return
			}
			continue
		}

Was this helpful? React with 👍 / 👎

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

}
// Signal that first state has been received - this function is free to call
// multiple times.
s.firstStateSignal.Done()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Spaces used instead of tab indentation on line 200

Line 200 uses spaces for indentation ( s.firstStateSignal.Done()) while the rest of the file uses tabs (as shown by ^I in surrounding lines). This would be caught and fixed by gofmt. It's a minor formatting inconsistency but breaks the Go convention enforced by standard tooling.

Suggested fix:

	s.firstStateSignal.Done()

Was this helpful? React with 👍 / 👎

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by running linter

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for running the linter and fixing this!

…r not

Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
return s.disabledState
}

response, err := s.stream.Recv()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: enabledState blocks on Recv(), can't detect disable until stream event

In enabledState, the enabled() check at line 139 only runs between Recv() calls. Once Recv() blocks at line 143, the spectator cannot detect that enabled() has become false until the stream produces an event or error. For idle streams during migration, this could mean the spectator remains effectively "enabled" for an unbounded period after EnabledFunc starts returning false.

This undermines the core feature — dynamic disable for migration. If the stream is idle (no shard state changes), the disable signal won't be detected promptly.

Suggested approach: Run Recv() in a separate goroutine and use a select with a ticker so the enabled check can run concurrently with the blocking Recv. Alternatively, accept and document this as a known limitation (disable takes effect on the next stream event).

Was this helpful? React with 👍 / 👎

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, however stream events are very frequent so that is fine for a config change

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. Since stream events are frequent, the delay in detecting the disable signal should be negligible for configuration changes. The current implementation is acceptable for this use case.

@gitar-bot
Copy link

gitar-bot bot commented Feb 18, 2026

Code Review ✅ Approved

Well-designed state machine refactoring with a clean ResettableSignal abstraction. Previous findings (indentation, GetShardOwner blocking, Recv blocking) have been addressed or are acceptable per reviewer feedback. No new issues found.

Rules ❌ No requirements met

Repository Rules

PR Description Quality Standards: Add technical rationale in [Why?] explaining why ResettableSignal was chosen over alternatives, and provide exact copyable test commands in [How did you test it?] such as 'go test -v ./service/sharddistributor/client/spectatorclient -run TestResettableSignal'

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@jakobht jakobht merged commit 391c100 into cadence-workflow:master Feb 19, 2026
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants