feat(shard-distributor): add dynamic enable/disable support for spectator client#7699
Conversation
df16e85 to
44a877b
Compare
| return | ||
| } | ||
|
|
||
| if !s.enabled() { |
There was a problem hiding this comment.
⚠️ Bug: GetShardOwner blocks forever after disable/enable cycle
After a disable→enable transition, firstStateSignal.Reset() puts the signal back to "waiting" state, but shardToOwner is never cleared to nil. When the spectator re-enables and starts receiving responses, handleResponse() checks s.shardToOwner == nil to determine isFirstState (line 175). Since shardToOwner was already populated before the disable, isFirstState is always false, so firstStateSignal.Done() is never called after the re-enable.
This means any GetShardOwner call after a disable/enable cycle will block indefinitely on firstStateSignal.Wait(ctx) (line 196), only returning when the caller's context is cancelled.
Reproduction scenario:
- Spectator starts, receives first state →
shardToOwnerpopulated,firstStateSignal.Done()called EnabledFuncreturnsfalse→watchLoopcallsfirstStateSignal.Reset()EnabledFuncreturnstrue→ stream reconnects, responses arrivehandleResponseseesshardToOwner != nil→ skipsfirstStateSignal.Done()GetShardOwnerblocks forever onfirstStateSignal.Wait()
Suggested fix: Clear shardToOwner when resetting the signal so the first-state detection works correctly after re-enable:
if !s.enabled() {
s.firstStateSignal.Reset()
s.stateMu.Lock()
s.shardToOwner = nil
s.stateMu.Unlock()
// ... sleep and continue
}Suggested fix:
if !s.enabled() {
// If spectator is disabled, sleep for a second and continue
s.firstStateSignal.Reset()
s.stateMu.Lock()
s.shardToOwner = nil
s.stateMu.Unlock()
err := s.timeSource.SleepWithContext(s.ctx, backoff.JitDuration(streamRetryInterval, streamRetryJitterCoeff))
if err != nil {
return
}
continue
}
Was this helpful? React with 👍 / 👎
| } | ||
| // Signal that first state has been received - this function is free to call | ||
| // multiple times. | ||
| s.firstStateSignal.Done() |
There was a problem hiding this comment.
💡 Quality: Spaces used instead of tab indentation on line 200
Line 200 uses spaces for indentation ( s.firstStateSignal.Done()) while the rest of the file uses tabs (as shown by ^I in surrounding lines). This would be caught and fixed by gofmt. It's a minor formatting inconsistency but breaks the Go convention enforced by standard tooling.
Suggested fix:
s.firstStateSignal.Done()
Was this helpful? React with 👍 / 👎
There was a problem hiding this comment.
Great, thanks for running the linter and fixing this!
…r not Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
53065af to
f7b980b
Compare
Signed-off-by: Jakob Haahr Taankvist <jht@uber.com>
| return s.disabledState | ||
| } | ||
|
|
||
| response, err := s.stream.Recv() |
There was a problem hiding this comment.
⚠️ Bug: enabledState blocks on Recv(), can't detect disable until stream event
In enabledState, the enabled() check at line 139 only runs between Recv() calls. Once Recv() blocks at line 143, the spectator cannot detect that enabled() has become false until the stream produces an event or error. For idle streams during migration, this could mean the spectator remains effectively "enabled" for an unbounded period after EnabledFunc starts returning false.
This undermines the core feature — dynamic disable for migration. If the stream is idle (no shard state changes), the disable signal won't be detected promptly.
Suggested approach: Run Recv() in a separate goroutine and use a select with a ticker so the enabled check can run concurrently with the blocking Recv. Alternatively, accept and document this as a known limitation (disable takes effect on the next stream event).
Was this helpful? React with 👍 / 👎
There was a problem hiding this comment.
This is true, however stream events are very frequent so that is fine for a config change
There was a problem hiding this comment.
Understood. Since stream events are frequent, the delay in detecting the disable signal should be negligible for configuration changes. The current implementation is acceptable for this use case.
Code Review ✅ ApprovedWell-designed state machine refactoring with a clean ResettableSignal abstraction. Previous findings (indentation, GetShardOwner blocking, Recv blocking) have been addressed or are acceptable per reviewer feedback. No new issues found. Rules ❌ No requirements metRepository Rules
1 rule not applicable. Show all rules by commenting OptionsAuto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
What changed?
EnabledFuncparameter to spectator client to support dynamic enable/disablefirstStateChwithResettableSignalabstractionWhy?
To support migration scenarios where the spectator needs to be disabled dynamically without stopping the service. The resettable signal provides better control for managing state transitions when toggling between enabled/disabled states.
How did you test it?
ResettableSignal--count 10flagPotential risks
Low risk - adds new optional functionality without changing existing behavior when
EnabledFuncis not provided.Release notes
N/A
Documentation Changes
N/A