Skip to content

fix(shard-manager): Flaky spectator client unit test#7734

Merged
gazi-yestemirova merged 5 commits intocadence-workflow:masterfrom
gazi-yestemirova:fix_spectator_test
Feb 23, 2026
Merged

fix(shard-manager): Flaky spectator client unit test#7734
gazi-yestemirova merged 5 commits intocadence-workflow:masterfrom
gazi-yestemirova:fix_spectator_test

Conversation

@gazi-yestemirova
Copy link
Contributor

@gazi-yestemirova gazi-yestemirova commented Feb 23, 2026

What changed?
This PR updates the flaky spectatorclient TestWatchLoopDisabled unit test. It is fixed by removing the racy assertion on Wait() and instead we verify that the disabled loop sleeps periodically and shuts down cleanly.

Why?
Why the test was flaky:

  1. spectator.Start() launches the watchLoop goroutine, which enters disabledState
  2. disabledState calls s.firstStateSignal.Reset() once at the top, then enters the sleep loop
  3. The test goroutine calls stateSignal.Wait(context.Background())
    The race was between steps 2 and 3:
  • If Wait() captures resetCh before Reset(): Reset() closes the old resetCh, Wait() returns ErrReset and test passes.
  • If Wait() captures channels after Reset(): it gets the new resetCh (still open). Since Reset() is only called once at the top of disabledState (not in the loop), this Wait() blocks until spectator.Stop() calls Done(), returning nil. The assert.Error on line 285 fails.

How did you test it?
Unit tests with - go test -v ./service/sharddistributor/client/spectatorclient

Potential risks
Low risk since this is a unit test fix

Release notes
N/A

Documentation Changes
N/A


Reviewer Validation

PR Description Quality (check these before reviewing code):

  • "What changed" provides a clear 1-2 line summary
    • Project Issue is linked
  • "Why" explains the full motivation with sufficient context
  • Testing is documented:
    • Unit test commands are included (with exact go test invocation)
    • Integration test setup/commands included (if integration tests were run)
    • Canary testing details included (if canary was mentioned)
  • Potential risks section is thoughtfully filled out (or legitimately N/A)
  • Release notes included if this completes a user-facing feature
  • Documentation needs are addressed (or noted if uncertain)

Signed-off-by: Gaziza Yestemirova <gaziza@uber.com>
@gitar-bot
Copy link

gitar-bot bot commented Feb 23, 2026

🔍 CI failure analysis for 9d22159: The codecov CI job was terminated after exceeding the maximum execution time while running tests, which is unrelated to the PR changes.

Issue

The codecov CI job failed with exit code 143, indicating the process was terminated (SIGTERM).

Root Cause

The job was running the full test suite with coverage (make cover_profile) and was terminated after running for over 16 minutes (from 10:18:15 to 10:34:55). The last successful test output was:

ok  	github.com/uber/cadence/service/matching/config	1.039s	coverage: 100.0% of statements
make: *** [Makefile:696: cover_profile] Terminated
##[error]Process completed with exit code 143.

Exit code 143 typically indicates the process was killed by CI timeout limits or resource constraints.

Details

This failure is unrelated to the PR changes. The PR only modifies a single test file:

  • service/sharddistributor/client/spectatorclient/clientimpl_test.go

The test for this package would have run early in the test suite (alphabetically before service/matching/config where it terminated). The termination occurred much later while testing an unrelated package, suggesting either:

  1. A CI infrastructure timeout (job exceeded maximum allowed runtime)
  2. Resource exhaustion on the CI runner
  3. A pre-existing flaky test or performance issue in the broader test suite

This is a transient infrastructure issue, not a problem introduced by this PR.

Code Review ✅ Approved

Clean fix for a flaky test - removes the race condition by eliminating the concurrent goroutine and replacing it with deterministic assertions. No issues found.

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: PR description includes comprehensive Intent & Rationale, Implementation Analysis, Codebase Context, and Risk Assessment sections with technical details

1 rule not applicable. Show all rules by commenting gitar display:verbose.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@gazi-yestemirova gazi-yestemirova merged commit 29e2101 into cadence-workflow:master Feb 23, 2026
43 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants