hugehoo
Contributor

@hugehoo hugehoo commented Aug 18, 2025

Fixes: #8435
RELEASE NOTES: N/A

Root cause of the issue:

  • I think there is a race condition in the channel communication between the xDS resolver and the test infrastructure:
    • Insufficient buffer size: the original channels (stateCh and errCh) had a buffer size of only 1.
    • Blocking sends: when the buffer is full, the resolver blocks while trying to send the next update.
    • Test deadlock: the test infrastructure may be waiting for a specific update while the resolver is blocked trying to send a different one, which results in a deadlock.
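
To illustrate the blocking-send point above, here is a minimal sketch, not taken from the test code. The helper name trySend is hypothetical and only exists for this example. It shows that once a channel with a buffer of 1 holds a value and nobody is receiving, a second send cannot proceed; a plain `ch <- v` would block at exactly the point where trySend reports false.

```go
package main

import "fmt"

// trySend is a hypothetical helper for this illustration only.
// It attempts a non-blocking send and reports whether the value
// was accepted by the channel.
func trySend[T any](ch chan<- T, v T) bool {
	select {
	case ch <- v:
		return true
	default:
		// A plain `ch <- v` would block here until a receiver shows up.
		return false
	}
}

func main() {
	// Same buffer size as the original stateCh and errCh.
	ch := make(chan int, 1)

	fmt.Println(trySend(ch, 1)) // the buffer has room for one value
	fmt.Println(trySend(ch, 2)) // the buffer is full and nothing is receiving
}
```

If the only reader is waiting for an update that has not been produced yet, the sender stays stuck behind the first value, which is the deadlock scenario described in the list.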

Changes

  1. Increased buffer size (1 → 10):
  stateCh := make(chan resolver.State, 10)
  errCh := make(chan error, 10)
  2. Non-blocking send pattern:
  select {
  case stateCh <- s: // The resolver tries to send the update.
  default: // The channel is full: drop the oldest message and retry.
      select {
      case <-stateCh:
          stateCh <- s
      default:
      }
  }
  • Dropping old messages keeps the resolver from blocking and retains only the most recent updates.
  3. Cleanup with draining goroutines:
  go func() {
      for range stateCh { } // Drain any remaining messages.
  }()
  • This ensures the resolver never blocks on sends and prevents goroutine leaks during test cleanup.
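
For reference, the drop-oldest send described in the changes can be sketched as a standalone helper. This is only an illustration under assumptions: sendLatest is a hypothetical name, it is not the code in this PR, and it assumes a buffered channel with capacity of at least 1. Unlike the snippet above, the retry after dropping is itself non-blocking, so the helper cannot block even if another sender refills the buffer in between.

```go
package main

import "fmt"

// sendLatest is a hypothetical helper for this illustration only.
// It never blocks: if the buffer is full it discards the oldest
// buffered value and tries once more to enqueue v.
func sendLatest[T any](ch chan T, v T) {
	// Fast path: there is room in the buffer.
	select {
	case ch <- v:
		return
	default:
	}
	// The buffer is full: drop the oldest value, if one is still there.
	select {
	case <-ch:
	default:
	}
	// Retry once without blocking. If a concurrent sender took the
	// freed slot, v is dropped instead of stalling the caller.
	select {
	case ch <- v:
	default:
	}
}

func main() {
	ch := make(chan int, 2)
	for i := 1; i <= 5; i++ {
		sendLatest(ch, i)
	}
	close(ch)
	for v := range ch {
		fmt.Println(v) // prints 4, then 5
	}
}
```

The trade-off is that a reader waiting for one specific intermediate update may never see it, because older values are discarded in favor of newer ones.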


codecov bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.88%. Comparing base (9ac0ec8) to head (9352248).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8521      +/-   ##
==========================================
+ Coverage   81.82%   81.88%   +0.06%     
==========================================
  Files         413      413              
  Lines       40518    40518              
==========================================
+ Hits        33153    33179      +26     
+ Misses       5989     5974      -15     
+ Partials     1376     1365      -11     

see 32 files with indirect coverage changes

@hugehoo hugehoo force-pushed the flaky-test-nackedWithoutCache branch from ef9f9cb to 9352248 August 18, 2025 16:29
@hugehoo hugehoo marked this pull request as ready for review August 18, 2025 16:37
@arjan-bal
Contributor

Hi @hugehoo, I have a few questions/requests to help me review this fix:

  1. Can you describe the root cause in the linked issue?
  2. Can you also explain the fix in the PR description?
  3. Were you able to repro the flakiness? If yes, can you mention the go test command that was used?

@arjan-bal arjan-bal self-requested a review August 21, 2025 06:56
@arjan-bal arjan-bal added this to the 1.76 Release milestone Aug 21, 2025
@hugehoo
Contributor Author

hugehoo commented Aug 23, 2025

@arjan-bal I updated the PR description as you requested for items 1 and 2, but I haven't been able to reproduce the flakiness yet.

@arjan-bal arjan-bal assigned arjan-bal and unassigned hugehoo Aug 26, 2025
Development

Successfully merging this pull request may close these issues.

Flaky test: Test/ResolverBadServiceUpdate_NACKedWithoutCache
2 participants