Skip to content

pkg/monitor,pkg/util/changefeed: fix AllIteratorsConsumed race in changefeed tests#4699

Open
ventifus wants to merge 1 commit intomasterfrom
ventifus/ARO-25354/changefeed-race
Open

pkg/monitor,pkg/util/changefeed: fix AllIteratorsConsumed race in changefeed tests#4699
ventifus wants to merge 1 commit intomasterfrom
ventifus/ARO-25354/changefeed-race

Conversation

@ventifus
Copy link
Collaborator

@ventifus ventifus commented Mar 19, 2026

Which issue this PR addresses:

Fixes https://redhat.atlassian.net/browse/ARO-25354

What this PR does / why we need it:

AllIteratorsConsumed() is not a sound cache-settled barrier: the fake iterator sets i.done = true inside Next() when returning the final page, before RunChangefeed has called OnDoc on those documents. Tests polling AllIteratorsConsumed can unblock and assert cache state while the OnDoc loop for that final batch is still in progress.

This PR replaces AllIteratorsConsumed barriers in worker_test.go and subscriptioncache_test.go with GetLastProcessed() polls. GetLastProcessed() is advanced by OnAllPendingProcessed(), which RunChangefeed only fires after all OnDoc calls for a sweep have completed — making it a sound post-sweep barrier. WaitForInitialPopulation() is already sound and is left unchanged.

Test plan for issue:

go test -count=50 -race ./pkg/monitor/... ./pkg/util/changefeed/... passes without failure.

Is there any documentation that needs to be updated for this PR?

No — CI-only flake fix, no user-facing or operational changes.

How do you know this will function as expected in production?

This change is test-only. No production code is modified. The barrier being replaced (AllIteratorsConsumed) is a test helper that exists only in fake CosmosDB clients; GetLastProcessed() already exists on the SubscriptionsCache interface in production code.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR stabilizes flaky changefeed-related tests by replacing AllIteratorsConsumed()-based synchronization with barriers derived from GetLastProcessed(), which is advanced only after OnAllPendingProcessed() completes a sweep.

Changes:

  • Update TestSubscriptionChangefeed to stop polling AllIteratorsConsumed() and instead wait for cache.GetLastProcessed() to advance.
  • Update TestChangefeedOperations to wait on monitor/cache state rather than iterator consumption, reducing race sensitivity.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
pkg/util/changefeed/subscriptioncache_test.go Replaces iterator-consumption waits with GetLastProcessed() advancement checks to ensure cache population has completed.
pkg/monitor/worker_test.go Replaces iterator-consumption waits with assertions on observed docs and GetLastProcessed() advancement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI review requested due to automatic review settings March 19, 2026 20:03
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus ventifus force-pushed the ventifus/ARO-25354/changefeed-race branch from 0e18730 to a6b8e99 Compare March 19, 2026 20:53
Copilot AI review requested due to automatic review settings March 19, 2026 20:53
@ventifus ventifus force-pushed the ventifus/ARO-25354/changefeed-race branch from a6b8e99 to d353cee Compare March 19, 2026 20:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ventifus ventifus force-pushed the ventifus/ARO-25354/changefeed-race branch from d353cee to 90836ca Compare March 19, 2026 21:19
@ventifus
Copy link
Collaborator Author

ventifus commented Mar 19, 2026

Pre-existing races found during make test-go-race

This PR adds make test-go-race (go test -count=50 -race ./...) as a convenience target for codebase-wide race detection. Running it surfaces four pre-existing races in the fake CosmosDB client infrastructure — none introduced by this PR; all present on master.

The fix this PR delivers (replacing AllIteratorsConsumed barriers with GetLastProcessed() two-advancement polls) was verified separately with:

go test -count=100 -run 'TestSubscriptionChangefeed|TestChangefeedOperations' \
    ./pkg/monitor/... ./pkg/util/changefeed/...

100 iterations, all pass. The -race flag is omitted from this command because it surfaces the unrelated pre-existing races below, causing spurious failures unrelated to the barrier fix.


Pre-existing races (unrelated to this PR)

1. zz_generated_*_fake.go — iterator/updateChangeFeeds locking

updateChangeFeeds() appends to currentIterator.openShiftClusterDocuments (and the subscription equivalent) while holding c.lock (write lock). Concurrently, Next() reads those same fields without holding c.lock. The RunChangefeed goroutine calls Next() while the test goroutine calls DB.Update()apply()updateChangeFeeds().

2. zz_generated_*_fake.goChangeFeed() mutates under RLock

ChangeFeed() holds c.lock.RLock() but then writes c.changeFeedIterators = append(c.changeFeedIterators, newIter). Two concurrent ChangeFeed() calls both hold RLock and both append, racing on the slice header. Fix: change to Lock()/Unlock().

3. worker.gochangefeedMetrics() reads mon.docs without mon.mu

changefeedMetrics() calls len(mon.docs) without holding mon.mu, while upsertDoc()/deleteDoc() write to mon.docs under mon.mu.Lock(). Fix: wrap with mon.mu.RLock()/RUnlock().

4. test_helpers.gofakeMonitor.Monitor() counter increment

fakeMonitor.Monitor() does *fm.clusterCounter++ unsynchronized. Multiple worker goroutines launched by execute() can call Monitor() concurrently on instances sharing the same *int pointer. Fix: use atomic.Int64.

Replace AllIteratorsConsumed()-based synchronization in changefeed
tests with barriers derived from GetLastProcessed(), which advances
only after OnAllPendingProcessed() completes a full sweep. This is
the correct settled-cache barrier.

Key changes:
- pkg/util/changefeed/subscriptioncache_test.go: wait for two
  GetLastProcessed() advancements after each mutation group to
  guarantee a complete post-mutation sweep has run
- pkg/monitor/worker_test.go: replace vacuous len(docs) barrier with
  lastClusterChangefeed timestamp barrier; same two-advancement
  pattern for both cluster and subscription caches
- pkg/database/cosmosdb: add per-iterator sync.Mutex to
  fakeSubscriptionDocumentIterator and fakeOpenShiftClusterDocumentIterator;
  fix ChangeFeed() to hold a write lock (not read lock) while
  appending to changeFeedIterators
- pkg/database/cosmosdb: delete AllIteratorsConsumed() extension
  methods (no callers remain)
- pkg/monitor/test_helpers.go: remove unused fake client fields from
  TestEnvironment
- Makefile: add -race flag to unit-test-go target

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ventifus ventifus force-pushed the ventifus/ARO-25354/changefeed-race branch from 90836ca to 7299caf Compare March 19, 2026 22:36
Copilot AI review requested due to automatic review settings March 19, 2026 22:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@cadenmarchese cadenmarchese added the chainsaw Pull requests or issues owned by Team Chainsaw label Mar 20, 2026
@github-actions
Copy link

Please rebase pull request.

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

chainsaw Pull requests or issues owned by Team Chainsaw needs-rebase branch needs a rebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants