Fix flaky ResourceAwareTasksTests by andrross · Pull Request #20863 · opensearch-project/OpenSearch

andrross · 2026-03-13T15:04:54Z

Race condition between request completion and task resource tracking cleanup.

The sequence of events:

1. Task is cancelled via CancelTasksRequest
2. The node operation throws TaskCancelledException
3. The response is sent back to the caller, which counts down requestCompleteLatch
4. The test's main thread wakes up from requestCompleteLatch.await() and asserts resourceTasks.size() == 0
5. Meanwhile, TaskResourceTrackingService.stopTracking() (which calls resourceAwareTasks.remove()) is invoked asynchronously via a resourceTrackingCompletionListener registered in TaskManager.register()

Steps 4 and 5 race: the assertion can run before the asynchronous cleanup has removed the task. I was able to reproduce the failure locally using stress-ng and verify this fix.
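The race and the shape of the fix can be illustrated with a small standalone sketch. It does not use any OpenSearch classes: `TrackingRaceSketch`, `pollUntilPasses`, and `runScenario` are hypothetical names, the 50 ms sleep only stands in for the asynchronous completion listener, and the polling helper is a simplified stand-in for a busy assertion, not the test framework's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class TrackingRaceSketch {

    // Hypothetical polling helper: re-runs the check until it stops throwing
    // AssertionError, or rethrows the last failure once the timeout elapses.
    static void pollUntilPasses(Runnable check, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e;
                }
                sleepQuietly(5);
            }
        }
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    // Returns the number of tasks still tracked after the polling check passes.
    static int runScenario() {
        Map<Long, String> resourceTasks = new ConcurrentHashMap<>();
        resourceTasks.put(1L, "cancelled-task");
        CountDownLatch requestCompleteLatch = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            requestCompleteLatch.countDown(); // step 3: response sent to the caller
            sleepQuietly(50);                 // cleanup lags behind the response
            resourceTasks.remove(1L);         // step 5: tracking is stopped
        });
        worker.start();

        try {
            requestCompleteLatch.await();     // step 4: the test thread wakes up
            // Checking resourceTasks.size() == 0 right here could still see 1.
            pollUntilPasses(() -> {
                if (!resourceTasks.isEmpty()) {
                    throw new AssertionError("still tracking " + resourceTasks.size() + " task(s)");
                }
            }, 5000);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return resourceTasks.size();
    }

    public static void main(String[] args) {
        System.out.println("tasks still tracked: " + runScenario());
    }
}
```

Polling with a timeout keeps the test strict about the end state (the map must become empty) while no longer depending on which thread runs first.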

Related Issues

Resolves #14293

Check List

Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andrew Ross <andrross@amazon.com>

github-actions · 2026-03-13T15:06:08Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Assertion Order
The `assertBusy` call is placed before `assertNull(throwableReference.get())` and `assertNotNull(responseReference.get())`. Since `assertBusy` polls until the condition is met or times out, it could mask failures or delay detection of other assertion failures. Consider whether the ordering is intentional or if the busy assertion should come after the other assertions.

    assertBusy(() -> assertEquals(0, resourceTasks.size()));
    assertNull(throwableReference.get());
    assertNotNull(responseReference.get());
    assertEquals(1, responseReference.get().failureCount());
    assertEquals(TaskCancelledException.class, findActualException(responseReference.get().failures().get(0)).getClass());

github-actions · 2026-03-13T15:06:25Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Impact: Low
Suggestion: Ensure assertion errors propagate correctly in async check

The `assertBusy` call uses a lambda that wraps an assertion, but `assertEquals` inside a lambda passed to `assertBusy` may not propagate `AssertionError` correctly depending on the framework's implementation. Consider using a proper assertion that throws an exception on failure, or ensure the lambda throws an `AssertionError` when the condition is not met, such as using `assertTrue` or an explicit check with a thrown exception.

server/src/test/java/org/opensearch/action/admin/cluster/node/tasks/ResourceAwareTasksTests.java [415]

    -assertBusy(() -> assertEquals(0, resourceTasks.size()));
    +assertBusy(() -> assertTrue("Expected resourceTasks to be empty but size was: " + resourceTasks.size(), resourceTasks.isEmpty()));

Suggestion importance[1-10]: 2

Why: The `assertBusy` method in OpenSearch's test framework is designed to catch `AssertionError` thrown by assertions like `assertEquals`, so the concern about error propagation is not valid. Both `assertEquals` and `assertTrue` throw `AssertionError` on failure, making this change functionally equivalent with no real improvement.
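The point made in the "Why" note, that a retrying helper works the same for any check that fails by throwing `AssertionError`, can be sketched without the test framework. Everything below is an assumption made for illustration: `BusyAssertSketch` and `retryAssertion` are hypothetical names, the helper retries a fixed number of times instead of polling against a timeout, and it is not the real `assertBusy` implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

class BusyAssertSketch {

    // Hypothetical retry helper: runs the check up to maxAttempts times,
    // swallowing AssertionError between attempts and rethrowing the last
    // failure if the check never passes.
    static void retryAssertion(Runnable check, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        AssertionError last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                last = e;
            }
        }
        throw last;
    }

    // Equality-style check that fails until the simulated cleanup has finished.
    // Returns how many times the check was attempted.
    static int attemptsUntilEmpty(int initialSize) {
        AtomicInteger size = new AtomicInteger(initialSize);
        AtomicInteger attempts = new AtomicInteger();
        retryAssertion(() -> {
            attempts.incrementAndGet();
            int current = size.get();
            if (current != 0) {
                size.decrementAndGet(); // simulate cleanup progressing between polls
                throw new AssertionError("expected 0 but was " + current);
            }
        }, 10);
        return attempts.get();
    }

    // Boolean-style check that never passes: the last AssertionError reaches the caller.
    static String failureMessageWhenNeverEmpty() {
        try {
            retryAssertion(() -> {
                throw new AssertionError("expected resourceTasks to be empty");
            }, 3);
            return null;
        } catch (AssertionError e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("attempts: " + attemptsUntilEmpty(2));
        System.out.println("failure: " + failureMessageWhenNeverEmpty());
    }
}
```

In this sketch the helper only looks at the exception type, so an equality-style and a boolean-style check behave identically; the two differ only in the failure message they produce.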

github-actions · 2026-03-13T16:34:25Z

✅ Gradle check result for b2cff2c: SUCCESS

codecov · 2026-03-13T16:35:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.24%. Comparing base (564cbee) to head (b2cff2c).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #20863      +/-   ##
============================================
- Coverage     73.31%   73.24%   -0.07%     
- Complexity    72247    72276      +29     
============================================
  Files          5796     5796              
  Lines        330224   330256      +32     
  Branches      47661    47663       +2     
============================================
- Hits         242090   241898     -192     
- Misses        68693    68965     +272     
+ Partials      19441    19393      -48

andrross requested a review from a team as a code owner March 13, 2026 15:04

andrross added the skip-changelog label Mar 13, 2026

github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, Cluster Manager, and flaky-test (Random test failure that succeeds on second run) labels Mar 13, 2026

github-project-automation bot added this to Cluster Manager Project Board Mar 13, 2026

cwperks approved these changes Mar 13, 2026

github-project-automation bot moved this to 👀 In review in Cluster Manager Project Board Mar 13, 2026

mch2 approved these changes Mar 13, 2026

andrross merged commit a91ae9d into opensearch-project:main Mar 13, 2026
42 of 48 checks passed

github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Mar 13, 2026

andrross deleted the flaky-ResourceAwareTasksTests branch March 13, 2026 18:06
