Skip to content

Fix flaky ResourceAwareTasksTests#20863

Merged
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:flaky-ResourceAwareTasksTests
Mar 13, 2026
Merged

Fix flaky ResourceAwareTasksTests#20863
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:flaky-ResourceAwareTasksTests

Conversation

@andrross
Copy link
Member

Race condition between request completion and task resource tracking cleanup.

The sequence of events:

  1. Task is cancelled via CancelTasksRequest
  2. The node operation throws TaskCancelledException
  3. The response is sent back to the caller, which counts down requestCompleteLatch
  4. The test's main thread wakes up from requestCompleteLatch.await() and asserts resourceTasks.size() == 0
  5. Meanwhile, TaskResourceTrackingService.stopTracking() (which calls resourceAwareTasks.remove()) is invoked asynchronously via a resourceTrackingCompletionListener registered in TaskManager.register()

Steps 4 and 5 race. I was able to reproduce the failure locally using stess-ng and verify this fix.

Related Issues

Resolves #14293

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Race condition between request completion and task resource tracking
cleanup.

The sequence of events:
1. Task is cancelled via `CancelTasksRequest`
2. The node operation throws `TaskCancelledException`
3. The response is sent back to the caller, which counts down
   `requestCompleteLatch`
4. The test's main thread wakes up from `requestCompleteLatch.await()`
   and asserts `resourceTasks.size() == 0`
5. Meanwhile, `TaskResourceTrackingService.stopTracking()` (which
   calls `resourceAwareTasks.remove()`) is invoked asynchronously
   via a `resourceTrackingCompletionListener` registered in
   `TaskManager.register()`

Steps 4 and 5 race. I was able to reproduce the failure locally using
`stess-ng` and verify this fix.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross requested a review from a team as a code owner March 13, 2026 15:04
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut Cluster Manager flaky-test Random test failure that succeeds on second run labels Mar 13, 2026
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Assertion Order

The assertBusy call is placed before assertNull(throwableReference.get()) and assertNotNull(responseReference.get()). Since assertBusy polls until the condition is met or times out, it could mask failures or delay detection of other assertion failures. Consider whether the ordering is intentional or if the busy assertion should come after the other assertions.

assertBusy(() -> assertEquals(0, resourceTasks.size()));
assertNull(throwableReference.get());
assertNotNull(responseReference.get());
assertEquals(1, responseReference.get().failureCount());
assertEquals(TaskCancelledException.class, findActualException(responseReference.get().failures().get(0)).getClass());

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure assertion errors propagate correctly in async check

The assertBusy call uses a lambda that wraps an assertion, but assertEquals inside a
lambda passed to assertBusy may not propagate AssertionError correctly depending on
the framework's implementation. Consider using a proper assertion that throws an
exception on failure, or ensure the lambda throws an AssertionError when the
condition is not met, such as using assertTrue or an explicit check with a thrown
exception.

server/src/test/java/org/opensearch/action/admin/cluster/node/tasks/ResourceAwareTasksTests.java [415]

-assertBusy(() -> assertEquals(0, resourceTasks.size()));
+assertBusy(() -> assertTrue("Expected resourceTasks to be empty but size was: " + resourceTasks.size(), resourceTasks.isEmpty()));
Suggestion importance[1-10]: 2

__

Why: The assertBusy method in OpenSearch's test framework is designed to catch AssertionError thrown by assertions like assertEquals, so the concern about error propagation is not valid. Both assertEquals and assertTrue throw AssertionError on failure, making this change functionally equivalent with no real improvement.

Low

@github-actions
Copy link
Contributor

✅ Gradle check result for b2cff2c: SUCCESS

@codecov
Copy link

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.24%. Comparing base (564cbee) to head (b2cff2c).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20863      +/-   ##
============================================
- Coverage     73.31%   73.24%   -0.07%     
- Complexity    72247    72276      +29     
============================================
  Files          5796     5796              
  Lines        330224   330256      +32     
  Branches      47661    47663       +2     
============================================
- Hits         242090   241898     -192     
- Misses        68693    68965     +272     
+ Partials      19441    19393      -48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross andrross merged commit a91ae9d into opensearch-project:main Mar 13, 2026
42 of 48 checks passed
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Cluster Manager Project Board Mar 13, 2026
@andrross andrross deleted the flaky-ResourceAwareTasksTests branch March 13, 2026 18:06
shayush622 pushed a commit to shayush622/OpenSearch that referenced this pull request Mar 16, 2026
Race condition between request completion and task resource tracking
cleanup.

The sequence of events:
1. Task is cancelled via `CancelTasksRequest`
2. The node operation throws `TaskCancelledException`
3. The response is sent back to the caller, which counts down
   `requestCompleteLatch`
4. The test's main thread wakes up from `requestCompleteLatch.await()`
   and asserts `resourceTasks.size() == 0`
5. Meanwhile, `TaskResourceTrackingService.stopTracking()` (which
   calls `resourceAwareTasks.remove()`) is invoked asynchronously
   via a `resourceTrackingCompletionListener` registered in
   `TaskManager.register()`

Steps 4 and 5 race. I was able to reproduce the failure locally using
`stess-ng` and verify this fix.

Signed-off-by: Andrew Ross <andrross@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut Cluster Manager flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ResourceAwareTasksTests

3 participants