Fix flaky ResourceAwareTasksTests #20863
Race condition between request completion and task resource tracking cleanup. The sequence of events:
1. Task is cancelled via `CancelTasksRequest`
2. The node operation throws `TaskCancelledException`
3. The response is sent back to the caller, which counts down `requestCompleteLatch`
4. The test's main thread wakes up from `requestCompleteLatch.await()` and asserts `resourceTasks.size() == 0`
5. Meanwhile, `TaskResourceTrackingService.stopTracking()` (which calls `resourceAwareTasks.remove()`) is invoked asynchronously via a `resourceTrackingCompletionListener` registered in `TaskManager.register()`

Steps 4 and 5 race. I was able to reproduce the failure locally using `stress-ng` and verify this fix.

Signed-off-by: Andrew Ross <andrross@amazon.com>
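The ordering problem in steps 3 to 5 can be reproduced outside OpenSearch with a latch and a delayed cleanup. The following is a minimal sketch, not the code from this PR: `LatchCleanupRace`, `runScenario`, `awaitEmpty`, and the `tracked` set are hypothetical stand-ins for the test and for `resourceAwareTasks`. Polling until the set drains is shown as one common way to remove this kind of race; the description above does not say which change the PR actually makes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class LatchCleanupRace {

    /** Polls until the set is empty; returns false if the timeout elapses or the thread is interrupted. */
    static boolean awaitEmpty(Set<?> set, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!set.isEmpty()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /**
     * Simulates the sequence: the "response" counts down the latch first,
     * and the tracking cleanup runs cleanupDelayMillis later on another thread.
     */
    static boolean runScenario(long cleanupDelayMillis, long timeoutMillis) {
        Set<Long> tracked = ConcurrentHashMap.newKeySet(); // hypothetical stand-in for resourceAwareTasks
        tracked.add(1L);
        CountDownLatch requestCompleteLatch = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.execute(() -> {
                requestCompleteLatch.countDown(); // step 3: response reaches the caller
                try {
                    Thread.sleep(cleanupDelayMillis); // cleanup is scheduled later
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                tracked.remove(1L); // step 5: asynchronous stopTracking()
            });
            requestCompleteLatch.await(); // step 4: test thread wakes up
            // Asserting tracked.isEmpty() at this point would race with step 5.
            return awaitEmpty(tracked, timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("cleanup observed: " + runScenario(50, 5_000));
    }
}
```

The latch only signals that the response arrived; it says nothing about when the cleanup listener runs, so any assertion on the tracked set has to wait for the cleanup itself.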
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff              @@
##               main   #20863      +/-   ##
============================================
- Coverage     73.31%   73.24%   -0.07%
- Complexity    72247    72276     +29
============================================
  Files          5796     5796
  Lines        330224   330256     +32
  Branches      47661    47663      +2
============================================
- Hits         242090   241898    -192
- Misses        68693    68965    +272
+ Partials      19441    19393     -48
Related Issues
Resolves #14293
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.