Skip to content

Fix race condition in TransferManager async cache ref counting#20918

Merged
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:TransferManager-race
Mar 20, 2026
Merged

Fix race condition in TransferManager async cache ref counting#20918
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:TransferManager-race

Conversation

@andrross
Copy link
Member

The handle callback in asyncLoadIndexInput unconditionally called fileCache.decRef on both success and failure paths. On failure, the entry was already removed by fileCache.remove in the catch block. If a new entry was added with the same key between the remove and decRef, the decRef would decrement the new entry's ref count, causing premature eviction and a NullPointerException when the evicted entry's IndexInput was cloned.

Move decRef to the success-only branch of the handle callback. Also unwrap CompletionException in the handle callback and avoid re-wrapping RuntimeExceptions in the catch block to prevent double-wrapping that broke IOException extraction in getIndexInput().

Resolves #18872

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@andrross andrross requested a review from anasalkouz as a code owner March 19, 2026 00:53
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Mar 19, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Reviewer Guide 🔍

(Review updated until commit 8896810)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing decRef on close

On the success path, fileCache.decRef is called inside the handle callback. However, if isClosed becomes true after isStarted is set but before the async operation completes, the ref count increment from the initial cache lookup may never be decremented on the failure path. It should be verified that all code paths that increment the ref count have a corresponding decRef, including edge cases where the future is cancelled or the executor rejects the task.

}, executor).handle((indexInput, throwable) -> {
    if (throwable != null) {
        // On failure, the entry was already removed from the cache in the
        // catch block above, so we must not decRef here.
        Throwable cause = (throwable instanceof CompletionException && throwable.getCause() != null)
            ? throwable.getCause()
            : throwable;
        result.completeExceptionally(cause);
    } else {
        fileCache.decRef(request.getFilePath());
        result.complete(indexInput);
    }
    return null;
RuntimeException cast safety

The cast (RuntimeException) e after e instanceof RuntimeException is safe, but if createIndexInput throws an Error (which is not an Exception), it won't be caught here. More importantly, if e is already a CompletionException, it will be re-thrown as-is, which is correct. However, the unwrapping logic in the handle callback only unwraps one level of CompletionException. If the original exception was itself a CompletionException, the unwrapping may expose an unexpected cause type to callers.

        throw (e instanceof RuntimeException) ? (RuntimeException) e : new CompletionException(e);
    }
}, executor).handle((indexInput, throwable) -> {
    if (throwable != null) {
        // On failure, the entry was already removed from the cache in the
        // catch block above, so we must not decRef here.
        Throwable cause = (throwable instanceof CompletionException && throwable.getCause() != null)
            ? throwable.getCause()
            : throwable;
        result.completeExceptionally(cause);

@github-actions
Copy link
Contributor

github-actions bot commented Mar 19, 2026

PR Code Suggestions ✨

Latest suggestions up to 8896810

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Ensure consistent exception wrapping behavior

When a RuntimeException is thrown directly (not wrapped in CompletionException), the
.handle() block will receive it wrapped in a CompletionException by the
CompletableFuture framework anyway. However, the unwrapping logic in the handle
block checks throwable instanceof CompletionException, which will still work. The
real issue is that throwing a raw RuntimeException from supplyAsync causes it to be
wrapped in a CompletionException by the framework, so the unwrapping in the handle
block correctly extracts the cause. This is consistent, but it's worth ensuring the
unwrapping logic handles the case where throwable.getCause() is null for a
CompletionException (already handled). The logic is correct as-is, but consider
always wrapping in CompletionException for consistency and to avoid confusion about
which exceptions get double-wrapped.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [246]

-throw (e instanceof RuntimeException) ? (RuntimeException) e : new CompletionException(e);
+throw new CompletionException(e);
Suggestion importance[1-10]: 4

__

Why: The suggestion to always wrap in CompletionException is a valid consistency argument, but the current PR code correctly handles both cases in the handle block. The improvement is minor and the existing code is functionally correct.

Low
Handle already-completed future on success path

If result.complete(indexInput) throws an exception (e.g., the future was already
completed), fileCache.decRef will have been called but the caller may not receive
the indexInput, potentially causing a resource leak where the IndexInput is never
closed. Consider using a try-finally or checking the result of complete to handle
this edge case.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [256-259]

 } else {
-    fileCache.decRef(request.getFilePath());
-    result.complete(indexInput);
+    if (!result.complete(indexInput)) {
+        // Future was already completed (e.g., cancelled); release the ref
+        fileCache.decRef(request.getFilePath());
+    } else {
+        fileCache.decRef(request.getFilePath());
+    }
 }
Suggestion importance[1-10]: 1

__

Why: The improved_code calls fileCache.decRef in both branches of the if (!result.complete(indexInput)) check, making it functionally identical to the original code. The suggestion doesn't actually address the stated concern about resource leaks in a meaningful way.

Low

Previous suggestions

Suggestions up to commit c183d11
CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure cache ref is decremented even if completion fails

If createIndexInput succeeds but result.complete(indexInput) throws an exception
(e.g., the future is already completed), fileCache.decRef will have been called but
the caller may never receive the IndexInput, potentially leaking the resource. The
decRef and result.complete should be in a try-finally or the order should ensure the
ref is always properly managed even if complete fails.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [249-259]

 if (throwable != null) {
     // On failure, the entry was already removed from the cache in the
     // catch block above, so we must not decRef here.
     Throwable cause = (throwable instanceof CompletionException && throwable.getCause() != null)
         ? throwable.getCause()
         : throwable;
     result.completeExceptionally(cause);
 } else {
-    fileCache.decRef(request.getFilePath());
-    result.complete(indexInput);
+    try {
+        result.complete(indexInput);
+    } finally {
+        fileCache.decRef(request.getFilePath());
+    }
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies a potential resource leak if result.complete(indexInput) throws an exception before fileCache.decRef is called. Using try-finally ensures the ref is always decremented, which is a valid defensive programming practice, though CompletableFuture.complete rarely throws in practice.

Low
General
Ensure consistent exception wrapping in async handler

When a RuntimeException is thrown directly (not wrapped in CompletionException), the
.handle() block receives it as a raw RuntimeException, not a CompletionException.
This means the unwrapping logic throwable instanceof CompletionException will never
match for these cases, which is fine, but the original intent to avoid
double-wrapping should be verified. More critically, if the exception is a
RuntimeException, it will be passed directly to result.completeExceptionally(cause)
without unwrapping — which is correct — but the asymmetry between the two paths may
cause confusion and subtle bugs if the handler logic changes. Consider always
wrapping in CompletionException for consistency, or documenting this behavior
explicitly.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [246]

-throw (e instanceof RuntimeException) ? (RuntimeException) e : new CompletionException(e);
+throw new CompletionException(e);
Suggestion importance[1-10]: 3

__

Why: The suggestion to always wrap in CompletionException is a valid consistency argument, but the PR's intent is to avoid double-wrapping RuntimeException. The unwrapping logic in the handle block correctly handles both cases, so this is more of a style concern than a bug. The improved_code reverts to the old behavior which the PR explicitly changed.

Low
Suggestions up to commit 0fc91bf
CategorySuggestion                                                                                                                                    Impact
General
Ensure consistent exception wrapping in async supply

When a RuntimeException is thrown directly (not wrapped in CompletionException), the
.handle() block will receive it as a CompletionException wrapping the original
exception. The unwrapping logic in the handle block uses throwable.getCause() !=
null to detect this, but a RuntimeException thrown directly from supplyAsync will
still be wrapped in a CompletionException by the framework. This means the behavior
is consistent regardless of whether you wrap it or not, but throwing a raw
RuntimeException may cause unexpected behavior in other exception handlers that
expect a CompletionException. Consider always wrapping in CompletionException for
consistency.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [246]

-throw (e instanceof RuntimeException) ? (RuntimeException) e : new CompletionException(e);
+throw new CompletionException(e);
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid point about consistency, but the PR's intent is to avoid double-wrapping RuntimeException in CompletionException. The unwrapping logic in the handle block already handles both cases correctly, so this is a minor style concern rather than a bug.

Low
Handle exceptions from cache removal safely

If fileCache.remove(request.getFilePath()) throws an exception, the original
exception e will be lost and the ref count state may be inconsistent. Consider using
a try-finally or catching exceptions from remove to ensure the original exception is
always propagated correctly.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [244-246]

 } catch (Exception e) {
-    fileCache.remove(request.getFilePath());
-    throw (e instanceof RuntimeException) ? (RuntimeException) e : new CompletionException(e);
+    try {
+        fileCache.remove(request.getFilePath());
+    } catch (Exception removeEx) {
+        e.addSuppressed(removeEx);
+    }
+    throw new CompletionException(e);
 }
Suggestion importance[1-10]: 3

__

Why: While the suggestion to protect against exceptions from fileCache.remove() is theoretically valid, this is an unlikely edge case and the improved_code also changes the exception wrapping behavior (always wrapping in CompletionException), which contradicts the PR's intent of preserving RuntimeException types.

Low

@github-actions
Copy link
Contributor

❌ Gradle check result for 0fc91bf: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross andrross force-pushed the TransferManager-race branch from 0fc91bf to c183d11 Compare March 19, 2026 16:30
@github-actions
Copy link
Contributor

Persistent review updated to latest commit c183d11

@github-actions
Copy link
Contributor

✅ Gradle check result for c183d11: SUCCESS

@codecov
Copy link

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.31%. Comparing base (9f5d0e1) to head (c183d11).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...arch/index/store/remote/utils/TransferManager.java 66.66% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main   #20918   +/-   ##
=========================================
  Coverage     73.30%   73.31%           
- Complexity    72484    72553   +69     
=========================================
  Files          5819     5819           
  Lines        331155   331241   +86     
  Branches      47840    47862   +22     
=========================================
+ Hits         242769   242845   +76     
- Misses        68876    68950   +74     
+ Partials      19510    19446   -64     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The handle callback in asyncLoadIndexInput unconditionally called
fileCache.decRef on both success and failure paths. On failure,
the entry was already removed by fileCache.remove in the catch
block. If a new entry was added with the same key between the
remove and decRef, the decRef would decrement the new entry's
ref count, causing premature eviction and a NullPointerException
when the evicted entry's IndexInput was cloned.

Move decRef to the success-only branch of the handle callback.
Also unwrap CompletionException in the handle callback and avoid
re-wrapping RuntimeExceptions in the catch block to prevent
double-wrapping that broke IOException extraction in
getIndexInput().

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the TransferManager-race branch from c183d11 to 8896810 Compare March 20, 2026 18:24
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 8896810

@github-actions
Copy link
Contributor

❌ Gradle check result for 8896810: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❕ Gradle check result for 8896810: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@andrross andrross merged commit 527ef86 into opensearch-project:main Mar 20, 2026
36 of 39 checks passed
@andrross andrross deleted the TransferManager-race branch March 20, 2026 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for TransferManagerBlobContainerReaderTests

2 participants