Skip to content

Fix flaky testOverflowDisabledAsynchronous by throwing AlreadyClosedException#20848

Merged
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:fix-flaky-tm-test
Mar 12, 2026
Merged

Fix flaky testOverflowDisabledAsynchronous by throwing AlreadyClosedException#20848
andrross merged 1 commit intoopensearch-project:mainfrom
andrross:fix-flaky-tm-test

Conversation

@andrross
Copy link
Member

@andrross andrross commented Mar 12, 2026

When the file cache is full and a blob fetch races with cache eviction, DelayedCreationCachedIndexInput.getIndexInput() throws an IllegalStateException, so this surfaces as an IllegalStateException instead of the IOException callers expect. Replace IllegalStateException with Lucene's AlreadyClosedException (which extends IOException) in both getIndexInput() and asyncLoadIndexInput().

Resolves #18850
Resolves #16676
Resolves #18872

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…xception

When the file cache is full and a blob fetch races with cache eviction,
DelayedCreationCachedIndexInput.getIndexInput() throws an
IllegalStateException, so this surfaces as an IllegalStateException
instead of the IOException callers expect. Replace IllegalStateException
with Lucene's AlreadyClosedException (which extends IOException) in both
getIndexInput() and asyncLoadIndexInput().

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing Return

In getIndexInput(), after throwing AlreadyClosedException when isClosed is true, the code previously fell through to the rest of the method. The old code was missing a return or throw statement after the if (isClosed.get()) block, which means the new throw statement actually fixes a latent bug where execution would continue past the closed check. Verify that the original code without the throw was intentionally a no-op (which seems like a bug) and confirm the new behavior is correct.

if (isClosed.get()) {
    throw new AlreadyClosedException("Already closed");
}
Missing decRef

In getIndexInput(), when isClosed is true and AlreadyClosedException is thrown, there is no corresponding fileCache.decRef(request.getFilePath()) call, unlike in asyncLoadIndexInput() which does call decRef before returning the failed future. If a ref was acquired before calling getIndexInput(), this could cause a reference count leak.

if (isClosed.get()) {
    throw new AlreadyClosedException("Already closed");
}

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure consistent reference count cleanup on close

In getIndexInput(), when isClosed is true, fileCache.decRef is not called before
throwing AlreadyClosedException, but in asyncLoadIndexInput it is called. This
inconsistency may cause a reference count leak in the synchronous path. Consider
calling fileCache.decRef(request.getFilePath()) in getIndexInput() as well before
throwing the exception.

server/src/main/java/org/opensearch/index/store/remote/utils/TransferManager.java [235-238]

 if (isClosed.get()) {
                 fileCache.decRef(request.getFilePath());
-                return CompletableFuture.failedFuture(new AlreadyClosedException("Already closed"));
+                throw new AlreadyClosedException("Already closed");
             }
Suggestion importance[1-10]: 7

__

Why: The suggestion identifies a real inconsistency: asyncLoadIndexInput calls fileCache.decRef before returning the failed future, but getIndexInput does not call fileCache.decRef before throwing AlreadyClosedException. This could lead to a reference count leak in the synchronous path, making it a valid and meaningful fix.

Medium

@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Remote Search labels Mar 12, 2026
@github-actions
Copy link
Contributor

❌ Gradle check result for 3586e08: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@andrross
Copy link
Member Author

PR Code Suggestions ✨

Explore these optional code suggestions:
Category **Suggestion ** Impact
Possible issue
Ensure consistent reference count cleanup on close

Medium

The reference counting is unchanged here. The fix is to throw the correct exception. The race is that there is a very short period of time where a closed index input is still in the cache. If the test observed this case it would see an IllegalStateException. The other path where there is no existing entry in the cache throws an IOException as expected. The change is to make the exception consistent in both cases.

@github-actions
Copy link
Contributor

❕ Gradle check result for 3586e08: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.35%. Comparing base (db61a88) to head (3586e08).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...arch/index/store/remote/utils/TransferManager.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20848      +/-   ##
============================================
+ Coverage     73.30%   73.35%   +0.04%     
- Complexity    72252    72296      +44     
============================================
  Files          5795     5795              
  Lines        330056   330056              
  Branches      47643    47643              
============================================
+ Hits         241947   242098     +151     
+ Misses        68695    68515     -180     
- Partials      19414    19443      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross andrross merged commit 5ca112b into opensearch-project:main Mar 12, 2026
63 of 68 checks passed
@andrross andrross deleted the fix-flaky-tm-test branch March 12, 2026 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut bug Something isn't working flaky-test Random test failure that succeeds on second run lucene Search:Remote Search skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

None yet

2 participants