Make MutableSearchResponse ref-counted to prevent use-after-close in async search #134359

drempapis · 2025-09-09T10:52:02Z

The current implementation of MutableSearchResponse.toAsyncSearchResponse() assumes that if finalResponse != null, the SearchResponse can always be retained. It enforces this by calling mustIncRef().

However, under concurrent execution, this assumption breaks down. One thread may be serializing a response while another (via AsyncSearchTask.close()) decrements the reference count. If decRef() drops the count to 0, the object is released. A later mustIncRef() will then fail, leading to assertion errors or use-after-release. This creates a race condition that can cause sporadic failures when fetching async search results, especially in the narrow window between task closure and document deletion.

This PR intends to improve the safety of MutableSearchResponse under concurrent access. The goals are:

Allow a thread that is already building a response to complete and return it.
Prevent subsequent calls from accessing the resource once it has been closed/released.

To achieve this, MutableSearchResponse is now ref-counted. Any thread building a response must first successfully acquire a reference. If the container (msr) has already been released, the attempt fails with a GONE status instead of tripping assertions. Finally, AsyncSearchTask#getResponse holds and releases the msr explicitly to prevent use-after-close races.
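A minimal sketch of the acquire-or-fail pattern described above, written against plain JDK classes rather than the Elasticsearch ref-counting types. The names TryIncRefSketch, RefCountedBox and read are hypothetical and exist only for illustration; the real change lives in MutableSearchResponse and AsyncSearchTask.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TryIncRefSketch {

    /** Hypothetical stand-in for a ref-counted response container. */
    static final class RefCountedBox {
        // Starts at 1: the owning task holds the initial reference.
        private final AtomicInteger refs = new AtomicInteger(1);

        /** Acquires a reference unless the count has already reached zero. */
        boolean tryIncRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    return false; // already released, must not be revived
                }
                if (refs.compareAndSet(current, current + 1)) {
                    return true;
                }
            }
        }

        /** Releases a reference; returns true if this call dropped the last one. */
        boolean decRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    throw new IllegalStateException("invalid decRef call: already closed");
                }
                if (refs.compareAndSet(current, current - 1)) {
                    return current == 1;
                }
            }
        }
    }

    /** Hypothetical read path: reports GONE instead of failing on a released container. */
    static String read(RefCountedBox box) {
        if (box.tryIncRef() == false) {
            return "GONE";
        }
        try {
            return "OK"; // the response would be built here while the reference is held
        } finally {
            box.decRef();
        }
    }

    public static void main(String[] args) {
        RefCountedBox box = new RefCountedBox();
        System.out.println(read(box)); // reader acquires before close
        box.decRef();                  // the task closes and drops the last reference
        System.out.println(read(box)); // reader arrives after close
    }
}
```

The point of the sketch is that a failed acquisition is an ordinary return value the caller can handle, whereas an unconditional increment on a released object can only fail loudly.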

elasticsearchmachine · 2025-09-09T10:52:25Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

elasticsearchmachine · 2025-09-09T10:52:25Z

Hi @drempapis, I've created a changelog YAML for you.

…/elasticsearch into fix/refcounting-async-response

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

andreidan

Thanks for working on this Dimi. It'll be great to have this problem fixed soon !

I'm not super sure the proposed solution fixes the problem just yet, and also left a question about the behaviour (I think we should strive to return the results to the user, rather than a different exception?).

Apologies if I misunderstood something essential here.

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

andreidan · 2025-09-22T16:04:00Z

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

        checkCancellation();
        AsyncSearchResponse asyncSearchResponse;
+        if (mutableSearchResponse.tryIncRef() == false) {
+            throw new ElasticsearchStatusException("async-search result, no longer available", RestStatus.GONE);


IIUC the request will still fail, just a different message? (i.e. results in Discover will still not be displayed right?)

In the race condition you describe we'd have the results on disk. Should we not return them from disk instead of failing here?

Or with the new response, do clients know to retry automatically?

Thank you, @andreidan, for reviewing this.

I'm not super sure the proposed solution fixes the problem just yet

The problem is a race: one thread (A) is building a response from MutableSearchResponse while another (B) closes the task and drops the last ref, so the reader (A) hits "already closed, can't increment ref count" when calling getResponse().

The idea here is to allow a thread that is already building a response to complete and return, while preventing subsequent calls from accessing the resource once it has been closed/released.

IIUC the request will still fail, just a different message?

With the code as written, the focus is on mitigating the race. If tryIncRef() fails, we throw GONE instead of hitting the assertion. The request still fails, but it does so in a controlled, user-facing way rather than with an assertion error.

In the race condition you describe, we'd have the results on disk. Should we not return them from disk instead of failing here?

That's a good point. To guarantee safety, if tryIncRef() fails, we stop touching the in-memory finalResponse.

We may add a fallback. If the task is completed and the container is already closed, we should load and return the stored async-search result from disk, and only return GONE when the result is not found on disk.

tryIncRef succeeds -> build from in-memory (fast path) -> 200.

tryIncRef fails and stored doc exists -> load from .async-search -> 200.

tryIncRef fails and stored doc missing (expired via keep_alive or deleted) -> GONE.
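The three outcomes above can be sketched as a small decision function. This is only an illustration under stated assumptions: FallbackSketch, Outcome and resolve are hypothetical names, the stored document is modelled as an Optional<String>, and 200/410 are the HTTP codes for OK and GONE. It does not reflect how AsyncSearchTask actually reads from the .async-search index.

```java
import java.util.Optional;

public class FallbackSketch {

    /** Hypothetical result holder: an HTTP status code plus an optional body. */
    static final class Outcome {
        final int status;
        final String body;

        Outcome(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /**
     * Chooses where the response comes from.
     *
     * @param acquired whether tryIncRef() on the in-memory container succeeded
     * @param inMemory the response built from memory (only used when acquired)
     * @param stored   the persisted result, empty if expired or deleted
     */
    static Outcome resolve(boolean acquired, String inMemory, Optional<String> stored) {
        if (acquired) {
            return new Outcome(200, inMemory);       // fast path
        }
        if (stored.isPresent()) {
            return new Outcome(200, stored.get());   // fall back to the stored document
        }
        return new Outcome(410, null);               // nothing left to return: GONE
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, "mem", Optional.of("disk")).status);
        System.out.println(resolve(false, null, Optional.of("disk")).body);
        System.out.println(resolve(false, null, Optional.empty()).status);
    }
}
```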

andreidan · 2025-09-23T14:31:33Z

Thanks for the explanations @drempapis ! ❤️

I think I understand the proposed solution a bit more.

Fixing this without being able to reproduce it or fully understanding the problem gives me a bit of pause. Namely, I'd like to understand why an async search task would be closed whilst still receiving results.
Can we please enable this logging we added in some test suites and see who's calling close when the issue reproduces? (perhaps in the test suites where we previously saw this failure, albeit sporadically)

We can also create a test that tries to enable the race condition more often (widen that window of opportunity for the bug to surface) if the theoretical reproduction scenario is fully understood ?

If all of the above fails, could we simulate the bug with breakpoints?

The goal here would be for the search operations involved in this race condition to be successful (as opposed to changing the type of error we return).

drempapis · 2025-09-29T13:43:42Z

Thank you @andreidan for iterating on this.

I have the same concern. Without being able to reliably reproduce the scenario in a controlled environment, all we can do is hypothesize. In particular, the suggestion that this issue arises from a race between mustIncRef() and close() makes sense in theory, but without a reproducer or stack traces that prove the two threads are actually colliding, it remains only an assumption.

I’ve enabled logging for the async-search tests under
x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search on main. Since this is test-only instrumentation (no changes to product code) and fully gated by log levels, I propose backporting it to the active maintenance branches where these tests are present. The goal is to increase observability in CI and gain insights when the issue reproduces.

I’ve attempted several approaches to reproduce the error locally, but so far without success:

Running the async-search internal cluster tests in a loop under heavy CPU load, e.g.

Console 1
stress-ng --cpu 16

Console 2
for i in {1..2500}
do
  echo "Run #$i"
  ./gradlew :x-pack:plugin:async-search:internalClusterTest
done

Implemented an integration test designed to create race conditions by spawning concurrent threads issuing GET and DELETE requests against async-search tasks.
Tried to simulate the problem by suspending execution with breakpoints. However, placing a breakpoint in the critical path ends up parking the search worker while it still holds the same monitor that the DELETE - AsyncSearchTask.close() path needs. In this setup, the cancel thread blocks waiting to acquire the monitor.

The only way I’ve found to reproduce the issue is via a debugger-induced race:

Set a line breakpoint in MutableSearchResponse#toAsyncSearchResponse at the call to searchResponse.mustIncRef()
Trigger an async search so the breakpoint is hit and the search worker is paused
In the debugger, Evaluate task.close()

This is the only way I managed to get the assertion "invalid decRef call: already closed".

Although I was not able to reliably reproduce the original error, I believe this change is still an improvement: Making MutableSearchResponse ref-counted decouples its lifecycle from AsyncSearchTask so that cleanup happens only when the last active call releases it. Previously, the task could close its response while other threads were still building or serializing results, potentially leading to races, assertions, or use-after-close behavior.

With reference counting, any thread that needs to build a response must successfully acquire a reference; as long as one is held, the response stays alive. Once all holders release it, closeInternal() runs and safely tears down resources. This provides predictable cleanup, avoids subtle concurrency bugs, and aligns the resource lifecycle with actual usage rather than with task closure alone.
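To make the lifecycle argument concrete, here is a small sketch in which the task's close and a reader's release happen in the problematic order. LifecycleSketch and RefCountedResponse are hypothetical names, and closeInternal here only counts invocations instead of freeing real resources.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LifecycleSketch {

    /** Hypothetical ref-counted resource whose teardown runs exactly once. */
    static final class RefCountedResponse {
        private final AtomicInteger refs = new AtomicInteger(1); // owner's reference
        private final AtomicInteger closeCalls = new AtomicInteger();

        boolean tryIncRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    return false;
                }
                if (refs.compareAndSet(current, current + 1)) {
                    return true;
                }
            }
        }

        void decRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    throw new IllegalStateException("invalid decRef call: already closed");
                }
                if (refs.compareAndSet(current, current - 1)) {
                    if (current == 1) {
                        closeInternal(); // only the last holder tears down
                    }
                    return;
                }
            }
        }

        private void closeInternal() {
            closeCalls.incrementAndGet();
        }

        boolean isOpen() {
            return refs.get() > 0;
        }

        int closeCalls() {
            return closeCalls.get();
        }
    }

    public static void main(String[] args) {
        RefCountedResponse response = new RefCountedResponse();
        boolean acquired = response.tryIncRef(); // a reader starts building a response
        response.decRef();                       // the task closes in the meantime
        System.out.println(acquired + " " + response.isOpen() + " " + response.closeCalls());
        response.decRef();                       // the reader finishes and releases
        System.out.println(response.isOpen() + " " + response.closeCalls());
    }
}
```

In this ordering the resource outlives the task's close for as long as the reader holds its reference, and teardown runs once, on the reader's release.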

andreidan · 2025-10-02T10:23:26Z

Thanks @drempapis.

In the debugger, Evaluate task.close()

I think we should discuss search oriented scenarios (we're still not sure who closes the task and why). My suggestion was around adding breakpoints to delay a part of the search flow to trigger the race condition.

I looked at when we close the SearchTask and it happens in a few cases:

search operation is completed within the timeout https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L167
we received the final underlying search response https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L250
a failure when trying to index the response or create the underlying store https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L144
(note that onFatalFailure eventually ends up in closeTaskAndFail https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L235 - to check if this is called on timeout too, I suspect it is)

The scenario we suspect is that any of the above 3 scenarios runs concurrently with a request to fetch the async search task response https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L352 which ends up in the underlying MutableSearchResponse#toAsyncSearchResponse (and that mustIncRef call that triggers the use after close https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java#L249 )

Looking at when we request to fetch the search task response (the actual, potentially user induced, action that races with any of the cases when the search task is closed):

when we add a completion listener to the task https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L210 (this is called by GET _async_search/status/{id} API) - also internally part of the same call https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L253
when we submit the async search task if the underlying search is still running https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L126 and to return the first batch of results

elasticsearch/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java

Line 130 in 0592415

submitListenerWithHeaders.onResponse(searchResponse);
and finally, when the underlying search operation completes and notifies the async search task via the search progress listener https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L544

Can we create a test where any of the 1-3 scenarios in the first category races with any of the 1-3 scenarios in the second category?
For example, could we concurrently and repeatedly call GET _async_search/status/{id} (point 1 in the second category) whilst we run the {id} async search (to race with point 2 in the first category, i.e. the async search operation just finishing successfully)?
Or could we concurrently and repeatedly call GET _async_search/status/{id} (point 1 in the second category) in parallel with an async search operation over a large corpus but with a short timeout/completion time? (point 3 in the first category)
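The shape of such a racing test can be sketched with plain JDK concurrency primitives. This is a generic harness under stated assumptions, not the integration test that was later added to the PR: RaceHarnessSketch, Box and runRaces are hypothetical names, the "reader" stands in for a status/get call, and the "closer" stands in for any of the task-close paths. A latch releases both threads at once to narrow the gap between them.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class RaceHarnessSketch {

    /** Hypothetical ref-counted container, starting with the owner's reference. */
    static final class Box {
        private final AtomicInteger refs = new AtomicInteger(1);

        boolean tryIncRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    return false;
                }
                if (refs.compareAndSet(current, current + 1)) {
                    return true;
                }
            }
        }

        void decRef() {
            while (true) {
                int current = refs.get();
                if (current == 0) {
                    throw new IllegalStateException("invalid decRef call: already closed");
                }
                if (refs.compareAndSet(current, current - 1)) {
                    return;
                }
            }
        }
    }

    /** Races one reader against one closer per round; returns how many rounds threw. */
    static int runRaces(int rounds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int failures = 0;
        try {
            for (int i = 0; i < rounds; i++) {
                Box box = new Box();
                CountDownLatch start = new CountDownLatch(1);
                Future<String> reader = pool.submit(() -> {
                    start.await();
                    if (box.tryIncRef() == false) {
                        return "GONE";
                    }
                    try {
                        return "OK";
                    } finally {
                        box.decRef();
                    }
                });
                Future<Object> closer = pool.submit(() -> {
                    start.await();
                    box.decRef(); // the owner drops its reference
                    return null;
                });
                start.countDown(); // release both threads together
                try {
                    reader.get();
                    closer.get();
                } catch (ExecutionException e) {
                    failures++;
                }
            }
        } finally {
            pool.shutdownNow();
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failed rounds: " + runRaces(2000));
    }
}
```

With acquire-or-fail semantics every interleaving ends in either OK or GONE, so no round should throw; swapping the reader's tryIncRef for an unconditional increment is what would let the "already closed" failure surface.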

…/elasticsearch into fix/refcounting-async-response

andreidan

Thanks for iterating on this Dimi.

I think this is almost ready - just a few rather minor comments 🚀

...c/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchConcurrentStatusIT.java

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

andreidan

Thanks for iterating on this Dimi. LGTM

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

elasticsearchmachine · 2025-11-04T15:19:42Z

💔 Backport failed

Status	Branch	Result
❌	8.19	Commit could not be cherrypicked due to conflicts
❌	9.1	Commit could not be cherrypicked due to conflicts
✅	9.2

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134359

…async search (elastic#134359)

drempapis · 2025-11-04T15:34:56Z

💔 Some backports could not be created

Status	Branch	Result
✅	9.1
❌	8.19	Conflict resolution was aborted by the user

Manual backport

To create the backport manually run:

backport --pr 134359

Questions ?

Please refer to the Backport tool documentation

…async search (elastic#134359) (cherry picked from commit f5cb6ea) # Conflicts: # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

…async search (#134359) (#137578)

BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583 HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362 Branch=main

…se in async search (#134359) (#137579) * Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359) (cherry picked from commit f5cb6ea) # Conflicts: # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java * Add logging import to AsyncSearchTask.java * Add logging imports to MutableSearchResponse * update code with missing Logger definition

AsyncSearch: make MutableSearchResponse ref-counted

f4a1a81

drempapis added >bug, Team:Search Foundations, :Search Foundations/Search, v9.2.0 labels Sep 9, 2025

drempapis and others added 5 commits September 9, 2025 13:52

Update docs/changelog/134359.yaml

025e9e3

[CI] Auto commit changes from spotless

47f554e

apply spot

4a2cdc0

Merge branch 'fix/refcounting-async-response' of github.com:drempapis…

e3b7c64

…/elasticsearch into fix/refcounting-async-response

Merge branch 'main' into fix/refcounting-async-response

f7db807

andreidan self-requested a review September 10, 2025 15:02

drempapis mentioned this pull request Sep 11, 2025

AsyncSearch - Handle released finalResponse with tryIncRef() to avoid race assertions #134299

Closed

benchaplin reviewed Sep 12, 2025

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

javanna self-requested a review September 16, 2025 14:26

andreidan reviewed Sep 22, 2025

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

drempapis and others added 8 commits October 14, 2025 11:03

Merge branch 'main' into fix/refcounting-async-response

beb3f23

update code

0c7e61d

update code

b02ce52

Merge branch 'fix/refcounting-async-response' of github.com:drempapis…

966ad7b

…/elasticsearch into fix/refcounting-async-response

[CI] Auto commit changes from spotless

987434e

update AsyncSearchTask to fallback from the store if entry available

91eac66

Merge branch 'fix/refcounting-async-response' of github.com:drempapis…

f23f78f

…/elasticsearch into fix/refcounting-async-response

remove files

dd9567d

drempapis and others added 2 commits October 22, 2025 17:30

code udpated after review

547a050

[CI] Auto commit changes from spotless

adada6a

andreidan reviewed Oct 27, 2025

drempapis added 4 commits November 3, 2025 09:54

Merge branch 'main' into fix/refcounting-async-response

9ce4cc7

update after review|

634e698

Merge branch 'main' into fix/refcounting-async-response

ce9f54d

Merge branch 'main' into fix/refcounting-async-response

e552d16

andreidan approved these changes Nov 4, 2025

x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java

drempapis added 2 commits November 4, 2025 15:43

Merge branch 'main' into fix/refcounting-async-response

2ac0077

Add comments for the given choice

063b5a8

drempapis added v9.1.0, v9.2.0, auto-backport, v8.20.0, v8.19.0 and removed v8.20.0 labels Nov 4, 2025

drempapis merged commit f5cb6ea into elastic:main Nov 4, 2025
34 checks passed

drempapis mentioned this pull request Nov 4, 2025

[9.2] Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359) #137578

Merged

elasticsearchmachine added the backport pending label Nov 4, 2025

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 4, 2025

Make MutableSearchResponse ref-counted to prevent use-after-close in …

1f4df9a

…async search (elastic#134359)

drempapis mentioned this pull request Nov 4, 2025

[9.1] Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359) #137579

Merged

elasticsearchmachine pushed a commit that referenced this pull request Nov 4, 2025

Make MutableSearchResponse ref-counted to prevent use-after-close in …

7d2335d

…async search (#134359) (#137578)

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 4, 2025

Mirror upstream elastic#134359 as single snapshot commit for AI review

041943d

BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583 HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362 Branch=main

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 4, 2025

Mirror upstream elastic#134359 as single snapshot commit for AI review

1557f48

BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583 HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362 Branch=main

drempapis mentioned this pull request Nov 5, 2025

[8.19] Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359) #137610

Merged

drempapis removed the backport pending label Nov 5, 2025

4 participants
