
Conversation

@drempapis
Contributor

The current implementation of MutableSearchResponse.toAsyncSearchResponse() assumes that if finalResponse != null, the SearchResponse can always be retained. It enforces this by calling mustIncRef().

However, under concurrent execution, this assumption breaks down. One thread may be serializing a response while another (via AsyncSearchTask.close()) decrements the reference count. If decRef() drops the count to 0, the object is released. A later mustIncRef() will then fail, leading to assertion errors or use-after-release. This creates a race condition that can cause sporadic failures when fetching async search results, especially in the narrow window between task closure and document deletion.

This PR improves the safety of MutableSearchResponse under concurrent access. The goals are:

  • Allow a thread that is already building a response to complete and return it.
  • Prevent subsequent calls from accessing the resource once it has been closed/released.

To achieve this, MutableSearchResponse is now ref-counted. Any thread building a response must first successfully acquire a reference. If the container (msr) has already been released, the attempt fails with a GONE status instead of tripping assertions. Finally, AsyncSearchTask#getResponse acquires and releases the msr explicitly to prevent use-after-close races.
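A minimal sketch of the guarded read described above. It is not the Elasticsearch implementation: `RefCountedHolder` is a hypothetical class, the string payload stands in for the final SearchResponse, and an `IllegalStateException` with a "GONE" message stands in for the `ElasticsearchStatusException` carrying `RestStatus.GONE`.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a ref-counted response holder; not the Elasticsearch class. */
class RefCountedHolder {
    // Starts at 1: the owning task holds the initial reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private volatile String payload = "final-response";

    /** Acquires a reference unless the count has already reached zero. */
    boolean tryIncRef() {
        while (true) {
            int current = refs.get();
            if (current == 0) {
                return false; // already released, never resurrect
            }
            if (refs.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases one reference; the call that drops the count to zero cleans up. */
    void decRef() {
        while (true) {
            int current = refs.get();
            if (current == 0) {
                throw new IllegalStateException("invalid decRef call: already closed");
            }
            if (refs.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    payload = null; // stand-in for releasing the underlying resources
                }
                return;
            }
        }
    }

    /** Guarded read: fail in a controlled way instead of tripping an assertion. */
    String getResponse() {
        if (tryIncRef() == false) {
            // Assumption: the real code would throw a status exception with GONE here.
            throw new IllegalStateException("GONE: async-search result no longer available");
        }
        try {
            return payload;
        } finally {
            decRef();
        }
    }
}
```

In this sketch a reader that acquires its reference before the owner's `decRef()` still sees the payload, while a reader arriving afterwards gets the controlled failure.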

@drempapis drempapis added >bug Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch :Search Foundations/Search Catch all for Search Foundations v9.2.0 labels Sep 9, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Collaborator

Hi @drempapis, I've created a changelog YAML for you.

@javanna javanna self-requested a review September 16, 2025 14:26
Contributor

@andreidan andreidan left a comment


Thanks for working on this, Dimi. It'll be great to have this problem fixed soon!

I'm not super sure the proposed solution fixes the problem just yet, and also left a question about the behaviour (I think we should strive to return the results to the user, rather than a different exception?).

Apologies if I misunderstood something essential here.

checkCancellation();
AsyncSearchResponse asyncSearchResponse;
if (mutableSearchResponse.tryIncRef() == false) {
throw new ElasticsearchStatusException("async-search result, no longer available", RestStatus.GONE);
Contributor

@andreidan andreidan Sep 22, 2025


IIUC the request will still fail, just a different message? (i.e. results in Discover will still not be displayed right?)

In the race condition you describe we'd have the results on this. Should we not return them from disk instead of failing here?

Or with the new response, do clients know to retry automatically?

Contributor Author


Thank you, @andreidan, for reviewing this.

I'm not super sure the proposed solution fixes the problem just yet

The problem is a race: one thread (A) is building a response from MutableSearchResponse while another (B) closes the task and drops the last ref, so the reader (A) hits "already closed, can't increment ref count" when calling getResponse().

The idea here is to allow a thread that is already building a response to complete and return, while preventing subsequent calls from accessing the resource once it has been closed/released.

IIUC the request will still fail, just a different message?

With the code as written, the focus is on mitigating the race. If tryIncRef() fails, we throw GONE instead of hitting the assertion. The request still fails, but it does so in a controlled, user-facing way rather than with an assertion error.

In the race condition you describe, we'd have the results on this. Should we not return them from disk instead of failing here?

That's a good point. To guarantee safety, if tryIncRef() fails, we stop touching the in-memory finalResponse.

We may add a fallback. If the task is completed and the container is already closed, we should load and return the stored async-search result from disk, and only return GONE when the result is not found on disk.

  • tryIncRef succeeds -> build from in-memory (fast path) -> 200.

  • tryIncRef fails and stored doc exists -> load from .async-search -> 200.

  • tryIncRef fails and stored doc missing (expired via keep_alive or deleted) -> GONE.
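The three outcomes listed above can be sketched as a small decision function. Everything here is illustrative: `AsyncResultLookup`, `InMemoryResponse`, and `Outcome` are hypothetical names, the map stands in for the .async-search index, and 410 is used as the numeric form of GONE.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AsyncResultLookup {
    /** Hypothetical outcome type: an HTTP-like status plus an optional body. */
    static final class Outcome {
        final int status;
        final String body;

        Outcome(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Hypothetical view of the in-memory response; tryAcquire models tryIncRef. */
    interface InMemoryResponse {
        boolean tryAcquire();

        String build();

        void release();
    }

    // Stand-in for the stored async-search documents, keyed by search id.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    void persist(String id, String body) {
        store.put(id, body);
    }

    Outcome get(String id, InMemoryResponse inMemory) {
        if (inMemory.tryAcquire()) {
            try {
                return new Outcome(200, inMemory.build()); // fast path from memory
            } finally {
                inMemory.release();
            }
        }
        // The in-memory container is already released: never touch it again.
        String stored = store.get(id);
        if (stored != null) {
            return new Outcome(200, stored); // fallback to the stored document
        }
        return new Outcome(410, null); // expired or deleted
    }
}
```

The point of the ordering is that the in-memory object is only read while a reference is held, and the store is consulted only after acquisition has failed.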

@andreidan
Contributor

Thanks for the explanations @drempapis ! ❤️

I think I understand the proposed solution a bit more.

Fixing this without being able to reproduce it or fully understand the problem gives me a bit of pause. Namely, I'd like to understand why an async search task would be closed whilst still receiving results.
Can we please enable this logging we added in some test suites and see who's calling close when the issue reproduces? (perhaps in the test suites where we previously saw this failure, albeit sporadically)

We can also create a test that tries to trigger the race condition more often (widen that window of opportunity for the bug to surface) if the theoretical reproduction scenario is fully understood?

If all of the above fails, could we simulate the bug with breakpoints?

The goal here would be for the search operations involved in this race condition to be successful (as opposed to changing the type of error we return).

@drempapis
Contributor Author

Thank you @andreidan for iterating on this.

I have the same concern. Without being able to reliably reproduce the scenario in a controlled environment, all we can do is hypothesize. In particular, the suggestion that this issue arises from a race between mustIncRef() and close() makes sense in theory, but without a reproducer or stack traces that prove the two threads are actually colliding, it remains only an assumption.

I’ve enabled logging for the async-search tests under
x-pack/plugin/async-search/src/internalClusterTest/java/org/elasticsearch/xpack/search on main. Since this is test-only instrumentation (no changes to product code) and fully gated by log levels, I propose backporting it to the active maintenance branches where these tests are present. The goal is to increase observability in CI and gain insights when the issue reproduces.

I’ve attempted several approaches to reproduce the error locally, but so far without success:

  • Running the async-search internal cluster tests in a loop under heavy CPU load, e.g.
Console 1
stress-ng --cpu 16

Console 2
for i in {1..2500}
do
  echo "Run #$i"
  ./gradlew :x-pack:plugin:async-search:internalClusterTest
done
  • Implemented an integration test designed to create race conditions by spawning concurrent threads issuing GET and DELETE requests against async-search tasks.
  • Tried to simulate the problem by suspending execution with breakpoints. However, placing a breakpoint in the critical path ends up parking the search worker while it still holds the same monitor that the DELETE - AsyncSearchTask.close() path needs. In this setup, the cancel thread blocks waiting to acquire the monitor.

The only way I’ve found to reproduce the issue is via a debugger-induced race:

  • Set a line breakpoint in MutableSearchResponse#toAsyncSearchResponse at the call to searchResponse.mustIncRef()
  • Trigger an async search so the breakpoint is hit and the search worker is paused
  • In the debugger, Evaluate task.close()

This is the only way I managed to get the assertion "invalid decRef call: already closed".

Although I was not able to reliably reproduce the original error, I believe this change is still an improvement: Making MutableSearchResponse ref-counted decouples its lifecycle from AsyncSearchTask so that cleanup happens only when the last active call releases it. Previously, the task could close its response while other threads were still building or serializing results, potentially leading to races, assertions, or use-after-close behavior.

With reference counting, any thread that needs to build a response must successfully acquire a reference; as long as one is held, the response stays alive. Once all holders release it, closeInternal() runs and safely tears down resources. This provides predictable cleanup, avoids subtle concurrency bugs, and aligns the resource lifecycle with actual usage rather than with task closure alone.
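The lifecycle described above, where teardown is deferred until the last holder releases, can be illustrated with a small sketch. `SketchRefCounted`, `TrackedResponse`, and `LifecycleDemo` are hypothetical names that only imitate the shape of a ref-counted resource with a `closeInternal()` hook; they are not the Elasticsearch classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical base class; it only imitates the shape of a ref-counted resource. */
abstract class SketchRefCounted {
    private final AtomicInteger refs = new AtomicInteger(1); // owner's initial reference

    final boolean tryIncRef() {
        int current;
        do {
            current = refs.get();
            if (current == 0) {
                return false; // already torn down
            }
        } while (refs.compareAndSet(current, current + 1) == false);
        return true;
    }

    /** Returns true when this call released the last reference. */
    final boolean decRef() {
        int current;
        do {
            current = refs.get();
            if (current == 0) {
                throw new IllegalStateException("invalid decRef call: already closed");
            }
        } while (refs.compareAndSet(current, current - 1) == false);
        if (current == 1) {
            closeInternal(); // runs exactly once, on the final release
            return true;
        }
        return false;
    }

    protected abstract void closeInternal();
}

class TrackedResponse extends SketchRefCounted {
    final AtomicInteger closeCalls = new AtomicInteger();

    @Override
    protected void closeInternal() {
        closeCalls.incrementAndGet(); // stand-in for releasing buffers and responses
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        TrackedResponse response = new TrackedResponse();
        boolean acquired = response.tryIncRef();    // reader starts building a response
        boolean closedByTask = response.decRef();   // task closes and drops its own reference
        // The reader still holds a reference, so teardown has not run yet.
        System.out.println(acquired + " " + closedByTask + " " + response.closeCalls.get());
        boolean closedByReader = response.decRef(); // reader finishes: last reference released
        System.out.println(closedByReader + " " + response.closeCalls.get());
    }
}
```

In this ordering the task's close does not tear anything down; cleanup happens when the reader releases its reference, and any later `tryIncRef()` fails.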

@andreidan
Contributor

Thanks @drempapis.

In the debugger, Evaluate task.close()

I think we should discuss search oriented scenarios (we're still not sure who closes the task and why). My suggestion was around adding breakpoints to delay a part of the search flow to trigger the race condition.

I looked at when we close the SearchTask and it happens in a few cases:

  1. search operation is completed within the timeout https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L167
  2. we received the final underlying search response https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L250
  3. a failure when trying to index the response or create the underlying store https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L144
    (note that onFatalFailure eventually ends up in closeTaskAndFail https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L235 - to check if this is called on timeout too, I suspect it is)

The scenario we suspect is that any of the above 3 scenarios runs concurrently with a request to fetch the async search task response https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L352 which ends up in the underlying MutableSearchResponse#toAsyncSearchResponse (and the mustIncRef call that triggers the use-after-close https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java#L249 )

Looking at when we request to fetch the search task response (the actual, potentially user-induced, action that races with any of the cases when the search task is closed):

  1. when we add a completion listener to the task https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L210 (this is called by GET _async_search/status/{id} API) - also internally part of the same call https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L253
  2. when we submit the async search task if the underlying search is still running https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/TransportSubmitAsyncSearchAction.java#L126 and to return the first batch of results
  3. and finally, when the underlying search operation completes and notifies the async search task via the search progress listener https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L544

Can we create a test where any of the 1-3 scenarios in the first category races with any of the 1-3 scenarios in the second category?
For example, could we concurrently and repeatedly call GET _async_search/status/{id} (point 1 in the second category) whilst we run the {id} async search (to race with point 2 in the first category, i.e. the async search operation just finishing successfully)?
Or could we concurrently and repeatedly call GET _async_search/status/{id} (point 1 in the second category) in parallel with an async search operation over a large corpus but with a short timeout/completion time? (point 3 in the first category)
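Before wiring such a test into a cluster, the racing pattern itself could be prototyped in isolation. The sketch below is an assumption-heavy harness, not an Elasticsearch test: `RaceHarness` and `Target` are hypothetical, reader threads stand in for repeated GET calls, and a single closer thread stands in for the task being closed. It counts invariant violations rather than exercising any real API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class RaceHarness {
    /** Hypothetical ref-counted target; stands in for the in-memory response. */
    static final class Target {
        private final AtomicInteger refs = new AtomicInteger(1);
        final AtomicInteger closes = new AtomicInteger();
        volatile boolean released;

        boolean tryIncRef() {
            int current;
            do {
                current = refs.get();
                if (current == 0) {
                    return false;
                }
            } while (refs.compareAndSet(current, current + 1) == false);
            return true;
        }

        void decRef() {
            int current;
            do {
                current = refs.get();
                if (current == 0) {
                    throw new IllegalStateException("invalid decRef call: already closed");
                }
            } while (refs.compareAndSet(current, current - 1) == false);
            if (current == 1) {
                released = true;
                closes.incrementAndGet();
            }
        }
    }

    /** Returns the number of invariant violations observed (expected to be zero). */
    static int run(int readers, int readsPerReader) {
        Target target = new Target();
        AtomicInteger violations = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(readers + 1);
        for (int r = 0; r < readers; r++) {
            pool.execute(() -> {
                awaitStart(start);
                for (int i = 0; i < readsPerReader; i++) {
                    if (target.tryIncRef()) {
                        try {
                            // While a reference is held, the target must not be released.
                            if (target.released) {
                                violations.incrementAndGet();
                            }
                        } finally {
                            target.decRef();
                        }
                    } // else: the controlled GONE path, an acceptable outcome
                }
            });
        }
        pool.execute(() -> {
            awaitStart(start);
            target.decRef(); // the closing path drops the owner's reference
        });
        start.countDown();
        pool.shutdown();
        try {
            if (pool.awaitTermination(30, TimeUnit.SECONDS) == false) {
                throw new IllegalStateException("harness timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        if (target.closes.get() != 1) {
            violations.incrementAndGet(); // teardown must run exactly once
        }
        return violations.get();
    }

    private static void awaitStart(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A call such as `RaceHarness.run(8, 10_000)` starts all threads behind one latch to widen the window in which reads and the close overlap. A real reproduction would still need the actual GET and submit paths, which this harness does not model.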

Contributor

@andreidan andreidan left a comment


Thanks for iterating on this Dimi.

I think this is almost ready - just a few rather minor comments 🚀

Contributor

@andreidan andreidan left a comment


Thanks for iterating on this Dimi. LGTM

@drempapis drempapis added v9.1.0 v9.2.0 auto-backport Automatically create backport pull requests when merged v8.20.0 v8.19.0 and removed v8.20.0 labels Nov 4, 2025
@drempapis drempapis merged commit f5cb6ea into elastic:main Nov 4, 2025
34 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts
9.1 Commit could not be cherrypicked due to conflicts
9.2

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134359

@drempapis
Contributor Author

💔 Some backports could not be created

Status Branch Result
9.1
8.19 Conflict resolution was aborted by the user

Manual backport

To create the backport manually run:

backport --pr 134359

Questions ?

Please refer to the Backport tool documentation

drempapis added a commit to drempapis/elasticsearch that referenced this pull request Nov 4, 2025
…async search (elastic#134359)

(cherry picked from commit f5cb6ea)

# Conflicts:
#	x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
#	x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 4, 2025
BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583
HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362
Branch=main
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2025
…se in async search (#134359) (#137579)

* Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359)

(cherry picked from commit f5cb6ea)

# Conflicts:
#	x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
#	x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java

* Add logging import to AsyncSearchTask.java

* Add logging imports to MutableSearchResponse

* update code with missing Logger definition

Labels

auto-backport Automatically create backport pull requests when merged >bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.19.0 v9.1.0 v9.2.0 v9.3.0


4 participants