-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Make MutableSearchResponse ref-counted to prevent use-after-close in async search #134359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make MutableSearchResponse ref-counted to prevent use-after-close in async search #134359
Conversation
|
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
|
Hi @drempapis, I've created a changelog YAML for you. |
…/elasticsearch into fix/refcounting-async-response
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Show resolved
Hide resolved
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this Dimi. It'll be great to have this problem fixed soon !
I'm not super sure the proposed solution fixes the problem just yet, and also left a question about the behaviour (I think we should strive to return the results to the user, rather than a different exception?).
Apologies if I misunderstood something essential here.
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Show resolved
Hide resolved
| checkCancellation(); | ||
| AsyncSearchResponse asyncSearchResponse; | ||
| if (mutableSearchResponse.tryIncRef() == false) { | ||
| throw new ElasticsearchStatusException("async-search result, no longer available", RestStatus.GONE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the request will still fail, just a different message? (i.e. results in Discover will still not be displayed right?)
In the race condition you describe we'd have the results on this. Should we not return them from disk instead of failing here?
Or with the new response, do clients know to retry automatically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @andreidan, for reviewing this.
I'm not super sure the proposed solution fixes the problem just yet
The problem is a race: one thread (A) is building a response from MutableSearchResponse while another (B) closes the task and drops the last ref, so the reader (A) hits already closed, can’t increment ref count when calling the getResponse().
The idea here is to allow a thread that is already building a response to complete and return, while preventing subsequent calls from accessing the resource once it has been closed/released.
IIUC the request will still fail, just a different message?
With the code as written, the focus is on mitigating the race. If tryIncRef() fails, we throw GONE instead of hitting the assertion. The request still fails, but it does so in a controlled, user-facing way rather than with an assertion error.
In the race condition you describe, we'd have the results on this. Should we not return them from disk instead of failing here?
That's a good point. To guarantee safety, if tryIncRef() fails, we stop touching the in-memory finalResponse.
We may add a fallback. If the task is completed and the container is already closed, we should load and return the stored async-search result from disk, and only return GONE when the result is not found on disk.
-
tryIncRefsucceeds -> build from in-memory (fast path) -> 200. -
tryIncReffails and stored doc exists -> load from .async-search -> 200. -
tryIncReffails and stored doc missing (expired via keep_alive or deleted), -> GONE.
|
Thanks for the explanations @drempapis ! ❤️ I think I understand the proposed solution a bit more. Fixing this without able to reproduce it or fully understanding the problem gives me a bit of pause. Namely, I'd like to understand why would an async search task be closed whilst still receiving results ? We can also create a test that tries to enable the race condition more often (widen that window of opportunity for the bug to surface) if the theoretical reproduction scenario is fully understood ? If all of the above fails, could we simulate the bug with breakpoints? The goal here would be for the search operations involved in this race condition to be successful (as opposed to changing the type of error we return). |
|
Thank you @andreidan for iterating on this. I have the same concern. Without being able to reliably reproduce the scenario in a controlled environment, all we can do is hypothesize. In particular, the suggestion that this issue arises from a race between I’ve enabled logging for the async-search tests under I’ve attempted several approaches to reproduce the error locally, but so far without success:
The only way I’ve found to reproduce the issue is via a debugger-induced race
This is the only way I managed to get the assertion Although I was not able to reliably reproduce the original error, I believe this change is still an improvement: Making With reference counting, any thread that needs to build a response must successfully acquire a reference; as long as one is held, the response stays alive. Once all holders release it, |
|
Thanks @drempapis.
I think we should discuss search oriented scenarios (we're still not sure who closes the task and why). My suggestion was around adding breakpoints to delay a part of the search flow to trigger the race condition. I looked at when we close the
The scenario we suspect is that any of the above 3 scenarios runs concurrently with a request to fetch the async search task response https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java#L352 which ends up in the underlying Looking at when we request to fetch the search task response (the actual, potentially user induced, action that races with any of the cases when the search task is closed:
Can we create a test where any of the 1-3 scenarios in the first category, races with any of the 1-3 scenarios int he second category? |
…/elasticsearch into fix/refcounting-async-response
…/elasticsearch into fix/refcounting-async-response
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this Dimi.
I think this is almost ready - just a few rather minor comments 🚀
...c/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchConcurrentStatusIT.java
Show resolved
Hide resolved
...c/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchConcurrentStatusIT.java
Show resolved
Hide resolved
...c/internalClusterTest/java/org/elasticsearch/xpack/search/AsyncSearchConcurrentStatusIT.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Show resolved
Hide resolved
andreidan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this Dimi. LGTM
x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java
Show resolved
Hide resolved
💔 Backport failed
You can use sqren/backport to manually backport by running |
💔 Some backports could not be created
Manual backportTo create the backport manually run: Questions ?Please refer to the Backport tool documentation |
…async search (elastic#134359) (cherry picked from commit f5cb6ea) # Conflicts: # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java
BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583 HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362 Branch=main
BASE=c6bdd287a48ea01aace7d8d53a48c73f33ba4583 HEAD=063b5a8ee5cab13f0e75ca75111ff542f4522362 Branch=main
…se in async search (#134359) (#137579) * Make MutableSearchResponse ref-counted to prevent use-after-close in async search (#134359) (cherry picked from commit f5cb6ea) # Conflicts: # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/AsyncSearchTask.java # x-pack/plugin/async-search/src/main/java/org/elasticsearch/xpack/search/MutableSearchResponse.java * Add logging import to AsyncSearchTask.java * Add logging imports to MutableSearchResponse * update code with missing Logger definition
The current implementation of
MutableSearchResponse.toAsyncSearchResponse()assumes that iffinalResponse != null, theSearchResponsecan always be retained. It enforces this by callingmustIncRef().However, under concurrent execution, this assumption breaks down. One thread may be serializing a response while another (via AsyncSearchTask.close()) decrements the reference count. If
decRef()drops the count to0, the object is released. A latermustIncRef()will then fail, leading to assertion errors or use-after-release. This creates a race condition that can cause sporadic failures when fetching async search results, especially in the narrow window between task closure and document deletion.This PR intends to improve the safety of
MutableSearchResponseunder concurrent access. The goal isTo achieve this, the
MutableSearchResponseis now ref-counted. Any thread building a response must first successfully acquire a reference. If the container (msr) has already been released, the attempt fails with aGONEstatus instead of tripping assertions. Finally theAsyncSearchTask#getResponseholds and releases themsrexplicitly to prevent use-after-close races.