Description
Elasticsearch Version
main
Installed Plugins
No response
Java Version
bundled
OS Version
Darwin Kernel Version 24.6.0
Problem Description
I ran into this bug in an unrelated test. Since I have no Search expertise, I wrote a reproduction test, which is attached to this ticket (see Steps to Reproduce). When SearchQueryThenFetchAsyncAction#doRun() sends the NODE_SEARCH_ACTION_NAME actions to nodes, the response handler's handleException() seems to run into problems if one of the nodes produces a NodeDisconnectedException.
Netty buffer leak
The leak is reproducible (see the Steps to Reproduce section) by running testSearchPhaseDisconnectionLeakIssue repeatedly until the leak appears.
A potential fix that seems to prevent the leak in the reproduction is to change
if (e instanceof SendRequestTransportException || cause instanceof TaskCancelledException) {
to
if (e instanceof ActionTransportException || cause instanceof TaskCancelledException) {
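A minimal sketch of why widening the check might matter. It assumes that NodeDisconnectedException extends ConnectTransportException, and that both ConnectTransportException and SendRequestTransportException extend ActionTransportException. That is my reading of the transport exception hierarchy and should be checked against the source. The classes below are local stand-ins that only mirror the assumed hierarchy, not the real Elasticsearch types, and the `cause instanceof TaskCancelledException` half of the condition is left out.

```java
public class ExceptionHierarchySketch {
    // Stand-in classes (NOT the real Elasticsearch types). They mirror the
    // hierarchy assumed in the paragraph above; verify against the source.
    static class TransportException extends RuntimeException {}
    static class ActionTransportException extends TransportException {}
    static class SendRequestTransportException extends ActionTransportException {}
    static class ConnectTransportException extends ActionTransportException {}
    static class NodeDisconnectedException extends ConnectTransportException {}

    // First half of the condition as it is today.
    static boolean matchesCurrentCheck(Exception e) {
        return e instanceof SendRequestTransportException;
    }

    // First half of the condition after the proposed change.
    static boolean matchesProposedCheck(Exception e) {
        return e instanceof ActionTransportException;
    }

    public static void main(String[] args) {
        Exception e = new NodeDisconnectedException();
        System.out.println("current check matches:  " + matchesCurrentCheck(e));
        System.out.println("proposed check matches: " + matchesProposedCheck(e));
    }
}
```

Under that assumed hierarchy, a NodeDisconnectedException falls through the current check and is only caught by the proposed one, which would be consistent with the leak disappearing after the change.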
To confirm: whether onPhaseFailure should be called only once
This is an observation about the handleException() code that needs verification. Since multiple actions may be sent and more than one of them can fail, handleException() may be called multiple times and could therefore call onPhaseFailure multiple times. Could that end up completing the final listener more than once?
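If a single completion does turn out to be required, one generic way to enforce it is a compare-and-set guard around the failure callback. The sketch below is plain Java and not the actual search code: `notifyOnce`, the `Consumer<Exception>` callback, and the simulated per-node failures are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class OncePhaseFailureSketch {
    // Hypothetical helper: wraps a failure callback so that only the first
    // invocation is forwarded to the delegate; later calls are dropped.
    static Consumer<Exception> notifyOnce(Consumer<Exception> delegate) {
        AtomicBoolean done = new AtomicBoolean(false);
        return e -> {
            if (done.compareAndSet(false, true)) {
                delegate.accept(e);
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger completions = new AtomicInteger();
        // Stand-in for onPhaseFailure completing the final listener.
        Consumer<Exception> onPhaseFailure = notifyOnce(e -> completions.incrementAndGet());

        // Simulate several per-node requests failing independently.
        for (int node = 0; node < 3; node++) {
            onPhaseFailure.accept(new RuntimeException("node " + node + " disconnected"));
        }
        System.out.println("listener completions: " + completions.get());
    }
}
```

With the guard in place, the three simulated failures result in a single forwarded completion. Whether the real handler needs such a guard, or already has an equivalent one elsewhere, is exactly what needs confirming.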
Steps to Reproduce
Apply this patch on top of commit 4e68eccd9f663c2d4879f20c409444f3d7aa558b: 0001-Reproduction.patch. Then run testSearchPhaseDisconnectionLeakIssue multiple times until the leak appears.
Logs (if relevant)
A build scan from a local run.