SearchQueryThenFetchAsyncAction handleException method problems #137577

@kingherc

Description

Elasticsearch Version

main

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin Kernel Version 24.6.0

Problem Description

I ran into this bug in an unrelated test. Since I have no Search expertise, I wrote a reproduction test, which is attached to this ticket. When SearchQueryThenFetchAsyncAction#doRun() sends the NODE_SEARCH_ACTION_NAME actions to nodes, the response handler's handleException() seems to run into problems if one of the nodes produces a NodeDisconnectedException.

Netty buffer leak

This is reproducible (see the Steps to Reproduce section) by running the testSearchPhaseDisconnectionLeakIssue test repeatedly until the leak appears.

A potential solution that seemingly avoids the reproduction is to change

if (e instanceof SendRequestTransportException || cause instanceof TaskCancelledException) {

to

if (e instanceof ActionTransportException || cause instanceof TaskCancelledException) {
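One possible reading of why this change matters, sketched with stand-in classes: assuming NodeDisconnectedException and SendRequestTransportException are both subclasses of ActionTransportException but not of each other, the current check would not match a disconnect while the proposed one would. The class hierarchy below is an assumption made for illustration and should be checked against the Elasticsearch source; the TaskCancelledException half of the condition is left out.

```java
// Stand-in classes only. The names mirror the Elasticsearch exceptions, but the
// hierarchy here is an ASSUMPTION for illustration, not copied from the source tree.
class ActionTransportException extends RuntimeException {
    ActionTransportException(String message) { super(message); }
}

class SendRequestTransportException extends ActionTransportException {
    SendRequestTransportException(String message) { super(message); }
}

class ConnectTransportException extends ActionTransportException {
    ConnectTransportException(String message) { super(message); }
}

class NodeDisconnectedException extends ConnectTransportException {
    NodeDisconnectedException(String message) { super(message); }
}

final class HandleExceptionConditionSketch {
    // First half of the condition as it stands today.
    static boolean currentCheck(Exception e) {
        return e instanceof SendRequestTransportException;
    }

    // First half of the condition after the proposed change.
    static boolean proposedCheck(Exception e) {
        return e instanceof ActionTransportException;
    }

    public static void main(String[] args) {
        Exception disconnect = new NodeDisconnectedException("node left");
        System.out.println("current check matches disconnect:  " + currentCheck(disconnect));
        System.out.println("proposed check matches disconnect: " + proposedCheck(disconnect));
    }
}
```

If the assumed hierarchy holds, the wider check sends a NodeDisconnectedException down the same branch as a failed send, which would be consistent with the leak no longer reproducing. That does not by itself show the wider check is the right fix.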

To confirm: whether onPhaseFailure should be called only once

This is an observation from reading the handleException() code, and it needs verification. Since multiple actions may be sent and more than one of them can fail, handleException() may be called multiple times. Each call can potentially invoke onPhaseFailure, which could end up completing the final listener more than once.
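If the listener must indeed be completed only once, one common way to enforce that is a compare-and-set guard. The sketch below uses only the JDK; OnceOnlyFailureHandler and OnceOnlyDemo are hypothetical names, not Elasticsearch classes, and this is not a claim about how SearchQueryThenFetchAsyncAction should be changed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical wrapper (not an Elasticsearch class): forwards only the first failure.
final class OnceOnlyFailureHandler {
    private final AtomicBoolean done = new AtomicBoolean(false);
    private final Consumer<Exception> delegate;

    OnceOnlyFailureHandler(Consumer<Exception> delegate) {
        this.delegate = delegate;
    }

    // Returns true only for the single call that actually reached the delegate.
    boolean onFailure(Exception e) {
        if (done.compareAndSet(false, true)) {
            delegate.accept(e);
            return true;
        }
        return false; // later failures are dropped instead of completing the listener again
    }
}

final class OnceOnlyDemo {
    // Simulates several per-node failures arriving at the same handler and
    // counts how many times the final listener was completed.
    static int countForwardedFailures(int failingNodes) {
        AtomicInteger completions = new AtomicInteger();
        OnceOnlyFailureHandler handler =
            new OnceOnlyFailureHandler(e -> completions.incrementAndGet());
        for (int i = 0; i < failingNodes; i++) {
            handler.onFailure(new RuntimeException("node " + i + " disconnected"));
        }
        return completions.get();
    }

    public static void main(String[] args) {
        System.out.println("listener completions: " + countForwardedFailures(3));
    }
}
```

The compareAndSet call makes the guard safe when failures arrive concurrently from different transport threads, which is the situation described above.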

Steps to Reproduce

Apply the reproduction patch, 0001-Reproduction.patch, on top of commit 4e68eccd9f663c2d4879f20c409444f3d7aa558b, then run testSearchPhaseDisconnectionLeakIssue multiple times until the leak is reported.

Logs (if relevant)

A build scan from a local run.
