@benchaplin
Contributor

Resolves #122707.

Currently, we close the consumer (thereby decRef-ing all consumed shard results) once all shards in the batched query request are complete. I found that SearchWithRandomDisconnectsIT often causes the batched request to complete before all shards are done, which leaks the QuerySearchResults sitting in an unclosed consumer. To reproduce this locally, run the suite with @Repeat(iterations = 100).

I think we should mirror what is done on the coordinating node side: tie the consumer to the request listener (see AbstractSearchAsyncAction).

@benchaplin added the labels >non-issue, auto-backport, Team:Search Foundations, :Search Foundations/Search, branch:9.2, branch:9.1 on Nov 6, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

this.task = task;
this.countDown = new CountDown(queryPhaseResultConsumer.getNumShards());
this.channel = channel;
this.listener = ActionListener.releaseBefore(queryPhaseResultConsumer, new ChannelActionListener<>(channel));
Contributor

This approach structurally looks better, as it centralizes the consumer’s lifecycle, ensuring that every success/failure path passes through the same wrapper.

Both approaches behave the same:

  • Serialize with an open consumer.
  • Close the consumer.
  • Send the response (or failure).
  • Let respondAndRelease free the bytes.

The code in main:

  • Uses a local ChannelActionListener and a try (queryPhaseResultConsumer) block.
  • On success, the consumer is closed at the end of the try-with-resources (consumer) block (i.e., after serialization finishes, before building/sending the transport response).
  • On failure, the consumer is also closed by the try-with-resources and failure is sent via channelListener.
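To make the ordering on main concrete, here is a minimal, self-contained sketch of the try-with-resources pattern described above. It is not the Elasticsearch code: `Consumer`, `serialize`, and `handle` are hypothetical stand-ins, and the event strings only record the order in which things happen.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesSketch {

    /** Hypothetical stand-in for the result consumer; records when it is closed. */
    static final class Consumer implements AutoCloseable {
        private final List<String> events;

        Consumer(List<String> events) {
            this.events = events;
        }

        /** Pretends to serialize the merged result while the consumer is still open. */
        String serialize(boolean fail) {
            if (fail) {
                throw new IllegalStateException("serialization failed");
            }
            events.add("serialized");
            return "bytes";
        }

        @Override
        public void close() {
            events.add("consumer closed");
        }
    }

    /** Returns the ordered list of events for one success or failure run. */
    static List<String> handle(boolean fail) {
        List<String> events = new ArrayList<>();
        String bytes;
        try (Consumer consumer = new Consumer(events)) {
            bytes = consumer.serialize(fail);
        } catch (Exception e) {
            // The resource has already been closed by the time this catch block runs.
            events.add("sent failure: " + e.getMessage());
            return events;
        }
        // The consumer is closed here as well, before the response is sent.
        events.add("sent response: " + bytes);
        return events;
    }

    public static void main(String[] args) {
        System.out.println(handle(false));
        System.out.println(handle(true));
    }
}
```

In both runs the "consumer closed" event is recorded before anything is sent, which matches the bullets above.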

The PR's code:

  • Wraps the channel listener with releaseBefore(consumer, …) so the consumer is always closed before sending success/failure.
  • On success, the consumer is closed right before delegating to the channel (via wrapper). Serialization happens with consumer open; then the wrapper closes it and writes.
  • On failure, listener.onFailure(e) closes the consumer first (via wrapper) and then writes the failure.
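The wrapper behaviour described in these bullets can be sketched in plain Java. This is an illustration under assumptions, not the real ActionListener implementation: the `Listener` interface, the local `releaseBefore` helper, and `run` are hypothetical and only model the "close first, then delegate" ordering.

```java
import java.util.ArrayList;
import java.util.List;

public class ReleaseBeforeSketch {

    /** Hypothetical stand-in for a listener with success and failure paths. */
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    /** Wraps a delegate so the resource is closed before either callback is forwarded. */
    static <T> Listener<T> releaseBefore(AutoCloseable resource, Listener<T> delegate) {
        return new Listener<>() {
            @Override
            public void onResponse(T response) {
                close(resource);
                delegate.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                close(resource);
                delegate.onFailure(e);
            }
        };
    }

    /** Closes the resource, rethrowing any checked exception as unchecked. */
    static void close(AutoCloseable resource) {
        try {
            resource.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the ordered list of events for one success or failure completion. */
    static List<String> run(boolean succeed) {
        List<String> events = new ArrayList<>();
        AutoCloseable consumer = () -> events.add("consumer closed");
        Listener<String> channel = new Listener<>() {
            @Override
            public void onResponse(String response) {
                events.add("sent response: " + response);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("sent failure: " + e.getMessage());
            }
        };
        Listener<String> listener = releaseBefore(consumer, channel);
        if (succeed) {
            listener.onResponse("merged");
        } else {
            listener.onFailure(new RuntimeException("disconnected"));
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

Because every completion goes through the same wrapper, there is a single place where the consumer is released, regardless of which code path triggers the callback.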

I prefer the updated code, as releasing the consumer uniformly on both success and failure is cleaner; however, I’m not convinced it addresses the underlying issue.

I ran multiple local executions with @Repeat of SearchWithRandomDisconnectsIT#testSearchWithRandomDisconnects and was unable to reproduce the failure. Not sure if something changed or if it’s just my machine.

Contributor Author

Hm, you're right, I'm struggling to reproduce on main myself now. I could have sworn I continued to see failures after my #136889 fix, but I might be mistaken - perhaps that solved it.

What do you think? I'm inclined to table this change, which might still be a worthy improvement, and just unmute SearchWithRandomDisconnectsIT for now. We can see if it's still failing in CI.

@benchaplin
Contributor Author

Closing this for now as I've opened #137763 to simply unmute. This may be a worthy improvement for the future, but it is not a priority. Shout-out to @drempapis for double-checking me here.

@benchaplin benchaplin closed this Nov 7, 2025
Labels

auto-backport, >non-issue, :Search Foundations/Search, Team:Search Foundations, v9.1.8, v9.2.2, v9.3.0

Development

Successfully merging this pull request may close these issues.

[CI] SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects failing

3 participants