
@luigidellaquila commented Nov 13, 2025

Fixing TopHits memory management

Problems and TODO (spotted so far):

  • SearchContext.checkRealMemoryCB doesn't actually account memory in the CB: it always charges zero bytes
  • FetchPhase.buildSearchHits batches are too small; the memory buffer never accumulates enough to be tracked
  • We don't release the bytes charged to the CB
  • plumb TopHitsAggregator memory management lifecycle
  • Add tests that trigger the CB
  • plumb InnerHitsPhase memory management lifecycle
  • plumb SearchService.execute*Phase() memory management lifecycle
  • TopHitsAggregator.subSearchContext.closeFuture grows too much; this is due to this block, so it's irrelevant in prod.

Fixes: #136836

}
if (context.checkRealMemoryCB(locallyAccumulatedBytes[0], "fetch source")) {
    // if we checked the real memory breaker, we restart our local accounting
    locallyAccumulatedBytes[0] = 0;

@luigidellaquila commented Nov 14, 2025

Most of the time these batches were too small to reach the buffer size, so this check never triggered.
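
A minimal sketch of the local-accounting pattern in the snippet above, assuming (as the TODO list suggests) that the local counter starts from zero for every FetchPhase.buildSearchHits batch. `RecordingBreaker`, `checkRealMemory` and the 1024-byte `BUFFER_SIZE` are hypothetical stand-ins for the real breaker, `checkRealMemoryCB` and `memAccountingBufferSize()`. It shows how many small batches can fetch thousands of bytes without the breaker ever being consulted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferedAccountingSketch {
    // Hypothetical stand-in for a circuit breaker: it only records every charge it receives.
    static final class RecordingBreaker {
        final List<Integer> charges = new ArrayList<>();

        void addEstimateBytesAndMaybeBreak(int bytes, String label) {
            charges.add(bytes);
        }
    }

    // Hypothetical buffer size, playing the role of memAccountingBufferSize().
    static final int BUFFER_SIZE = 1024;

    // Same shape as checkRealMemoryCB: consult the breaker only once the buffer is full.
    static boolean checkRealMemory(RecordingBreaker breaker, int locallyAccumulated, String label) {
        if (locallyAccumulated >= BUFFER_SIZE) {
            breaker.addEstimateBytesAndMaybeBreak(locallyAccumulated, label);
            return true;
        }
        return false;
    }

    // One fetch batch; the local counter is a fresh one-element array per batch (assumption).
    static void fetchBatch(RecordingBreaker breaker, int[] sourceSizes) {
        int[] locallyAccumulatedBytes = new int[1];
        for (int size : sourceSizes) {
            locallyAccumulatedBytes[0] += size;
            if (checkRealMemory(breaker, locallyAccumulatedBytes[0], "fetch source")) {
                // The breaker was consulted, so restart local accounting.
                locallyAccumulatedBytes[0] = 0;
            }
        }
    }

    public static void main(String[] args) {
        // Ten batches of three 100-byte sources: 3000 bytes fetched, no batch reaches 1024.
        RecordingBreaker small = new RecordingBreaker();
        for (int i = 0; i < 10; i++) {
            fetchBatch(small, new int[] {100, 100, 100});
        }
        System.out.println("small batches -> charges: " + small.charges); // []

        // One batch of thirty 100-byte sources: the check fires every 11 documents.
        RecordingBreaker large = new RecordingBreaker();
        int[] sizes = new int[30];
        Arrays.fill(sizes, 100);
        fetchBatch(large, sizes);
        System.out.println("one large batch -> charges: " + large.charges); // [1100, 1100]
    }
}
```

The last 800 bytes of the large batch are never reported either, which is the same effect at a smaller scale.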

}
RankFeatureShardPhase.prepareForFetch(searchContext, request);
- fetchPhase.execute(searchContext, docIds, null);
+ fetchPhase.execute(searchContext, docIds, null, i -> {});

@luigidellaquila commented

These did not account for memory before anyway, but now they could.
The hard part is releasing the CB; I don't see any close/release logic around here, and I'm not very familiar with this code. Maybe this could be a follow-up.
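
A minimal sketch of the add-then-release lifecycle this comment refers to, assuming a hypothetical `TrackingBreaker` (not Elasticsearch's `CircuitBreaker` interface) and a hypothetical `AccountingScope` that remembers how many bytes it charged. Closing the scope with try-with-resources gives the bytes back whether the fetch completes or the breaker trips midway; in the real code the release would have to be tied to whatever owns the fetch, which is the part left as a follow-up.

```java
public class ReleaseOnCloseSketch {
    // Hypothetical breaker with a fixed limit; a negative delta releases bytes.
    static final class TrackingBreaker {
        private final long limit;
        private long used;

        TrackingBreaker(long limit) {
            this.limit = limit;
        }

        void addEstimateBytesAndMaybeBreak(long bytes, String label) {
            if (used + bytes > limit) {
                throw new IllegalStateException("[" + label + "] would use " + (used + bytes) + " > limit " + limit);
            }
            used += bytes;
        }

        void addWithoutBreaking(long bytes) {
            used += bytes;
        }

        long getUsed() {
            return used;
        }
    }

    // Hypothetical accounting scope: tracks what it charged and releases it on close.
    static final class AccountingScope implements AutoCloseable {
        private final TrackingBreaker breaker;
        private long charged;

        AccountingScope(TrackingBreaker breaker) {
            this.breaker = breaker;
        }

        void charge(long bytes, String label) {
            breaker.addEstimateBytesAndMaybeBreak(bytes, label);
            charged += bytes; // only reached if the breaker accepted the bytes
        }

        @Override
        public void close() {
            breaker.addWithoutBreaking(-charged);
            charged = 0;
        }
    }

    // A fetch that charges each source and always releases on exit.
    static void fetch(TrackingBreaker breaker, long[] sourceSizes) {
        try (AccountingScope scope = new AccountingScope(breaker)) {
            for (long size : sourceSizes) {
                scope.charge(size, "fetch source");
            }
        }
    }

    public static void main(String[] args) {
        TrackingBreaker breaker = new TrackingBreaker(1_000);
        fetch(breaker, new long[] {300, 300});
        System.out.println("after successful fetch: " + breaker.getUsed()); // 0
        try {
            fetch(breaker, new long[] {600, 600});
        } catch (IllegalStateException e) {
            System.out.println("tripped: " + e.getMessage());
        }
        System.out.println("after tripped fetch: " + breaker.getUsed()); // 0
    }
}
```

Because `charged` only counts bytes the breaker accepted, a trip never releases more than was added.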

docIdsToLoad[i] = topDocs.scoreDocs[i].doc;
}
- FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad);
+ FetchSearchResult fetchResult = runFetchPhase(subSearchContext, docIdsToLoad, this::addRequestCircuitBreakerBytes);

@luigidellaquila commented Nov 14, 2025

Should we batch here and avoid invoking the CB for every document?
Maybe addRequestCircuitBreakerBytes should take care of this?

I suspect that fetching the source is much more expensive than invoking the CB, so I'm not sure we want more complexity here.
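
If batching were pushed into the callback, as the second question suggests, it could look like this sketch. `BatchingByteConsumer` is hypothetical: it wraps a `java.util.function.LongConsumer` (standing in for `addRequestCircuitBreakerBytes`, whose exact signature is not shown in this thread) and forwards the pending total only once it reaches a threshold, plus a final `flush()` for the remainder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

public class BatchingByteConsumerSketch {
    // Hypothetical wrapper: accumulates per-document byte counts and forwards them
    // to the delegate only when the pending total reaches the threshold.
    static final class BatchingByteConsumer implements LongConsumer {
        private final LongConsumer delegate;
        private final long threshold;
        private long pending;

        BatchingByteConsumer(LongConsumer delegate, long threshold) {
            this.delegate = delegate;
            this.threshold = threshold;
        }

        @Override
        public void accept(long bytes) {
            pending += bytes;
            if (pending >= threshold) {
                flush();
            }
        }

        // Forwards whatever is still pending; call once after the last document.
        void flush() {
            if (pending > 0) {
                long toForward = pending;
                pending = 0;
                delegate.accept(toForward);
            }
        }
    }

    public static void main(String[] args) {
        // The list stands in for the breaker and records each forwarded call.
        List<Long> breakerCalls = new ArrayList<>();
        BatchingByteConsumer consumer = new BatchingByteConsumer(breakerCalls::add, 1_000);
        for (int doc = 0; doc < 25; doc++) {
            consumer.accept(120); // pretend each document's source is 120 bytes
        }
        consumer.flush();
        System.out.println(breakerCalls); // [1080, 1080, 840]
    }
}
```

This turns 25 breaker calls into 3, at the cost of the breaker seeing up to one threshold's worth of bytes late, which is the trade-off the comment weighs against the cost of fetching the source.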

@luigidellaquila marked this pull request as ready for review November 14, 2025 08:56
@elasticsearchmachine added the Team:Analytics label Nov 14, 2025
@elasticsearchmachine commented

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine commented

Pinging @elastic/es-analytical-engine (Team:Analytics)

public final boolean checkRealMemoryCB(int locallyAccumulatedBytes, String label) {
    if (locallyAccumulatedBytes >= memAccountingBufferSize()) {
-        circuitBreaker().addEstimateBytesAndMaybeBreak(0, label);
+        circuitBreaker().addEstimateBytesAndMaybeBreak(locallyAccumulatedBytes, label);

@luigidellaquila commented

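This was the crux: the breaker was always charged 0 bytes, so the locally accumulated memory was never accounted for.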
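
A minimal sketch of why charging 0 bytes made the check ineffective, assuming a hypothetical `EstimateBreaker` that trips purely on its own running estimate. Real Elasticsearch breakers may additionally check parent or real-memory limits, which this sketch does not model. With the old argument the estimate never grows, so the breaker cannot trip on source fetching; charging `locallyAccumulatedBytes` lets it trip.

```java
public class CheckRealMemoryFixSketch {
    // Hypothetical breaker that trips only when its own estimate exceeds the limit.
    static final class EstimateBreaker {
        private final long limit;
        private long estimate;

        EstimateBreaker(long limit) {
            this.limit = limit;
        }

        void addEstimateBytesAndMaybeBreak(long bytes, String label) {
            long next = estimate + bytes;
            if (next > limit) {
                throw new IllegalStateException("[" + label + "] breaker tripped at " + next + " bytes");
            }
            estimate = next;
        }

        long getEstimate() {
            return estimate;
        }
    }

    static final int BUFFER_SIZE = 1024; // hypothetical memAccountingBufferSize()

    // Old behaviour: the buffer threshold is checked, but 0 bytes are charged.
    static boolean checkOld(EstimateBreaker breaker, int locallyAccumulatedBytes, String label) {
        if (locallyAccumulatedBytes >= BUFFER_SIZE) {
            breaker.addEstimateBytesAndMaybeBreak(0, label);
            return true;
        }
        return false;
    }

    // Fixed behaviour: the accumulated bytes are actually charged.
    static boolean checkFixed(EstimateBreaker breaker, int locallyAccumulatedBytes, String label) {
        if (locallyAccumulatedBytes >= BUFFER_SIZE) {
            breaker.addEstimateBytesAndMaybeBreak(locallyAccumulatedBytes, label);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 100 buffers of 2 KiB against a 10,000-byte limit.
        EstimateBreaker oldBreaker = new EstimateBreaker(10_000);
        for (int i = 0; i < 100; i++) {
            checkOld(oldBreaker, 2_048, "fetch source");
        }
        System.out.println("old estimate: " + oldBreaker.getEstimate()); // old estimate: 0

        EstimateBreaker fixedBreaker = new EstimateBreaker(10_000);
        try {
            for (int i = 0; i < 100; i++) {
                checkFixed(fixedBreaker, 2_048, "fetch source");
            }
        } catch (IllegalStateException e) {
            // Trips on the fifth buffer: 5 x 2048 = 10240 > 10000.
            System.out.println("fixed: " + e.getMessage());
        }
    }
}
```

A test along these lines (a small limit, enough fetched bytes to exceed it) is one way to cover the "Add tests that trigger the CB" item.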


Labels: :Analytics/Aggregations, >bug, Team:Analytics, v9.3.0


Development

Successfully merging this pull request may close these issues.

Should we be more aggressive with CB checks for TopHits source fetching?
