
Conversation

@dnhatn (Member) commented Aug 12, 2025

We have seen CPU underutilization in metrics queries against large indices when using either SEGMENT or DOC partitioning:

  1. SEGMENT partitioning does not split large segments, so a single driver may process the entire query if most matching documents are in a few segments.
  2. DOC partitioning creates a fixed number of slices. If matching documents are concentrated in a few slices, a single driver may execute the entire query.

This PR introduces dynamic-sized partitioning for DOC to address CPU underutilization while keeping overhead small:

  1. Partitioning starts with a desired partition size based on task_concurrency and caps the slice size at approximately 250K documents, preventing underutilization when matching documents are concentrated in one area.

  2. For small and medium segments (less than five times the desired slice size), a variant of segment partitioning is used, which also splits segments larger than the desired size as needed.

  3. To prevent multiple drivers from working on the same large segment unnecessarily, a single driver processes a segment sequentially until work-stealing occurs. This is accomplished by passing the current slice when polling for the next, allowing the queue to provide the next sequential slice from the same segment. New drivers are assigned slices from segments not currently being processed.
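To make point 3 concrete, here is a minimal sketch of such a queue. This is not the actual LuceneSliceQueue implementation; all names (AffinitySliceQueue, Slice, nextSlice) are illustrative stand-ins for the idea of preferring the caller's current segment and falling back to work-stealing.

import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: a slice queue that keeps a driver on its current
// segment for as long as that segment has slices left, and otherwise falls
// back to a round-robin rotation over segments, which tends to hand new
// drivers segments that were claimed least recently.
final class AffinitySliceQueue {
    // One slice: a contiguous doc-id range within a single segment.
    record Slice(int segmentOrd, int minDoc, int maxDoc) {}

    private final ConcurrentHashMap<Integer, Queue<Slice>> perSegment = new ConcurrentHashMap<>();
    private final Queue<Integer> segments = new ConcurrentLinkedQueue<>();

    AffinitySliceQueue(List<Slice> slices) {
        for (Slice s : slices) {
            perSegment.computeIfAbsent(s.segmentOrd(), k -> new ConcurrentLinkedQueue<>()).add(s);
        }
        segments.addAll(perSegment.keySet());
    }

    // Poll the next slice. Passing the previously processed slice lets the
    // queue hand back the next sequential slice of the same segment; drivers
    // starting fresh (prev == null) pull from the rotation instead.
    Slice nextSlice(Slice prev) {
        if (prev != null) {
            Queue<Slice> same = perSegment.get(prev.segmentOrd());
            if (same != null) {
                Slice next = same.poll();
                if (next != null) {
                    return next;
                }
            }
        }
        Integer seg;
        while ((seg = segments.poll()) != null) {
            Slice next = perSegment.get(seg).poll();
            if (next != null) {
                segments.add(seg); // leave the segment stealable by others
                return next;
            }
        }
        return null; // everything has been processed
    }
}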

@dnhatn force-pushed the lucene-slice-affinity branch 2 times, most recently from b3f46c6 to 4b52c0e on August 13, 2025 06:23
@dnhatn changed the title from "Minimize segment switching in LuceneSliceQueue" to "Cap docs per slice to 250K in doc partitioning" Aug 13, 2025
@dnhatn force-pushed the lucene-slice-affinity branch 3 times, most recently from ee117e1 to 2e44fad on August 13, 2025 22:01
@dnhatn changed the title from "Cap docs per slice to 250K in doc partitioning" to "Allow smaller slices in doc partitioning" Aug 13, 2025
@dnhatn changed the title from "Allow smaller slices in doc partitioning" to "Improve cpu utilization with dynamic slice size in doc partitioning" Aug 13, 2025
@dnhatn force-pushed the lucene-slice-affinity branch 2 times, most recently from ba09dc3 to f52e21f on August 14, 2025 02:23
@dnhatn force-pushed the lucene-slice-affinity branch from f52e21f to 1c7de75 on August 14, 2025 02:28
@dnhatn requested review from martijnvg and nik9000 August 14, 2025 05:00
@nik9000 (Member) left a comment:

Makes a lot of sense to me. So for DOC partitioning this generates more slices by capping the size of slices rather than trying to make as many slices as we have concurrency. And it mitigates the cost of that by having Drivers pluck a slice from the previous segment if possible. Actually - it looks like the way it used to work is that all drivers would try to work on a single segment together and then keep moving. Sort of concentrating effort. This spreads out which segment is being worked on if possible. I quite like it.

I'm really curious what this does to DOC partitioning performance. The queries could still be a problem, but in the cases where the top level query is empty we kick this in by default.

* Partitions into dynamic-sized slices to improve CPU utilization while keeping overhead low.
* This approach is more flexible than {@link #SEGMENT} and works as follows:
*
* <p>1. The slice size starts from a desired size based on {@code task_concurrency} but is capped
Review comment (Member):
<ol>

    if (currentSlice == null || sliceIndex >= currentSlice.numLeaves()) {
        sliceIndex = 0;
-       currentSlice = sliceQueue.nextSlice();
+       currentSlice = sliceQueue.nextSlice(currentSlice);
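For illustration, here is how the new signature plays out using the hypothetical AffinitySliceQueue sketch from the description above (again an assumption-laden stand-in, not the real LuceneSlice API): a driver that hands back the slice it just finished stays on one segment until it is drained.

import java.util.List;

public class AffinityDemo {
    public static void main(String[] args) {
        var queue = new AffinitySliceQueue(List.of(
            new AffinitySliceQueue.Slice(0, 0, 250_000),
            new AffinitySliceQueue.Slice(0, 250_000, 500_000),
            new AffinitySliceQueue.Slice(1, 0, 250_000)));
        AffinitySliceQueue.Slice slice = null;
        // A single driver: always hand back the slice we just finished.
        while ((slice = queue.nextSlice(slice)) != null) {
            System.out.println("segment " + slice.segmentOrd()
                + " docs [" + slice.minDoc() + ", " + slice.maxDoc() + ")");
        }
        // Whichever segment is picked first, its two or one slices are
        // consumed consecutively before the driver moves to the other segment.
    }
}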
Review comment (Member):
👍

    }
    return slices;
+   // Cap the desired slice size to prevent CPU underutilization when matching documents are concentrated in one segment region.
+   int desiredSliceSize = Math.clamp(Math.ceilDiv(totalDocCount, requestedNumSlices), 1, MAX_DOCS_PER_SLICE);
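To make the arithmetic concrete, here is a worked example (assuming MAX_DOCS_PER_SLICE is the 250K cap from the PR title, and Java 21's Math.clamp/Math.ceilDiv; the input numbers are made up):

public class SliceSizeMath {
    public static void main(String[] args) {
        int totalDocCount = 10_000_000;
        int taskConcurrency = 8;        // formerly requestedNumSlices
        int maxDocsPerSlice = 250_000;  // assumed cap, per the PR title
        int desired = Math.clamp(Math.ceilDiv(totalDocCount, taskConcurrency), 1, maxDocsPerSlice);
        System.out.println(desired);    // 250000: ceilDiv yields 1_250_000, so the cap wins
        // ~10_000_000 / 250_000 = 40 slices instead of 8, which leaves
        // stragglers for idle drivers to steal.
    }
}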
Review comment (Member):
I think we shouldn't call this requestedNumSlices any more - it's just taskConcurrency here. At least, we're not respecting the request for the number of slices - we absolutely go above it via MAX_DOCS_PER_SLICE.

Reply from @dnhatn (Member, Author):
++ updated in 50ebdd3

@dnhatn removed the request for review from martijnvg August 15, 2025 17:32
@elasticsearchmachine (Collaborator):
Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn marked this pull request as ready for review August 15, 2025 17:36
@elasticsearchmachine (Collaborator):
Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine added the Team:Analytics label Aug 15, 2025
@dnhatn (Member, Author) commented Aug 15, 2025

@nik9000 Thanks so much for the feedback + review!

@dnhatn merged commit f9cdaaf into elastic:main Aug 15, 2025
33 of 34 checks passed
@dnhatn deleted the lucene-slice-affinity branch August 15, 2025 23:01
dnhatn added a commit that referenced this pull request Aug 18, 2025
With #132774, the overhead of running queries with DOC partitioning is 
small. While we might switch the default data partitioning to DOC for
all queries in the future, this PR defaults data partitioning to DOC for
time-series queries only to minimize any unexpected impact.


Relates #132774
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Aug 18, 2025
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Aug 18, 2025
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Aug 18, 2025
dnhatn added a commit that referenced this pull request Aug 22, 2025
With query and tags, SliceQueue will contain more slices (see #132512). 
This change introduces an additional priority for query heads, allowing 
Drivers to pull slices from the same query and segment first. This
minimizes the overhead of switching between queries and segments.

Relates #132774
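A hedged sketch of what such a priority order could look like (illustrative only; Slice, queryHead, and rank are made-up names, not the actual change): prefer a slice from the same query head and segment, then the same query head, and only then steal arbitrarily.

import java.util.Comparator;
import java.util.List;

public class QueryHeadPriority {
    record Slice(int queryHead, int segmentOrd) {}

    // Lower rank is pulled first: same query and segment beats same query,
    // which beats stealing from an unrelated query.
    static int rank(Slice candidate, Slice prev) {
        if (prev == null || candidate.queryHead() != prev.queryHead()) {
            return 2;
        }
        return candidate.segmentOrd() == prev.segmentOrd() ? 0 : 1;
    }

    public static void main(String[] args) {
        Slice prev = new Slice(7, 3);
        List<Slice> pending = List.of(new Slice(5, 3), new Slice(7, 9), new Slice(7, 3));
        Slice next = pending.stream()
            .min(Comparator.comparingInt(s -> rank(s, prev)))
            .orElseThrow();
        System.out.println(next); // Slice[queryHead=7, segmentOrd=3]
    }
}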
Labels

:Analytics/ES|QL, >enhancement, Team:Analytics, v9.2.0
