
Conversation

@dnhatn (Member) commented Aug 12, 2025

We have seen CPU underutilization in metrics queries against large indices when using either SEGMENT or DOC partitioning:

  1. SEGMENT partitioning does not split large segments, so a single driver may process the entire query if most matching documents are in a few segments.
  2. DOC partitioning creates a fixed number of slices. If matching documents are concentrated in a few slices, a single driver may execute the entire query.

This PR introduces dynamic-sized partitioning for DOC to address CPU underutilization while keeping overhead small:

  1. Partitioning starts with a desired partition size based on task_concurrency and caps the slice size at approximately 250K documents, preventing underutilization when matching documents are concentrated in one area.

  2. For small and medium segments (less than five times the desired slice size), a variant of segment partitioning is used, which also splits segments larger than the desired size as needed.

  3. To prevent multiple drivers from working on the same large segment unnecessarily, a single driver processes a segment sequentially until work-stealing occurs. This is accomplished by passing the current slice when polling for the next, allowing the queue to provide the next sequential slice from the same segment. New drivers are assigned slices from segments not currently being processed.
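To make point 3 concrete, here is a minimal sketch of such a queue. This is not the actual LuceneSliceQueue implementation; all names (AffinitySliceQueue, Slice, nextSlice) are illustrative stand-ins for the idea of preferring the caller's current segment and falling back to work-stealing.

import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch: a slice queue that keeps a driver on its current
// segment for as long as that segment has slices left, and otherwise falls
// back to a round-robin rotation over segments, which tends to hand new
// drivers segments that were claimed least recently.
final class AffinitySliceQueue {
    // One slice: a contiguous doc-id range within a single segment.
    record Slice(int segmentOrd, int minDoc, int maxDoc) {}

    private final ConcurrentHashMap<Integer, Queue<Slice>> perSegment = new ConcurrentHashMap<>();
    private final Queue<Integer> segments = new ConcurrentLinkedQueue<>();

    AffinitySliceQueue(List<Slice> slices) {
        for (Slice s : slices) {
            perSegment.computeIfAbsent(s.segmentOrd(), k -> new ConcurrentLinkedQueue<>()).add(s);
        }
        segments.addAll(perSegment.keySet());
    }

    // Poll the next slice. Passing the previously processed slice lets the
    // queue hand back the next sequential slice of the same segment; drivers
    // starting fresh (prev == null) pull from the rotation instead.
    Slice nextSlice(Slice prev) {
        if (prev != null) {
            Queue<Slice> same = perSegment.get(prev.segmentOrd());
            if (same != null) {
                Slice next = same.poll();
                if (next != null) {
                    return next;
                }
            }
        }
        Integer seg;
        while ((seg = segments.poll()) != null) {
            Slice next = perSegment.get(seg).poll();
            if (next != null) {
                segments.add(seg); // leave the segment stealable by others
                return next;
            }
        }
        return null; // everything has been processed
    }
}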

@dnhatn force-pushed the lucene-slice-affinity branch 2 times, most recently from b3f46c6 to 4b52c0e on August 13, 2025 06:23
@dnhatn changed the title from "Minimize segment switching in LuceneSliceQueue" to "Cap docs per slice to 250K in doc partitioning" Aug 13, 2025
@dnhatn force-pushed the lucene-slice-affinity branch 3 times, most recently from ee117e1 to 2e44fad on August 13, 2025 22:01
@dnhatn changed the title from "Cap docs per slice to 250K in doc partitioning" to "Allow smaller slices in doc partitioning" Aug 13, 2025
@dnhatn changed the title from "Allow smaller slices in doc partitioning" to "Improve cpu utilization with dynamic slice size in doc partitioning" Aug 13, 2025
@dnhatn force-pushed the lucene-slice-affinity branch 2 times, most recently from ba09dc3 to f52e21f on August 14, 2025 02:23
@dnhatn force-pushed the lucene-slice-affinity branch from f52e21f to 1c7de75 on August 14, 2025 02:28
@dnhatn requested review from martijnvg and nik9000 August 14, 2025 05:00
@nik9000 (Member) left a comment:

Makes a lot of sense to me. So for DOC partitioning this generates more slices by capping the size of slices rather than trying to make as many slices as we have concurrency. And it mitigates the cost of that by having Drivers pluck a slice from the previous segment if possible. Actually - it looks like the way it used to work is that all drivers would try to work on a single segment together and then keep moving. Sort of concentrating effort. This spreads out which segment is being worked on if possible. I quite like it.

I'm really curious what this does to DOC partitioning performance. The queries could still be a problem, but in the cases where the top level query is empty we kick this in by default.

* Partitions into dynamic-sized slices to improve CPU utilization while keeping overhead low.
* This approach is more flexible than {@link #SEGMENT} and works as follows:
*
* <p>1. The slice size starts from a desired size based on {@code task_concurrency} but is capped
Review comment (Member):
<ol>

    if (currentSlice == null || sliceIndex >= currentSlice.numLeaves()) {
        sliceIndex = 0;
-       currentSlice = sliceQueue.nextSlice();
+       currentSlice = sliceQueue.nextSlice(currentSlice);
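For illustration, here is how the new signature plays out using the hypothetical AffinitySliceQueue sketch from the description above (again an assumption-laden stand-in, not the real LuceneSlice API): a driver that hands back the slice it just finished stays on one segment until it is drained.

import java.util.List;

public class AffinityDemo {
    public static void main(String[] args) {
        var queue = new AffinitySliceQueue(List.of(
            new AffinitySliceQueue.Slice(0, 0, 250_000),
            new AffinitySliceQueue.Slice(0, 250_000, 500_000),
            new AffinitySliceQueue.Slice(1, 0, 250_000)));
        AffinitySliceQueue.Slice slice = null;
        // A single driver: always hand back the slice we just finished.
        while ((slice = queue.nextSlice(slice)) != null) {
            System.out.println("segment " + slice.segmentOrd()
                + " docs [" + slice.minDoc() + ", " + slice.maxDoc() + ")");
        }
        // Whichever segment is picked first, its two or one slices are
        // consumed consecutively before the driver moves to the other segment.
    }
}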
Review comment (Member):
👍

    }
    return slices;
+   // Cap the desired slice size to prevent CPU underutilization when matching documents are concentrated in one segment region.
+   int desiredSliceSize = Math.clamp(Math.ceilDiv(totalDocCount, requestedNumSlices), 1, MAX_DOCS_PER_SLICE);
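To make the arithmetic concrete, here is a worked example (assuming MAX_DOCS_PER_SLICE is the 250K cap from the PR title, and Java 21's Math.clamp/Math.ceilDiv; the input numbers are made up):

public class SliceSizeMath {
    public static void main(String[] args) {
        int totalDocCount = 10_000_000;
        int taskConcurrency = 8;        // formerly requestedNumSlices
        int maxDocsPerSlice = 250_000;  // assumed cap, per the PR title
        int desired = Math.clamp(Math.ceilDiv(totalDocCount, taskConcurrency), 1, maxDocsPerSlice);
        System.out.println(desired);    // 250000: ceilDiv yields 1_250_000, so the cap wins
        // ~10_000_000 / 250_000 = 40 slices instead of 8, which leaves
        // stragglers for idle drivers to steal.
    }
}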
Review comment (Member):
I think we shouldn't call this requestedNumSlices any more - it's just taskConcurrency here. At least, we're not respecting the request for the number of slices - we absolutely go above it via MAX_DOCS_PER_SLICE.

Reply from @dnhatn (Member, Author):
++ updated in 50ebdd3

@dnhatn removed the request for review from martijnvg August 15, 2025 17:32
@elasticsearchmachine (Collaborator):
Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn marked this pull request as ready for review August 15, 2025 17:36
@elasticsearchmachine (Collaborator):
Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine added the Team:Analytics label Aug 15, 2025
@dnhatn (Member, Author) commented Aug 15, 2025

@nik9000 Thanks so much for the feedback + review!

@dnhatn merged commit f9cdaaf into elastic:main Aug 15, 2025
33 of 34 checks passed
@dnhatn deleted the lucene-slice-affinity branch August 15, 2025 23:01
dnhatn added a commit that referenced this pull request Aug 18, 2025
With #132774, the overhead of running queries with DOC partitioning is 
small. While we might switch the default data partitioning to DOC for
all queries in the future, this PR defaults data partitioning to DOC for
time-series queries only to minimize any unexpected impact.


Relates #132774
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Aug 18, 2025
javanna pushed a commit to javanna/elasticsearch that referenced this pull request Aug 18, 2025
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Aug 18, 2025
dnhatn added a commit that referenced this pull request Aug 22, 2025
With query and tags, SliceQueue will contain more slices (see #132512). 
This change introduces an additional priority for query heads, allowing 
Drivers to pull slices from the same query and segment first. This
minimizes the overhead of switching between queries and segments.

Relates #132774
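A hedged sketch of what such a priority order could look like (illustrative only; Slice, queryHead, and rank are made-up names, not the actual change): prefer a slice from the same query head and segment, then the same query head, and only then steal arbitrarily.

import java.util.Comparator;
import java.util.List;

public class QueryHeadPriority {
    record Slice(int queryHead, int segmentOrd) {}

    // Lower rank is pulled first: same query and segment beats same query,
    // which beats stealing from an unrelated query.
    static int rank(Slice candidate, Slice prev) {
        if (prev == null || candidate.queryHead() != prev.queryHead()) {
            return 2;
        }
        return candidate.segmentOrd() == prev.segmentOrd() ? 0 : 1;
    }

    public static void main(String[] args) {
        Slice prev = new Slice(7, 3);
        List<Slice> pending = List.of(new Slice(5, 3), new Slice(7, 9), new Slice(7, 3));
        Slice next = pending.stream()
            .min(Comparator.comparingInt(s -> rank(s, prev)))
            .orElseThrow();
        System.out.println(next); // Slice[queryHead=7, segmentOrd=3]
    }
}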
Labels

:Analytics/ES|QL, >enhancement, Team:Analytics, v9.2.0
