Improve cpu utilization with dynamic slice size in doc partitioning #132774
Conversation
nik9000
left a comment
Makes a lot of sense to me. So for DOC partitioning this generates more slices by capping the size of slices rather than trying to make as many slices as we have concurrency. And it mitigates the cost of that by having Drivers pluck a slice from the previous segment if possible. Actually - it looks like the way it used to work is that all drivers would try to work on a single segment together and then keep moving. Sort of, concentrating effort. This spreads out which segment is being worked on if possible. I quite like it.
I'm really curious what this does to DOC partitioning performance. The queries could still be a problem, but in the cases where the top level query is empty we kick this in by default.
* Partitions into dynamic-sized slices to improve CPU utilization while keeping overhead low.
* This approach is more flexible than {@link #SEGMENT} and works as follows:
*
* <p>1. The slice size starts from a desired size based on {@code task_concurrency} but is capped
<ol>
if (currentSlice == null || sliceIndex >= currentSlice.numLeaves()) {
    sliceIndex = 0;
-   currentSlice = sliceQueue.nextSlice();
+   currentSlice = sliceQueue.nextSlice(currentSlice);
👍
}
return slices;
// Cap the desired slice to prevent CPU underutilization when matching documents are concentrated in one segment region.
int desiredSliceSize = Math.clamp(Math.ceilDiv(totalDocCount, requestedNumSlices), 1, MAX_DOCS_PER_SLICE);
I think we shouldn't call this requestedNumSlices any more - it's just taskConcurrency here. At least, we're not respecting the request for the number of slices - we absolutely go above it via MAX_DOCS_PER_SLICE.
++ updated in 50ebdd3
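To make the capping concrete, here is a minimal standalone sketch of the arithmetic in the quoted line. The class, method, and constant names are illustrative (not the actual Elasticsearch code), and 250_000 is the cap quoted in the merge commit message below, used here only as an assumption.

```java
// Standalone illustration only; names and the cap value are assumptions.
public class SliceSizeExample {
    static final int MAX_DOCS_PER_SLICE = 250_000; // assumed cap for illustration

    static int desiredSliceSize(int totalDocCount, int taskConcurrency) {
        // Spread documents evenly across the available drivers, but never let one slice
        // exceed the cap, so a dense region of matching docs still splits into many slices.
        return Math.clamp(Math.ceilDiv(totalDocCount, taskConcurrency), 1, MAX_DOCS_PER_SLICE);
    }

    public static void main(String[] args) {
        // 10M docs / 8 drivers would want 1.25M-doc slices; the cap shrinks them to 250K,
        // producing ~40 slices instead of 8, which leaves work for idle drivers to steal.
        System.out.println(desiredSliceSize(10_000_000, 8)); // 250000
        // Small index: 100K docs / 8 drivers -> 12,500-doc slices, no capping needed.
        System.out.println(desiredSliceSize(100_000, 8)); // 12500
    }
}
```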
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-analytical-engine (Team:Analytics)
@nik9000 Thanks so much for the feedback + review!
Improve cpu utilization with dynamic slice size in doc partitioning (elastic#132774)

We have seen CPU underutilization in metrics queries against large indices when using either SEGMENT or DOC partitioning:
1. SEGMENT partitioning does not split large segments, so a single driver may process the entire query if most matching documents are in a few segments.
2. DOC partitioning creates a fixed number of slices. If matching documents are concentrated in a few slices, a single driver may execute the entire query.

This PR introduces dynamic-sized partitioning for DOC to address CPU underutilization while keeping overhead small:
- Partitioning starts with a desired partition size based on task_concurrency and caps the slice size at approximately 250K documents, preventing underutilization when matching documents are concentrated in one area.
- For small and medium segments (less than five times the desired slice size), a variant of segment partitioning is used, which also splits segments larger than the desired size as needed.
- To prevent multiple drivers from working on the same large segment unnecessarily, a single driver processes a segment sequentially until work-stealing occurs. This is accomplished by passing the current slice when polling for the next, allowing the queue to provide the next sequential slice from the same segment. New drivers are assigned slices from segments not currently being processed.
…33038) With elastic#132774, the overhead of running queries with DOC partitioning is small. While we might switch the default data partitioning to DOC for all queries in the future, this PR defaults data partitioning to DOC for time-series queries only to minimize any unexpected impact. Relates elastic#132774
We have seen CPU underutilization in metrics queries against large indices when using either SEGMENT or DOC partitioning:
1. SEGMENT partitioning does not split large segments, so a single driver may process the entire query if most matching documents are in a few segments.
2. DOC partitioning creates a fixed number of slices. If matching documents are concentrated in a few slices, a single driver may execute the entire query.
This PR introduces dynamic-sized partitioning for DOC to address CPU underutilization while keeping overhead small:
Partitioning starts with a desired partition size based on task_concurrency and caps the slice size at approximately 500K documents, preventing underutilization when matching documents are concentrated in one area.
For small and medium segments (less than five times the desired slice size), a variant of segment partitioning is used, which also splits segments larger than the desired size as needed.
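As a rough, hypothetical sketch of that splitting step (the record and method names below are made up, not the actual Elasticsearch classes), each segment's doc-id range can be chopped into slices of at most the desired size:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one segment's doc-id range into slices of at most
// desiredSliceSize documents. Names are illustrative, not the real implementation.
public class DocPartitionSketch {
    // A contiguous doc-id range [minDoc, maxDoc) within one segment.
    record DocRange(int segment, int minDoc, int maxDoc) {}

    static List<DocRange> split(int segment, int maxDoc, int desiredSliceSize) {
        List<DocRange> slices = new ArrayList<>();
        for (int start = 0; start < maxDoc; start += desiredSliceSize) {
            slices.add(new DocRange(segment, start, Math.min(start + desiredSliceSize, maxDoc)));
        }
        return slices;
    }

    public static void main(String[] args) {
        // A 600K-doc segment with a 250K desired slice size becomes three slices:
        // [0, 250000), [250000, 500000), [500000, 600000).
        split(0, 600_000, 250_000).forEach(System.out::println);
    }
}
```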
To prevent multiple drivers from working on the same large segment unnecessarily, a single driver processes a segment sequentially until work-stealing occurs. This is accomplished by passing the current slice when polling for the next, allowing the queue to provide the next sequential slice from the same segment. New drivers are assigned slices from segments not currently being processed.
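The "pass the current slice when polling" behavior could look roughly like the sketch below. This is a simplified, hypothetical class, not the real slice queue: the actual queue also has to track which segments are in flight and support concurrent access efficiently, which this sketch only approximates with coarse synchronization.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical sketch of a slice queue with same-segment affinity.
class SegmentAffinitySliceQueue {
    record Slice(int segment, int minDoc, int maxDoc) {}

    // Per-segment queues of remaining slices, kept in segment order.
    private final Map<Integer, Deque<Slice>> perSegment = new LinkedHashMap<>();

    synchronized void add(Slice slice) {
        perSegment.computeIfAbsent(slice.segment(), k -> new ArrayDeque<>()).addLast(slice);
    }

    /**
     * Poll the next slice. If {@code prev} is non-null and its segment still has slices,
     * return the next sequential slice from that segment so one driver keeps working on it;
     * otherwise fall through to the first remaining slice of another segment.
     */
    synchronized Slice nextSlice(Slice prev) {
        if (prev != null) {
            Deque<Slice> sameSegment = perSegment.get(prev.segment());
            if (sameSegment != null && sameSegment.isEmpty() == false) {
                return sameSegment.pollFirst();
            }
        }
        for (Deque<Slice> slices : perSegment.values()) {
            if (slices.isEmpty() == false) {
                return slices.pollFirst();
            }
        }
        return null; // exhausted
    }
}
```

In this sketch, a driver hands back the slice it just finished, so it keeps walking one segment sequentially; a driver that starts fresh (prev == null), or whose segment is exhausted, falls through to a slice from another segment, which is where the work-stealing effect comes from.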