New execution model for rates #134267
Conversation
6ba4367 to 681169a
import java.util.List;

public final class RateDoubleGroupingAggregatorFunction implements GroupingAggregatorFunction {
These new rate classes are the main change; the rest removes the specialized execution path for rates.
681169a to 0fab629
/**
 * {@link GroupingAggregatorFunction} implementation for {@link OldRateDoubleAggregator}.
 */
public final class OldRateDoubleGroupingAggregatorFunction implements GroupingAggregatorFunction {
This was moved because Javadoc does not work with code-generated classes. This class will be removed in a follow-up.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Pinging @elastic/es-analytical-engine (Team:Analytics)
}
return;
}
class Slice {
Nit: move outside the method as a private class for readability.
.../src/main/java/org/elasticsearch/compute/aggregation/RateLongGroupingAggregatorFunction.java
pq.updateTop();
}
var val = buffer.values.get(position);
reset += dv(val, prevValue) + dv(prevValue, lastValue) - dv(val, lastValue);
I'm confused by this one. Mind adding a comment about how this works? I'd think (naively) that we only need to keep adding dv(val, prevValue) to the delta, as that one detects resets.
I think this formula allows tracking (a) the delta before resetting, and (b) using 0 as the new lower bound for values. If so, it's worth documenting in a comment.
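For intuition, here is a hedged sketch of why a three-term correction like this makes sense when the merge inserts a sample between two neighbors whose reset contribution was already accounted for. The definition of `dv` (the reset contribution between two consecutive counter samples) and its argument order here are my assumptions for illustration, not the PR's actual code:

```java
public final class ResetTracking {
    // Assumed semantics: dv(older, newer) is the reset contribution between
    // two consecutive counter samples. If the counter dropped, the amount
    // lost is the older value (the counter restarted near zero); otherwise
    // there was no reset.
    static double dv(double older, double newer) {
        return newer < older ? older : 0;
    }

    public static void main(String[] args) {
        // Already accounted for: a direct step from lastValue=10 to val=4,
        // i.e. one apparent reset of size dv(10, 4) = 10.
        double lastValue = 10, val = 4;
        double reset = dv(lastValue, val);

        // The merge now inserts prevValue=12 between them. The path
        // lastValue -> prevValue -> val replaces the direct step, so the
        // correction adds the two sub-path contributions and subtracts the
        // direct one that was counted before:
        double prevValue = 12;
        reset += dv(lastValue, prevValue) + dv(prevValue, val) - dv(lastValue, val);
        System.out.println(reset); // 12.0 -- the true reset size
    }
}
```

With only the direct step, the reset looked like 10; once the intermediate sample 12 is known, the drop was actually from 12 to 4, and the correction repairs exactly that difference.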
}
}

PriorityQueue<Slice> pq = new PriorityQueue<>(buffer.sliceOffsets.length + 1) {
Nit: the definition and initialization of the priority queue can also be moved into Slice for readability.
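The merge the priority queue performs can be sketched as a plain k-way merge: each slice is a sorted run of timestamps, and the queue always yields the slice whose current head is smallest. The diff uses a Lucene-style PriorityQueue (with `updateTop()`); this sketch uses `java.util.PriorityQueue` for self-containment, and all names here are illustrative rather than the PR's:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public final class SliceMerge {
    static final class Slice {
        final long[] timestamps; // one sorted run from the buffer
        int pos;
        Slice(long[] timestamps) { this.timestamps = timestamps; }
        long head() { return timestamps[pos]; }
        boolean exhausted() { return pos >= timestamps.length; }
    }

    static List<Long> merge(List<Slice> slices) {
        // Order slices by the timestamp at their current head.
        PriorityQueue<Slice> pq = new PriorityQueue<>(Comparator.comparingLong(Slice::head));
        for (Slice s : slices) {
            if (s.exhausted() == false) pq.add(s);
        }
        List<Long> out = new ArrayList<>();
        while (pq.isEmpty() == false) {
            Slice top = pq.poll();
            out.add(top.head());
            top.pos++;
            if (top.exhausted() == false) pq.add(top); // reinsert with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> merged = merge(List.of(
            new Slice(new long[] {1, 4, 9}),
            new Slice(new long[] {2, 3, 10})
        ));
        System.out.println(merged); // [1, 2, 3, 4, 9, 10]
    }
}
```

Lucene's `updateTop()` avoids the poll/add pair by re-sifting the top element in place, which is why the real code advances the top slice and then calls `pq.updateTop()`.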
return state;
}

static class Buffer implements Releasable {
Consider adding a comment outlining how the buffer works, i.e. storing multiple slices in a single array and tracking the start offset of each one.
int newSize = totalCount + count;
timestamps = bigArrays.grow(timestamps, newSize);
values = bigArrays.grow(values, newSize);
if (totalCount > 0 && firstTimestamp > timestamps.get(totalCount - 1)) {
Add a comment that timestamps appearing out of order indicate the start of a new slice.
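The buffer idea being discussed can be sketched as follows: all samples live in one pair of growable arrays, and `sliceOffsets` records where each sorted run begins; a batch whose first timestamp cannot extend the current run starts a new slice. The diff's comparison (`firstTimestamp > last`) suggests its runs are sorted descending; this sketch assumes ascending runs for readability and flips the comparison. Field and method names are assumptions, not the PR's exact API:

```java
import java.util.ArrayList;
import java.util.List;

public final class Buffer {
    private long[] timestamps = new long[0];
    private double[] values = new double[0];
    private int totalCount;
    // Start index of every run after the first one.
    final List<Integer> sliceOffsets = new ArrayList<>();

    void append(long[] ts, double[] vs) {
        int newSize = totalCount + ts.length;
        timestamps = java.util.Arrays.copyOf(timestamps, newSize);
        values = java.util.Arrays.copyOf(values, newSize);
        // A first timestamp that is out of order relative to the last stored
        // one cannot extend the current run: record the start of a new slice.
        if (totalCount > 0 && ts[0] < timestamps[totalCount - 1]) {
            sliceOffsets.add(totalCount);
        }
        System.arraycopy(ts, 0, timestamps, totalCount, ts.length);
        System.arraycopy(vs, 0, values, totalCount, vs.length);
        totalCount += ts.length;
    }

    public static void main(String[] args) {
        Buffer b = new Buffer();
        b.append(new long[] {1, 5, 9}, new double[] {1, 2, 3});
        b.append(new long[] {3, 7}, new double[] {4, 5}); // 3 < 9: new slice
        b.append(new long[] {11}, new double[] {6});      // 11 > 7: same slice
        System.out.println(b.sliceOffsets); // [3]
    }
}
```

The real class uses `BigArrays` so allocations are tracked by the circuit breaker; plain array copies stand in for `bigArrays.grow` here.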
};
{
int startOffset = 0;
for (int sliceOffset : buffer.sliceOffsets) {
Nit: this loop seems to belong more to Buffer :) It could return an array of slices to then be fed to the priority queue.
@Override
public void evaluateIntermediate(Block[] blocks, int offset, IntVector selected) {
flushBuffers(selected);
To confirm my understanding, this runs in a single thread per shard per bucket, after we collect all the data from all segments for this bucket?
kkrik-es
left a comment
This is really, really cool - and you get to clean up so much special logic as a bonus. Just some nits to improve readability.
The comment about in-mem buffering is legit. I wonder if we can fall back to the slow path if we estimate that the buffers will exceed e.g. 1% of available memory.
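The fallback idea suggested here could look roughly like the following. This is a hypothetical sketch: the 1% budget, the 16-bytes-per-point figure (one 8-byte timestamp plus one 8-byte value, per the PR description), and all names are assumptions, not anything that exists in the PR:

```java
public final class BufferBudget {
    static final long BYTES_PER_POINT = 16; // 8-byte timestamp + 8-byte value

    // Buffer only when the estimated footprint fits in a small slice of heap;
    // otherwise a caller would fall back to the sorted execution path.
    static boolean shouldBuffer(long estimatedPoints, long maxHeapBytes) {
        return estimatedPoints * BYTES_PER_POINT <= maxHeapBytes / 100; // 1% budget
    }

    public static void main(String[] args) {
        long heap = 32L * 1024 * 1024 * 1024; // assume a 32 GB heap
        System.out.println(shouldBuffer(2_000_000L, heap));   // true: ~32 MB needed
        System.out.println(shouldBuffer(100_000_000L, heap)); // false: ~1.6 GB needed
    }
}
```

In practice the estimate would come from index statistics (e.g. document counts per time range) rather than being passed in directly.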
Thanks Kostas! I have addressed all of your comments.
Our current execution path for rate aggregation, particularly for calculating counter resets, requires data in each bucket to be strictly ordered by timestamp. This necessitates a specialized execution path for time-series data, which I believe is unnecessarily complex.
I've explored an alternative model: instead of pre-sorting data in the source operator, we buffer data in each bucket and perform a merge-sort just before emitting the output. This would eliminate the need for specialized time-series code and allow us to leverage existing ES|QL optimizations.
The main downside is the memory usage for buffering rate points. Each data point requires about 16 bytes; typical queries over a few million points would use less than 100MB, but worst-case scenarios could consume up to 32GB, potentially causing circuit breaking errors.
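The stated figures check out as simple arithmetic, assuming ~16 bytes per point (an 8-byte timestamp plus an 8-byte value):

```java
public final class PointMath {
    public static void main(String[] args) {
        long bytesPerPoint = 16;
        // A few million points stays well under 100MB:
        System.out.println(6_000_000L * bytesPerPoint / 1_000_000);         // 96 (MB)
        // A worst case of two billion points reaches the 32GB figure:
        System.out.println(2_000_000_000L * bytesPerPoint / 1_000_000_000L); // 32 (GB)
    }
}
```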
We can mitigate this with the following enhancements:
Execute segments in descending max_timestamp order: By processing segments this way, the source operator can provide a "high-water mark" (the maximum timestamp that may appear in the current or subsequent segments). This allows the rate aggregator to safely flush any buffered data that is more recent than this mark, keeping the buffer size minimal or avoiding buffering altogether.
Dynamically split shards by time interval: For large time ranges with interleaved data, we can partition execution into smaller time intervals based on min and max timestamps. This limits buffer size and improves parallelism.
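The high-water-mark enhancement described above can be illustrated with a hypothetical sketch: once the source guarantees that no current or future segment contains a timestamp above the mark, every buffered point newer than the mark is final and can be flushed early. Names and structure here are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public final class HighWaterMark {
    // Removes and returns the buffered timestamps that are strictly newer
    // than the mark; nothing arriving later can interleave with them.
    static List<Long> flushNewerThan(List<Long> buffered, long mark) {
        List<Long> flushed = new ArrayList<>();
        buffered.removeIf(ts -> {
            if (ts > mark) {
                flushed.add(ts);
                return true;
            }
            return false;
        });
        return flushed;
    }

    public static void main(String[] args) {
        List<Long> buffer = new ArrayList<>(List.of(100L, 250L, 420L, 510L));
        // The source guarantees no future point has a timestamp above 300.
        List<Long> flushed = flushNewerThan(buffer, 300L);
        System.out.println(flushed); // [420, 510] -- safe to emit now
        System.out.println(buffer);  // [100, 250] -- still awaiting older data
    }
}
```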
This PR is the first step: it cuts over from the current rate implementation to the new buffered rate. The new rate still delegates to the old rate after merging the buffer.
I benchmarked this change with the query below; execution time dropped from 405ms to 270ms.
I expect further improvements with the high-water mark. Relates #134324