-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Pack dimension values in time-series aggregation #136216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
49bf9c8
to
25b5f7e
Compare
$if(BytesRef)$ | ||
public BytesRef getBytesRef(int position, BytesRef ignore) { | ||
public BytesRef getBytesRef(int position, BytesRef scratch) { | ||
scratch.bytes = value.bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will open a separate PR for this fix.
assertThat(Expressions.attribute(rate.field()).name(), equalTo("network.total_bytes_in")); | ||
LastOverTime lastSum = as(Alias.unwrap(aggsByTsid.aggregates().get(1)), LastOverTime.class); | ||
assertThat(Expressions.attribute(lastSum.field()).name(), equalTo("network.cost")); | ||
Values clusterValues = as(Alias.unwrap(aggsByTsid.aggregates().get(3)), Values.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update these tests.
return Math.max(INITIAL_SIZE_IN_BYTES, positionCount); | ||
} | ||
|
||
static BytesRefBlock packBytesValues(DriverContext driverContext, BytesRefBlock raw) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we encode multi-valued block into a single value.
} | ||
} | ||
|
||
static BytesRefBlock unpackBytesValues(DriverContext driverContext, BytesRefBlock encoded) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And decode here.
I will also verify this change with the competitive benchmark. |
c6a108c
to
d138266
Compare
7.203958127639015 | prod | [eu, us] | 2024-05-10T00:10:00.000Z | ||
6.34494062999877 | staging | us | 2024-05-10T00:10:00.000Z | ||
5.700488689624205 | prod | [eu, us] | 2024-05-10T00:20:00.000Z | ||
5.4539153439153445 | prod | [eu, us] | 2024-05-10T00:00:00.000Z |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We return arrays for grouping dimensions.
firstPassAggs.add(newFinalGroup); | ||
var valuesAgg = new Alias(g.source(), g.name(), new Values(g.source(), g)); | ||
firstPassAggs.add(valuesAgg); | ||
if (g.isDimension()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with labels, i.e. grouping attributes that are neither dimensions nor metrics? I think multi-values were not allowed in dimensions for a long time, so this may be the case for some older configs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so we want to pack not only dimensions but also labels. If that's the case, we might need a different approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks Nhat. Let's check the numbers on competitive benchmark, in case there are big surprises.
With this change, we pack the dimension into a single value before the second aggregation in time-series queries and unpack it afterward. This avoids generating permutations for multi-valued dimensions in the second aggregation, which is not desirable.
For example, the query
is rewritten as:
There is some overhead with packing and unpacking values, but we tried to isolate this behavior to time-series queries with dimension fields only. That is why I chose this approach.