-
Notifications
You must be signed in to change notification settings - Fork 25.5k
Speed up COALESCE significantly #120139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up COALESCE significantly #120139
Conversation
``` before after (operation) Score Error Score Error Units coalesce_2_noop 75.949 ± 3.961 -> 0.010 ± 0.001 ns/op 99.9% coalesce_2_eager 99.299 ± 6.959 -> 4.292 ± 0.227 ns/op 95.7% coalesce_2_lazy 113.118 ± 5.747 -> 26.746 ± 0.954 ns/op 76.4% ``` We tend to advise folks that "COALESCE is faster than CASE", but, as of 8.16.0/elastic#112295 that wasn't the true. I was working with someone a few days ago to port a scripted_metric aggregation to ESQL and we saw COALESCE taking ~60% of the time. That won't do. The trouble is that CASE and COALESCE have to be *lazy*, meaning that operations like: ``` COALESCE(a, 1 / b) ``` should never emit a warning if `a` is not `null`, even if `b` is `0`. In 8.16/elastic#112295 CASE grew an optimization where it could operate non-lazily if it was flagged as "safe". This brings a similar optimization to COALESCE, see it above as "case_2_eager", a 95.7% improvement. It also brings and arguably more important optimization - entire-block execution for COALESCE. The schort version is that, if the first parameter of COALESCE returns no nulls we can return it without doing anything lazily. There are a few more cases, but the upshot is that COALESCE is pretyt much *free* in cases where long strings of results are `null` or not `null`. That's the `coalesce_2_noop` line. Finally, when there mixed null and non-null values we were using a single builder with some fairly inefficient paths. This specializes them per type and skips some slow null-checking where possible. That's the `coalesce_2_lazy` result, a more modest 76.4%.
Pinging @elastic/es-analytical-engine (Team:Analytics) |
benchmarks/README.md
Outdated
``` | ||
|
||
Note: As of January 2025 the latest release of async profiler doesn't work | ||
with out JDK but the nightly is fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/out/our/
* {@code endExclusive} into this builder. | ||
* <p> | ||
* For single position copies see {@link IntBlockBuilder#copyFrom(IntBlock, int)}, | ||
* {@link LongBlockBuilder#copyFrom(LongBlock, int)}, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update this to explain "because it is faster".
* </p> | ||
*/ | ||
@Override | ||
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I basically just yanked this out of the regular copy code and made it easily callable for other tight loops. Folks will usually have the actual type when calling it so its going to get inlined. I think this could improve a bunch of places actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also helps that it makes you deal with scratch
if you are doing a BytesRef - if you are copying values one by one it's important to have your own scratch and not allocate one per.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
case INT -> ((IntBlockBuilder) builder).copyFrom((IntBlock) block, i); | ||
case LONG -> ((LongBlockBuilder) builder).copyFrom((LongBlock) block, i); | ||
default -> throw new IllegalArgumentException(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could intentionally didn't make a version of this thing in Block.Builder
because I:
- Want you to know if you need a scratch.
- Want you to specialize on each type in the tight loops.
case NULL -> EvalOperator.CONSTANT_NULL_FACTORY; | ||
case UNSUPPORTED, SHORT, BYTE, DATE_PERIOD, OBJECT, DOC_DATA_TYPE, SOURCE, TIME_DURATION, FLOAT, HALF_FLOAT, TSID_DATA_TYPE, | ||
SCALED_FLOAT, PARTIAL_AGG -> throw new UnsupportedOperationException("can't be coalesced"); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch
will fail to compile if you make a new type and don't include it. We intend for COALESCE to work for all types. If you are building a new type then you can stub it out as you go like you'll do with a few other evaluators.
* Inserts random non-null garbage <strong>around</strong> the expected data | ||
* and runs | ||
*/ | ||
public void testEvaluateWithGarbage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is important for catching the case where your value is null
, but the rest of the block isn't null. I had an off-by-one error in the evaluators somewhere that the standard tests weren't catching and this does.
It's not clear to me if this is worth pulling to the top level class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a bunch of format/style comments, but approved otherwise.
case "coalesce_2_noop", "coalesce_2_eager", "coalesce_2_lazy" -> { | ||
FieldAttribute f1 = longField(); | ||
FieldAttribute f2 = longField(); | ||
Expression lhs = f1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using ternary expression here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that makes this one harder to read. The if
statement screams "wrap in this case" to me.
).get(driverContext); | ||
String desc = operation.endsWith("lazy") ? "CaseLazyEvaluator" : "CaseEagerEvaluator"; | ||
if (evaluator.toString().contains(desc) == false) { | ||
throw new IllegalArgumentException("Evaluator was [" + evaluator + "] but expected one containing [" + desc + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using Strings.format
for these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find that harder to read these days.
Builder copyFrom($Type$Block block, int beginInclusive, int endExclusive); | ||
|
||
/** | ||
* Copy the values in {@code block} at {@code position}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this copy values
or value
(since you only give it a single position). Also, consider adding a note here about the usage of scratch
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a bigger comment. It copies all of the values at this position. If it has one value, it'll copy that. If this has many values, it'll copy that. If this is null
, it'll copy that.
* </p> | ||
*/ | ||
@Override | ||
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
|
||
@Override | ||
public String toString() { | ||
return "ConstantNull"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps extract this string to a constant, so it's clear it matches the one below it.
CoalesceBytesRefEvaluator.toEvaluator(toEvaluator, children()); | ||
case NULL -> EvalOperator.CONSTANT_NULL_FACTORY; | ||
case UNSUPPORTED, SHORT, BYTE, DATE_PERIOD, OBJECT, DOC_DATA_TYPE, SOURCE, TIME_DURATION, FLOAT, HALF_FLOAT, TSID_DATA_TYPE, | ||
SCALED_FLOAT, PARTIAL_AGG -> throw new UnsupportedOperationException("can't be coalesced"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the type to the exception message.
try (Block.Builder builder = elementType.newBlockBuilder(positions, context.blockFactory())) { | ||
for (int p = 0; p < positions; p++) { | ||
if (p == realPosition) { | ||
builder.copyFrom(onePositionPage.getBlock(b), 0, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be extracted to builder.copyFrom(p == realPosition ? something : somethingElse, 0, 1);
.
The first argument can also be extract to a local variable which will probably make the code take fewer lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll flip it some, yeah. fewer lines doesn't really matter here, but I think it'll be easier to read with the first argument pulled to a variable.
} | ||
|
||
/** | ||
* Inserts random non-null garbage <strong>around</strong> the expected data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc doesn't not need to be 2 lines.
Block[] manyPositionsBlocks = new Block[Math.toIntExact(data.stream().filter(d -> d.isForceLiteral() == false).count())]; | ||
int realPosition = between(0, positions - 1); | ||
try { | ||
int b = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename b
to index
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming. b
means to me "the index variable for the array who's name stats with b
", but of course, the array didn't start with b
. I'll rename.
if (p == realPosition) { | ||
builder.copyFrom(onePositionPage.getBlock(b), 0, 1); | ||
} else { | ||
builder.copyFrom( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's way too much nesting here! Perhaps extracting this to a helper function would help readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hadn't pushed me to my nesting limit, but if it has you I can flip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing. I wonder in how many places, we can do something like this. Something to be looking for for sure
* </p> | ||
*/ | ||
@Override | ||
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be final? It looks like a dangerous performance regression if it was overridden at some point (I think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class itself is final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhm, we don't have this project generated-src
in .gitattributes
, can you add it? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤘
Mostly CASE. Otherwise we get this by using normal evaluators most of the time. |
``` before after (operation) Score Error Score Error Units coalesce_2_noop 75.949 ± 3.961 -> 0.010 ± 0.001 ns/op 99.9% coalesce_2_eager 99.299 ± 6.959 -> 4.292 ± 0.227 ns/op 95.7% coalesce_2_lazy 113.118 ± 5.747 -> 26.746 ± 0.954 ns/op 76.4% ``` We tend to advise folks that "COALESCE is faster than CASE", but, as of 8.16.0/elastic#112295 that wasn't the true. I was working with someone a few days ago to port a scripted_metric aggregation to ESQL and we saw COALESCE taking ~60% of the time. That won't do. The trouble is that CASE and COALESCE have to be *lazy*, meaning that operations like: ``` COALESCE(a, 1 / b) ``` should never emit a warning if `a` is not `null`, even if `b` is `0`. In 8.16/elastic#112295 CASE grew an optimization where it could operate non-lazily if it was flagged as "safe". This brings a similar optimization to COALESCE, see it above as "case_2_eager", a 95.7% improvement. It also brings and arguably more important optimization - entire-block execution for COALESCE. The schort version is that, if the first parameter of COALESCE returns no nulls we can return it without doing anything lazily. There are a few more cases, but the upshot is that COALESCE is pretty much *free* in cases where long strings of results are `null` or not `null`. That's the `coalesce_2_noop` line. Finally, when there mixed null and non-null values we were using a single builder with some fairly inefficient paths. This specializes them per type and skips some slow null-checking where possible. That's the `coalesce_2_lazy` result, a more modest 76.4%. NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like `+`. If COALESCE isn't taking a *ton* time in your query don't get particularly excited about this. It's fun though. Closes elastic#119953
💚 Backport successful
|
``` before after (operation) Score Error Score Error Units coalesce_2_noop 75.949 ± 3.961 -> 0.010 ± 0.001 ns/op 99.9% coalesce_2_eager 99.299 ± 6.959 -> 4.292 ± 0.227 ns/op 95.7% coalesce_2_lazy 113.118 ± 5.747 -> 26.746 ± 0.954 ns/op 76.4% ``` We tend to advise folks that "COALESCE is faster than CASE", but, as of 8.16.0/#112295 that wasn't the true. I was working with someone a few days ago to port a scripted_metric aggregation to ESQL and we saw COALESCE taking ~60% of the time. That won't do. The trouble is that CASE and COALESCE have to be *lazy*, meaning that operations like: ``` COALESCE(a, 1 / b) ``` should never emit a warning if `a` is not `null`, even if `b` is `0`. In 8.16/#112295 CASE grew an optimization where it could operate non-lazily if it was flagged as "safe". This brings a similar optimization to COALESCE, see it above as "case_2_eager", a 95.7% improvement. It also brings and arguably more important optimization - entire-block execution for COALESCE. The schort version is that, if the first parameter of COALESCE returns no nulls we can return it without doing anything lazily. There are a few more cases, but the upshot is that COALESCE is pretty much *free* in cases where long strings of results are `null` or not `null`. That's the `coalesce_2_noop` line. Finally, when there mixed null and non-null values we were using a single builder with some fairly inefficient paths. This specializes them per type and skips some slow null-checking where possible. That's the `coalesce_2_lazy` result, a more modest 76.4%. NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like `+`. If COALESCE isn't taking a *ton* time in your query don't get particularly excited about this. It's fun though. Closes #119953
We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/#112295 that wasn't the true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.
The trouble is that CASE and COALESCE have to be lazy, meaning that
operations like:
should never emit a warning if
a
is notnull
, even ifb
is0
. In8.16/#112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as "case_2_eager", a 95.7% improvement.
It also brings and arguably more important optimization - entire-block
execution for COALESCE. The schort version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretyt much free in cases where long strings of results
are
null
or notnull
. That's thecoalesce_2_noop
line.Finally, when there mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
coalesce_2_lazy
result, a more modest 76.4%.NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like
+
. If COALESCE isn't taking a ton time in your query don't get particularly excited about this. It's fun though.Closes #119953