Skip to content

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Jan 14, 2025

                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/#112295 that wasn't the true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be lazy, meaning that
operations like:

COALESCE(a, 1 / b)

should never emit a warning if a is not null, even if b is 0. In
8.16/#112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as "case_2_eager", a 95.7% improvement.

It also brings and arguably more important optimization - entire-block
execution for COALESCE. The schort version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretyt much free in cases where long strings of results
are null or not null. That's the coalesce_2_noop line.

Finally, when there mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
coalesce_2_lazy result, a more modest 76.4%.

NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like +. If COALESCE isn't taking a ton time in your query don't get particularly excited about this. It's fun though.

Closes #119953

```
                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%
```

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/elastic#112295 that wasn't the true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be *lazy*, meaning that
operations like:
```
COALESCE(a, 1 / b)
```
should never emit a warning if `a` is not `null`, even if `b` is `0`. In
8.16/elastic#112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as "case_2_eager", a 95.7% improvement.

It also brings and arguably more important optimization - entire-block
execution for COALESCE. The schort version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretyt much *free* in cases where long strings of results
are `null` or not `null`. That's the `coalesce_2_noop` line.

Finally, when there mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
`coalesce_2_lazy` result, a more modest 76.4%.
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

```

Note: As of January 2025 the latest release of async profiler doesn't work
with out JDK but the nightly is fine.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/out/our/

* {@code endExclusive} into this builder.
* <p>
* For single position copies see {@link IntBlockBuilder#copyFrom(IntBlock, int)},
* {@link LongBlockBuilder#copyFrom(LongBlock, int)}, etc.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update this to explain "because it is faster".

* </p>
*/
@Override
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically just yanked this out of the regular copy code and made it easily callable for other tight loops. Folks will usually have the actual type when calling it so its going to get inlined. I think this could improve a bunch of places actually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also helps that it makes you deal with scratch if you are doing a BytesRef - if you are copying values one by one it's important to have your own scratch and not allocate one per.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

case INT -> ((IntBlockBuilder) builder).copyFrom((IntBlock) block, i);
case LONG -> ((LongBlockBuilder) builder).copyFrom((LongBlock) block, i);
default -> throw new IllegalArgumentException();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could intentionally didn't make a version of this thing in Block.Builder because I:

  1. Want you to know if you need a scratch.
  2. Want you to specialize on each type in the tight loops.

case NULL -> EvalOperator.CONSTANT_NULL_FACTORY;
case UNSUPPORTED, SHORT, BYTE, DATE_PERIOD, OBJECT, DOC_DATA_TYPE, SOURCE, TIME_DURATION, FLOAT, HALF_FLOAT, TSID_DATA_TYPE,
SCALED_FLOAT, PARTIAL_AGG -> throw new UnsupportedOperationException("can't be coalesced");
};
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch will fail to compile if you make a new type and don't include it. We intend for COALESCE to work for all types. If you are building a new type then you can stub it out as you go like you'll do with a few other evaluators.

* Inserts random non-null garbage <strong>around</strong> the expected data
* and runs
*/
public void testEvaluateWithGarbage() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is important for catching the case where your value is null, but the rest of the block isn't null. I had an off-by-one error in the evaluators somewhere that the standard tests weren't catching and this does.

It's not clear to me if this is worth pulling to the top level class.

@nik9000 nik9000 added the auto-backport Automatically create backport pull requests when merged label Jan 16, 2025
Copy link
Contributor

@GalLalouche GalLalouche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bunch of format/style comments, but approved otherwise.

case "coalesce_2_noop", "coalesce_2_eager", "coalesce_2_lazy" -> {
FieldAttribute f1 = longField();
FieldAttribute f2 = longField();
Expression lhs = f1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using ternary expression here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes this one harder to read. The if statement screams "wrap in this case" to me.

).get(driverContext);
String desc = operation.endsWith("lazy") ? "CaseLazyEvaluator" : "CaseEagerEvaluator";
if (evaluator.toString().contains(desc) == false) {
throw new IllegalArgumentException("Evaluator was [" + evaluator + "] but expected one containing [" + desc + "]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using Strings.format for these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that harder to read these days.

Builder copyFrom($Type$Block block, int beginInclusive, int endExclusive);

/**
* Copy the values in {@code block} at {@code position}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this copy values or value (since you only give it a single position). Also, consider adding a note here about the usage of scratch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a bigger comment. It copies all of the values at this position. If it has one value, it'll copy that. If this has many values, it'll copy that. If this is null, it'll copy that.

* </p>
*/
@Override
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.


@Override
public String toString() {
return "ConstantNull";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps extract this string to a constant, so it's clear it matches the one below it.

CoalesceBytesRefEvaluator.toEvaluator(toEvaluator, children());
case NULL -> EvalOperator.CONSTANT_NULL_FACTORY;
case UNSUPPORTED, SHORT, BYTE, DATE_PERIOD, OBJECT, DOC_DATA_TYPE, SOURCE, TIME_DURATION, FLOAT, HALF_FLOAT, TSID_DATA_TYPE,
SCALED_FLOAT, PARTIAL_AGG -> throw new UnsupportedOperationException("can't be coalesced");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the type to the exception message.

try (Block.Builder builder = elementType.newBlockBuilder(positions, context.blockFactory())) {
for (int p = 0; p < positions; p++) {
if (p == realPosition) {
builder.copyFrom(onePositionPage.getBlock(b), 0, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be extracted to builder.copyFrom(p == realPosition ? something : somethingElse, 0, 1);.
The first argument can also be extract to a local variable which will probably make the code take fewer lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll flip it some, yeah. fewer lines doesn't really matter here, but I think it'll be easier to read with the first argument pulled to a variable.

}

/**
* Inserts random non-null garbage <strong>around</strong> the expected data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc doesn't not need to be 2 lines.

Block[] manyPositionsBlocks = new Block[Math.toIntExact(data.stream().filter(d -> d.isForceLiteral() == false).count())];
int realPosition = between(0, positions - 1);
try {
int b = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename b to index or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming. b means to me "the index variable for the array who's name stats with b", but of course, the array didn't start with b. I'll rename.

if (p == realPosition) {
builder.copyFrom(onePositionPage.getBlock(b), 0, 1);
} else {
builder.copyFrom(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's way too much nesting here! Perhaps extracting this to a helper function would help readability?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hadn't pushed me to my nesting limit, but if it has you I can flip it.

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing. I wonder in how many places, we can do something like this. Something to be looking for for sure

* </p>
*/
@Override
public $Type$BlockBuilder copyFrom($Type$Block block, int position$if(BytesRef)$, BytesRef scratch$endif$) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be final? It looks like a dangerous performance regression if it was overridden at some point (I think?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class itself is final.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, we don't have this project generated-src in .gitattributes, can you add it? 👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤘

@nik9000
Copy link
Member Author

nik9000 commented Jan 23, 2025

Amazing. I wonder in how many places, we can do something like this. Something to be looking for for sure

Mostly CASE. Otherwise we get this by using normal evaluators most of the time.

@nik9000 nik9000 enabled auto-merge (squash) January 23, 2025 17:09
@nik9000 nik9000 merged commit dc4fa26 into elastic:main Jan 23, 2025
16 checks passed
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jan 23, 2025
```
                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%
```

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/elastic#112295 that wasn't the true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be *lazy*, meaning that
operations like:
```
COALESCE(a, 1 / b)
```
should never emit a warning if `a` is not `null`, even if `b` is `0`. In
8.16/elastic#112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as "case_2_eager", a 95.7% improvement.

It also brings and arguably more important optimization - entire-block
execution for COALESCE. The schort version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretty much *free* in cases where long strings of results
are `null` or not `null`. That's the `coalesce_2_noop` line.

Finally, when there mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
`coalesce_2_lazy` result, a more modest 76.4%.

NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like `+`. If COALESCE isn't taking a *ton* time in your query don't get particularly excited about this. It's fun though.

Closes elastic#119953
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Jan 23, 2025
```
                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%
```

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/#112295 that wasn't the true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be *lazy*, meaning that
operations like:
```
COALESCE(a, 1 / b)
```
should never emit a warning if `a` is not `null`, even if `b` is `0`. In
8.16/#112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as "case_2_eager", a 95.7% improvement.

It also brings and arguably more important optimization - entire-block
execution for COALESCE. The schort version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretty much *free* in cases where long strings of results
are `null` or not `null`. That's the `coalesce_2_noop` line.

Finally, when there mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
`coalesce_2_lazy` result, a more modest 76.4%.

NOTE: These %s of improvements on COALESCE itself, or COALESCE with some load-overhead operators like `+`. If COALESCE isn't taking a *ton* time in your query don't get particularly excited about this. It's fun though.

Closes #119953
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Port CaseEagerEvaluator style optimization to COALESCE`
4 participants