Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Oct 30, 2025

Speeds up queries like

FROM foo
| STATS SUM(LENGTH(field))

by fusing the LENGTH into the loading of the field if it has doc values. Running a fairly simple test:
https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091 I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.

More importantly, this makes the mechanism for fusing functions into field loading generic. All you have to do is implement BlockLoaderExpression on your expression and return non-null from tryFuse.

Speeds up queries like
```
FROM foo
| STATS SUM(LENGTH(field))
```
by fusing the `LENGTH` into the loading of the `field` if it has doc
values. Running a fairly simple test:
https://gist.github.com/nik9000/9dac067f8ce29875a4fb0f0359a75091
I'm seeing that query drop from 48ms to 28ms. So, like, 40% faster.

More importantly, this makes the mechanism for fusing functions into
field loading generic. All you have to do is implement
`BlockLoaderExpression` on your expression and return non-null from
`tryFuse`.
@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

* "fusing" the expression into the load. Or null if the fusion isn't possible.
*/
@Nullable
Fuse tryFuse(SearchStats stats);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to find another name - we already have Fuse as a command. ExpressionFieldLoader?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is FusedExpression ok? Or still too indicative?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming... 😅

I come from staring at FUSE enough that it carries a lot of weight.

For me, this feature involves BlockLoaders. And Expressions that are applied to them. I understand that fuse means getting together those two, but it's not something I would think of immediately without more context.

I'd prefer to be overly explicit here, and call this BlockLoaderExpression or something similar that helps me bridge those two concepts together. But, naming...

BlockLoaderExpression.Fuse fuse
) {
// Only replace if exactly one side is a literal and the other a field attribute
if ((similarityFunction.left() instanceof Literal ^ similarityFunction.right() instanceof Literal) == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's much better to let the Expression deal with the details and make this generic 👍

*/
public boolean pushable() {
return true;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bothers me. I needed this because without it we'd try to push this:

FROM foo
| WHERE LENGTH(kwd) < 10

to the index. Now, we might be able to do that with a specialized lucene query. But we don't have one of those. Without those change instead what happens is:

  1. LENGTH(kwd) becomes $$kwd$length$hash$.
  2. We identify $$kwd$length$hash$ < 10 as pushable.

This tells us we can't push it. But it's kind of picky. If SearchStats took EsField it could check this easy enough. That might be a good solution to this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MultiTypeEsField is created with aggregatable=false, so that predicates on it don't get pushed down incorrectly.

Adding pushable should also work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding pushable should also work.

I'm going to see if I can do aggregatable=false

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just setting aggregatable to false doesn't do it. But I can return false from getExactInfo which seems to do the trick. I'm not entirely sure it's the best solution, but it doesn't invent a new thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But! I'm not sure that's right either. exact seems to be a concept we use at type resolution time - but I'm not sure why. It's a left-over from old QL that had a more useful meaning there.

I wonder if it'd be better to keep pushable and maybe rename to existsInEsIndex or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've flipped this to using exact and that does seem to work. Not sure if I like it more.

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Oct 31, 2025
Adds special purpose `BlockLoader` implementations for the `MV_MIN` and
`MV_MAX` functions for `keyword` fields with doc values. These are a
noop for single valued keywords but should be *much* faster for
multivalued keywords.

These aren't plugged in yet. We can plug them in and performance test
them in elastic#137382. And they give us two more functions we can use to
demonstrate elastic#137382.
}

public void testLengthInWhereAndEval() {
assumeFalse("fix me", true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QL friends: This one looks fun!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that we get duplicated reference attributes here is that when PushExpressionsToFieldLoad creates a new FunctionEsField in EsRelation, it was generated under a specific command context, and it doesn't look at the the whole query plan level. So when the same LENGTH(last_name) is referenced in multiple commands in the query, duplicated FunctionEsFields are added into EsRelation.

ResolveUnionTypes has a very similar workflow. It iterates through the entire query plan to prepare the attributes added into EsRelation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, I'm rewriting this to look more like ResolveUnionTypes in #137392

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, I'm rewriting this to look more like ResolveUnionTypes in #137392

Should I wait for you to do that rewrite before merging this PR? Or will should I merge first and then you'll fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you! I'm addressing in #137564, but it still has to be reviewed. Feel free to merge this and I'll deal with integrating it.

@nik9000 nik9000 marked this pull request as ready for review October 31, 2025 19:52
@nik9000 nik9000 requested a review from carlosdelest October 31, 2025 19:52
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 31, 2025
}

public void testLengthInWhereAndEval() {
assumeFalse("fix me", true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that we get duplicated reference attributes here is that when PushExpressionsToFieldLoad creates a new FunctionEsField in EsRelation, it was generated under a specific command context, and it doesn't look at the the whole query plan level. So when the same LENGTH(last_name) is referenced in multiple commands in the query, duplicated FunctionEsFields are added into EsRelation.

ResolveUnionTypes has a very similar workflow. It iterates through the entire query plan to prepare the attributes added into EsRelation


public void testLengthInWhereAndEval() {
assumeFalse("fix me", true);
assumeTrue("requires similarity functions", EsqlCapabilities.Cap.VECTOR_SIMILARITY_FUNCTIONS_PUSHDOWN.isEnabled());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems work! That is an old capability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capability came from the first PR with this. I figured there isn't really any externally visible thing for this so it's safe to reuse.

);
var name = rawTemporaryName(fieldAttr.name(), similarityFunction.nodeName(), String.valueOf(arrayHashCode));
FunctionEsField functionEsField = new FunctionEsField(fuse.field().field(), e.dataType(), fuse.config());
var name = rawTemporaryName(fuse.field().name(), fuse.config().name(), String.valueOf(fuse.config().hashCode()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason that we need a hash code in the new attribute's name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original assumption is that we needed the name to be the same for attributes that had the same field name and similarity vector, so I originally included the vector hash as part of the name.

Alex told me that this is not necessary, so I'll remove that in #137392.

@nik9000
Copy link
Member Author

nik9000 commented Nov 4, 2025

If there's enough values to trigger #137217 (comment) then it's slower. So, as always, @dnhatn was right. I'll make the sort, iter, uniq bit he suggested.

#137586

nik9000 added a commit that referenced this pull request Nov 4, 2025
Adds special purpose `BlockLoader` implementations for the `MV_MIN` and
`MV_MAX` functions for `keyword` fields with doc values. These are a
noop for single valued keywords but should be *much* faster for
multivalued keywords.

These aren't plugged in yet. We can plug them in and performance test
them in #137382. And they give us two more functions we can use to
demonstrate #137382.
@nik9000
Copy link
Member Author

nik9000 commented Nov 4, 2025

My integration test found that we're pushing to fields a little too aggressively. I'm adding an extra layer of "can the mapper do it" checks.

return new BytesRefsFromOrdsBlockLoader(name());
return switch (blContext.blockLoaderFunctionConfig()) {
case null -> new BytesRefsFromOrdsBlockLoader(name());
case BlockLoaderFunctionConfig.Named named -> switch (named.name()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have an enum for this? We should avoid using strings. If too hard, at least make it a public static final String.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't really a limit to what we might push. We could make an enum in server with everything which would be ok, but there will be cases where the field that supports the push is outside of server too. So the only change in server would be the enum. That feels... wrong. Not sure it's worse than using a strong though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents us adding to the enum every time we push something new?
I thought this is not about the field, but about the name of the Function we are pushing? In this case the function name is LENGTH.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do if you like the idea. I think we'll want to switch it away from an enum - but it can wait until it's obvious. And maybe it'll just stay an enum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be OK with public static final String too, especially since you use it in two different places in this same file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, can't you just ask the blContext.blockLoaderFunctionConfig() to give you the appropriate BlockLoader and use inheritance? Then you don't even need case statement.

Copy link
Contributor

@julian-elastic julian-elastic Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to generalize this. If we support 50 pushable functions in the future, do we really want a case statement with 50 cases here? Probably no. So inheritance seems best

@Override
public boolean supportsBlockLoaderConfig(BlockLoaderFunctionConfig config, FieldExtractPreference preference) {
if (hasDocValues() && (preference != FieldExtractPreference.STORED || isSyntheticSourceEnabled())) {
return config.name().equals("LENGTH");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String again ...

* should work, but we need this check here to prevent actually pushing
* to a LOOKUP JOIN. If the field comes from a LOOKUP JOIN then it'll
* show up as missing here. And we can't push to those fields. Yet.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you handle the case where a field with the same name is both in the main index and the lookup index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Badly, I bet.

);
if (EsqlCapabilities.Cap.VECTOR_SIMILARITY_FUNCTIONS_PUSHDOWN.isEnabled()) {
rules.add(new PushDownVectorSimilarityFunctions());
rules.add(new PushExpressionsToFieldLoad());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason for this to be a logical rule? I think it might make more sense to put in in the LocalPhysicalPlanOptimizer.
We have a bunch of other rules there that push to source, and this seems to be very low level source specific optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @alex-spies had talked about this being logical. I'll do whatever y'all want.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alex-spies What is the reasoning for this being a logical rule?

FieldAttribute lAttr = as(as(eval.fields().getFirst(), Alias.class).child(), FieldAttribute.class);
var limit = as(eval.child(), Limit.class);
var relation = as(limit.child(), EsRelation.class);
assertTrue(relation.output().contains(lAttr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how do we know that anything got pushed down here?
Can we check that the attribute in the relation contains the Length? If not possible at least check that lAttr contains "LENGTH" in the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I left that bit out. Important part of the test. Will add it.

assertThat(as(sum.field(), ReferenceAttribute.class).id(), equalTo(lAlias.id()));
assertThat(as(count.field(), ReferenceAttribute.class).id(), equalTo(lAlias.id()));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some cases with lookup join

FROM test
            | EVAL l = LENGTH(last_name)
            | LOOKUP JOIN table1 on field2
            | EVAL from_lookup = LENGTH(from_lookup)
            | KEEP l, from_lookup
            """;

Then add a more advanced case where from_lookup is actually in the main index too, but it gets shadowed with the field from the lookup table

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on lookup join.

I expect we'll want to have a more "obvious" way to filter this out of lookup join.


assertThat(as(sum.field(), ReferenceAttribute.class).id(), equalTo(lAlias.id()));
assertThat(as(count.field(), ReferenceAttribute.class).id(), equalTo(lAlias.id()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also consider adding some cases where we use the same pushdown in many different ways

        FROM test
        | EVAL a1 = length(last_name), a2 = length(last_name), a3 = length(last_name), a4 = abs( length(last_name)) + a1 + length(first_name)*3
        | WHERE a1 > 1 and  length(last_name) > 1
        | STATS l = SUM(LENGTH(last_name)) + AVG(a3)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that's going to be the same as the testLengthInWhereAndEval case. But I can add that too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I made it more interesting :)
Also we are pushing 2 different attributes.


record Named(String name, Warnings warnings) implements BlockLoaderFunctionConfig {
@Override
public int hashCode() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to not compare warnings in equals and hashcode?
Maybe leave a comment as to why not compare them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They are mutable and not a key.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking if I need equals on this thing at all. I think it's enough to use the function enum I'm going to build.

@julian-elastic
Copy link
Contributor

I am done with my first round of code review, overall looks pretty good!
I think we should also add some csv cases to actually verify the data is correct with the push down.
And maybe a nightly performance test for both length and dense vector to demonstrate the improvement.

var relation = as(eval2.child(), EsRelation.class);
assertTrue(relation.output().contains(lAttr));

assertThat(as(sum.field(), ReferenceAttribute.class).id(), equalTo(lAlias.id()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we see if this works if there are some renames under?

FROM test 
| eval last_name1 = last_name
| eval last_name2 = last_name1
| eval last_name3 = last_name2
| EVAL a1 = length(last_name3)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants