Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented Jul 1, 2025

Speeds up ESQL's == on text fields by building a lucene query to find candidates.

In ESQL a == "words cat words" on text fields means "a has exact the text 'words cat words'". We don't have a lucene query to matches this exactly but we can get it to make candidate documents that might match by running the query:

"match": { "a": "words cat words" }

That'll produce all the documents with the terms words and cat. We can then run those documents through our compute engine's normal code to check == - it'll load the text fields and run the real equality check.

The other option - the thing we do without this PR - is to push nothing to lucene for this ==. If there's other filters we'll push them. And, sometimes, those are plenty! But in many cases we get a pretty big boost out of this candidates strategy.

Quick and dirty best case testing yields numbers like:

    text eq {"took": 17,"documents_found":   1000}
not text eq {"took":777,"documents_found":1000000}

The took there is milliseconds. The queries are FROM test | WHERE text == \"words words 1\" | STATS COUNT(*) (optimized by this PR) and FROM test | WHERE NOT text == \"words words 1\" | STATS COUNT(*) (unchanged) without a time filter. The documents_found shows that the == comparison only needs to scan load 1000 documents but NOT == needs to scan all of the million documents I put in the index.

Relates to #128529

NOTE: This won't work for all text fields. There's an unbelievable amount of configuration possible on these fields and it's quite possible for the "match": { "a": "words cat words" } query not to generate candidate for ==. The easiest way to do this is to configure the index_analyzer and search_analyzer in incompatible ways. But i'm sure there are tons of ways!

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

return index.getValue() == Boolean.TRUE
&& analyzers.indexAnalyzer.isConfigured() == false
&& analyzers.searchAnalyzer.isConfigured() == false;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

==SEARCH FOLKS LOOK HERE==

Is this paranoid enough? Like, could you break the match query with other settings I'm not checking? I'd love to default this to only working if there's nothing configured on the text field and then opt in configurations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is paranoid indeed. Just checking if the analyzer are the same is not enough since we have morphological analyzer that can tokenize differently depending on the surrounding text (Kuromoji, Nori, ..).
We could have a list of allowed analyzers but that sounds like a good start that will handle 99% of the use cases.
I can't think of another setting that would change this, are we matching the input as phrase query or as a conjunction of term?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conjunction of terms right now. I probably should do phrase to be honest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth building a field setting that overrides our choices? A true/false/null thing - true means "I know match is fine, push", false means "I know match isn't fine, never push" and null means "you figure it out".

since we have morphological analyzer that can tokenize differently depending on the surrounding text (Kuromoji, Nori, ..)

Can you give an example? So long as the same text tokenizes the same way that should be fine. We're not pushing "contains tokens" here - we're pushing exat string ==.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example? So long as the same text tokenizes the same way that should be fine. We're not pushing "contains tokens" here - we're pushing exat string ==.

Ah right sorry, the input must be exactly the same. So then another test, to remove the analyzer restriction, would be to use a memory index and verify that the search analyzer query can find the input indexed with the index analyzer?

Do you think it's worth building a field setting that overrides our choices?

I wonder if that should be pushed down to mappings directly. We have termQuery, phraseQuery defined by the field mappers now so we could delegate the decision there with an exactQuery? That would help with fields like match_only_text to avoid doing two verifications on the content?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ESQL might indeed deserve some more query methods on MappedFieldType. I've been adding methods to, like, TextFieldMapper that help me decide what query to build. But I think it'd be cleaner to put these into the MappedFieldType. Also, it'd be more possible for, well, more fields to link into ESQL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey, I follow high-level. Sounds like it would be cleaner to add the query generation to MappedFieldType. I am not quite sure what the default generated query would be though, but that's an implementation detail. I am also not clear on how that same method may be used in match only text, not familiar enough with it, but it all sounds like a good idea.

String name = handler.nameOf(lhs);
if (pushdownPredicates.canUseEqualityOnSyntheticSourceDelegate(lhs, rhs)) {
return new SingleValueQuery(new EqualsSyntheticSourceDelegate(source(), name, rhs), name, true);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

==SEARCH FOLKS LOOK HERE==

The if statement blow makes that candidate matching query.

@nik9000 nik9000 removed the v9.1.1 label Jul 1, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants