-
Notifications
You must be signed in to change notification settings - Fork 25.7k
ESQL: Push == to text fields to lucene
#130387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hi @nik9000, I've created a changelog YAML for you. |
| return index.getValue() == Boolean.TRUE | ||
| && analyzers.indexAnalyzer.isConfigured() == false | ||
| && analyzers.searchAnalyzer.isConfigured() == false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
==SEARCH FOLKS LOOK HERE==
Is this paranoid enough? Like, could you break the match query with other settings I'm not checking? I'd love to default this to only working if there's nothing configured on the text field and then opt in configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is paranoid indeed. Just checking if the analyzer are the same is not enough since we have morphological analyzer that can tokenize differently depending on the surrounding text (Kuromoji, Nori, ..).
We could have a list of allowed analyzers but that sounds like a good start that will handle 99% of the use cases.
I can't think of another setting that would change this, are we matching the input as phrase query or as a conjunction of term?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conjunction of terms right now. I probably should do phrase to be honest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth building a field setting that overrides our choices? A true/false/null thing - true means "I know match is fine, push", false means "I know match isn't fine, never push" and null means "you figure it out".
since we have morphological analyzer that can tokenize differently depending on the surrounding text (Kuromoji, Nori, ..)
Can you give an example? So long as the same text tokenizes the same way that should be fine. We're not pushing "contains tokens" here - we're pushing exat string ==.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example? So long as the same text tokenizes the same way that should be fine. We're not pushing "contains tokens" here - we're pushing exat string ==.
Ah right sorry, the input must be exactly the same. So then another test, to remove the analyzer restriction, would be to use a memory index and verify that the search analyzer query can find the input indexed with the index analyzer?
Do you think it's worth building a field setting that overrides our choices?
I wonder if that should be pushed down to mappings directly. We have termQuery, phraseQuery defined by the field mappers now so we could delegate the decision there with an exactQuery? That would help with fields like match_only_text to avoid doing two verifications on the content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @javanna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ESQL might indeed deserve some more query methods on MappedFieldType. I've been adding methods to, like, TextFieldMapper that help me decide what query to build. But I think it'd be cleaner to put these into the MappedFieldType. Also, it'd be more possible for, well, more fields to link into ESQL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey, I follow high-level. Sounds like it would be cleaner to add the query generation to MappedFieldType. I am not quite sure what the default generated query would be though, but that's an implementation detail. I am also not clear on how that same method may be used in match only text, not familiar enough with it, but it all sounds like a good idea.
| String name = handler.nameOf(lhs); | ||
| if (pushdownPredicates.canUseEqualityOnSyntheticSourceDelegate(lhs, rhs)) { | ||
| return new SingleValueQuery(new EqualsSyntheticSourceDelegate(source(), name, rhs), name, true); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
==SEARCH FOLKS LOOK HERE==
The if statement blow makes that candidate matching query.
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Speeds up ESQL's
==ontextfields by building a lucene query to find candidates.In ESQL
a == "words cat words"ontextfields means "ahas exact the text 'words cat words'". We don't have a lucene query to matches this exactly but we can get it to make candidate documents that might match by running the query:That'll produce all the documents with the terms
wordsandcat. We can then run those documents through our compute engine's normal code to check==- it'll load thetextfields and run the real equality check.The other option - the thing we do without this PR - is to push nothing to lucene for this
==. If there's other filters we'll push them. And, sometimes, those are plenty! But in many cases we get a pretty big boost out of this candidates strategy.Quick and dirty best case testing yields numbers like:
The
tookthere is milliseconds. The queries areFROM test | WHERE text == \"words words 1\" | STATS COUNT(*)(optimized by this PR) andFROM test | WHERE NOT text == \"words words 1\" | STATS COUNT(*)(unchanged) without a time filter. Thedocuments_foundshows that the==comparison only needs to scan load 1000 documents butNOT ==needs to scan all of the million documents I put in the index.Relates to #128529
NOTE: This won't work for all
textfields. There's an unbelievable amount of configuration possible on these fields and it's quite possible for the"match": { "a": "words cat words" }query not to generate candidate for==. The easiest way to do this is to configure theindex_analyzerandsearch_analyzerin incompatible ways. But i'm sure there are tons of ways!