-
Notifications
You must be signed in to change notification settings - Fork 25.6k
ES|QL: Add initial grammar and planning for RRF (snapshot) #123396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Hi @ioanatia, I've created a changelog YAML for you. |
ChrisHegarty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this change look very concise and clean. I left a few small comments.
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/RrfScoreEvalOperator.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java
Show resolved
Hide resolved
tteofili
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very clean impl, LGTM!
|
|
||
| int rank = counters.getOrDefault(fork, 1); | ||
| counters.put(fork, rank + 1); | ||
| scores.appendDouble(1.0 / (60 + rank)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: this is currently configurable in _search, so we probably need to expose it as an option in the future here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, we need to make the rank constant configurable.
This is added as a separate feature in the meta issue #123391
It will require a syntax change for RRF, so I'd like to keep it separate for now.
...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java
Outdated
Show resolved
Hide resolved
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good - I like the separation into individual pieces (dedup, operator, order).
I have some minor questions, and some error messages I think could be better
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/RrfScoreEvalOperator.java
Outdated
Show resolved
Hide resolved
| from test | ||
| | rrf | ||
| """)); | ||
| assertThat(e.getMessage(), containsString("Unknown column [_score]")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should provide a better error message for missing metadata attrs- something like "_score is needed for using RRF. Please add METADATA _score to your FROM command".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this - looks simple enough at a first glance. We can just modify this to have a custom error message when MetadataAttribute.isSupported(name) is true:
Lines 113 to 121 in a5e0423
| public static String errorMessage(String name, List<String> potentialMatches) { | |
| String msg = "Unknown column [" + name + "]"; | |
| if (CollectionUtils.isEmpty(potentialMatches) == false) { | |
| msg += ", did you mean " | |
| + (potentialMatches.size() == 1 ? "[" + potentialMatches.get(0) + "]" : "any of " + potentialMatches.toString()) | |
| + "?"; | |
| } | |
| return msg; | |
| } |
However we would return an error message like "Please add METADATA _score to your FROM command" even if you use ROW:
ROW a = 1, b = "two", c = null
| WHERE _score > 1
I know this is a very narrow corner case, but it would be an unintended behaviour.
It's not straighforward to get the context when we call UnresolvedAttribute.errorMessage whether the source command supports metadata attributes or not. So I think at most, we can look into this separately and not make the change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be OK to error with "_score is needed for using RRF. Use FROM ... METADATA _score".
We can assume that full text search needs FROM, as FTFs need an index attribute to operate on?
We can refine this in a follow up, but it will be very confusing for users to receive "unknown column _score" - being a metadata attribute means users won't understand where's that coming from without referring to docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed - added as a follow up in #123391
| public void testRrfError() { | ||
| assumeTrue("requires RRF capability", EsqlCapabilities.Cap.FORK.isEnabled()); | ||
|
|
||
| var e = expectThrows(VerificationException.class, () -> analyze(""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't a explicit message like "FORK is needed before RRF" be added here so users have a clear understanding of the RRF usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added that in RrfScoreEval by implementing PostAnalysisVerificationAware.
However the check for unresolved attributes is done before the PostAnalysisVerificationAware checks.
I don't want to add a check just for RRF in the Verifier before we do the unresolved attributes check:
elasticsearch/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Verifier.java
Lines 75 to 88 in b663616
| Collection<Failure> verify(LogicalPlan plan, BitSet partialMetrics) { | |
| assert partialMetrics != null; | |
| Failures failures = new Failures(); | |
| // quick verification for unresolved attributes | |
| checkUnresolvedAttributes(plan, failures); | |
| // in case of failures bail-out as all other checks will be redundant | |
| if (failures.hasFailures()) { | |
| return failures.failures(); | |
| } | |
| // collect plan checkers | |
| var planCheckers = planCheckers(plan); |
(planCheckers is looking for plans that implement PostAnalysisVerificationAware).
This deserves a bit more thought, so I am adding it as a follow up #123391
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's reasonable. Thanks for looking into this.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/LogicalPlanBuilder.java
Show resolved
Hide resolved
fang-xing-esql
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ioanatia, I added some questions about the usage of RRF that I can think of.
x-pack/plugin/esql/qa/testFixtures/src/main/resources/rrf.csv-spec
Outdated
Show resolved
Hide resolved
|
|
||
| List<Order> order = List.of( | ||
| new Order(source, scoreAttr, Order.OrderDirection.DESC, Order.NullsPosition.LAST), | ||
| new Order(source, idAttr, Order.OrderDirection.ASC, Order.NullsPosition.LAST), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a specific reason that we decide to sort on _id before _index? Or the order of these two fields doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it matters - we just need a tiebreaker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if the two extra sort keys(_id and _index) are necessary, as longer sort key length may affect performance. ES|QL does not guarantee the order of the results unless an explicit sort is coded in the query, it is similar as SQL. This could be a potential performance related follow up, in case we see performance issue with RRF.
| required_capability: match_operator_colon | ||
|
|
||
| FROM books METADATA _id, _index, _score | ||
| | FORK ( WHERE title:"Tolkien" | SORT _score DESC | LIMIT 3 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can have some queries with disjunctions in the where clause of each fork leg that will be great, just to add a bit more complexity to make sure it works as expected. There are some queries with disjunctions in the match function and operator's csvtests, that can be used as a reference.
| assertThat(e.getMessage(), containsString("Unknown column [_score]")); | ||
| assertThat(e.getMessage(), containsString("Unknown column [_fork]")); | ||
|
|
||
| e = expectThrows(VerificationException.class, () -> analyze(""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the sequence between FORK and RRF matters? For example if the sequence of fork and RRF is reversed, do we recognized it as a valid query?
| RRF
| FORK (WHERE a:"x")
(WHERE a:"y")
Do we allow multiple fork or RRF, like below? Do they make sense? ES|QL does not prevent multiple occurrence of the same processing commands, commands like where, eval etc. can be used multiple times in the same query, is this also true for RRF and fork?
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
or
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
(WHERE b:"y")
| RRF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have put a validation for RrfScoreEval such that we only allow RRF after a FORK command.
It might seem a bit extreme, but it makes sense in practice because while we might be able to execute the following queries, they don't make a lot of sense:
| RRF
| FORK (WHERE a:"x")
(WHERE a:"y")
or
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
Another thing to note is that we currently have a restriction for FORK where it's possible to only have a single FORK command in a query, so the following is not something we can do atm:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
(WHERE b:"y")
| RRF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see one thing that was concerning when I tried to do:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RRF
| RRF
this would lead to an unexecutable query because when do the RRF planning this expands to:
| FORK (WHERE a:"x")
(WHERE a:"y")
| RrfScoreEval
| Dedup
| Sort
| RrfScoreEval
| Dedup
| Sort
The first SORT does not have a LIMIT so it cannot be translated to a TOP N.
I need to think more about this, not about supporting the case where we do RRF after RRF, but how to avoid this case of having unexecutable queries - I added it as a follow in #123391
| | FORK ( WHERE emp_no:10001 ) | ||
| ( WHERE emp_no:10002 ) | ||
| | RRF | ||
| | EVAL _score = round(_score, 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice trick! ❤️
carlosdelest
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 💯
fang-xing-esql
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!
|
|
||
| List<Order> order = List.of( | ||
| new Order(source, scoreAttr, Order.OrderDirection.DESC, Order.NullsPosition.LAST), | ||
| new Order(source, idAttr, Order.OrderDirection.ASC, Order.NullsPosition.LAST), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if the two extra sort keys(_id and _index) are necessary, as longer sort key length may affect performance. ES|QL does not guarantee the order of the results unless an explicit sort is coded in the query, it is similar as SQL. This could be a potential performance related follow up, in case we see performance issue with RRF.
Part of #123391 where we will keep track of any follow ups.
RRFis split into 3 parts:RrfScoreEvalreceives a discriminator column. It will assign a score for each row based on the position in the subset.Dedupis aSurrogateLogicalPlanthat expands intoSTATS _score =SUM(_score), field1 = VALUES(field1), field2=VALUES(field2), ... BY _id, _index, where:_score =SUM(_score)gives us the final RRF score_idand_indexfield1,field2... are the rest of the available columns that are not_score,_id,_indexand that we want to carry overSORT BY _score, _id, _index DESC- so that we return the sorted results; we use_idand_indexas a way to ensure the result order is deterministic (similar to what we do for_search).The
Dedupstep is the one that needs more consideration - at this stage I grouped by_idand_indexsince it was the easiest ATM, but ideally we might want to use an internal search ID (that's composed by the shard ID + doc ID).The other annoying aspect of grouping by
_idand_indexin the current implementation is that we require havingMETADATA _id, _index. It would be nice to evolve RRF to a place where this is not needed.