@ioanatia
Contributor

Part of #123391 where we will keep track of any follow ups.

RRF is split into 3 parts:

  • RrfScoreEval receives a discriminator column. It will assign a score for each row based on the position in the subset.
  • Dedup is a SurrogateLogicalPlan that expands into
    STATS _score = SUM(_score), field1 = VALUES(field1), field2 = VALUES(field2), ... BY _id, _index, where:
    • _score = SUM(_score) gives us the final RRF score
    • we dedup by grouping by _id and _index
    • field1, field2 ... are the rest of the available columns that are not _score, _id, _index and that we want to carry over
  • SORT BY _score, _id, _index DESC - so that we return the sorted results; we use _id and _index as a way to ensure the result order is deterministic (similar to what we do for _search).
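For illustration only, the three steps above can be sketched in plain Java over in-memory rows. This is a toy, not the operator code in this PR: `Hit`, `Fused`, and `fuse` are hypothetical names, the rank constant is assumed to be fixed at 60, and hits are assumed to arrive already ordered by relevance within each fork.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RrfSketch {
    // Hypothetical shapes: one hit from one FORK branch, and one fused result.
    record Hit(String id, String index, String fork) {}
    record Fused(String id, String index, double score) {}

    static final int RANK_CONSTANT = 60; // assumed fixed for this sketch

    // Hits are assumed to arrive already ordered by relevance within each fork.
    static List<Fused> fuse(List<Hit> hits) {
        Map<String, Integer> counters = new HashMap<>();
        Map<List<String>, Fused> byDoc = new LinkedHashMap<>();
        for (Hit hit : hits) {
            // Step 1 (score): rank is the 1-based position within the hit's fork.
            int rank = counters.merge(hit.fork(), 1, Integer::sum);
            double score = 1.0 / (RANK_CONSTANT + rank);
            // Step 2 (dedup): group by (_index, _id) and sum the scores.
            byDoc.merge(
                List.of(hit.index(), hit.id()),
                new Fused(hit.id(), hit.index(), score),
                (a, b) -> new Fused(a.id(), a.index(), a.score() + b.score()));
        }
        // Step 3 (sort): score descending, then _id and _index as tiebreakers.
        Comparator<Fused> byScore = Comparator.comparingDouble(Fused::score);
        List<Fused> out = new ArrayList<>(byDoc.values());
        out.sort(byScore.reversed().thenComparing(Fused::id).thenComparing(Fused::index));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("a", "books", "fork1"),
            new Hit("b", "books", "fork1"),
            new Hit("b", "books", "fork2"),
            new Hit("c", "books", "fork2"));
        fuse(hits).forEach(System.out::println);
    }
}
```

In this example document `b` appears in both forks, so its two reciprocal-rank scores are summed and it sorts ahead of `a` and `c`.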

The Dedup step is the one that needs more consideration. At this stage I grouped by _id and _index since it was the easiest ATM, but ideally we might want to use an internal search ID (composed of the shard ID + doc ID).
The other annoying aspect of grouping by _id and _index in the current implementation is that we require having METADATA _id, _index. It would be nice to evolve RRF to a place where this is not needed.

@ioanatia ioanatia added the >feature, Team:Analytics, :Analytics/ES|QL, and v9.1.0 labels on Feb 25, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @ioanatia, I've created a changelog YAML for you.

Contributor

@ChrisHegarty ChrisHegarty left a comment

Overall this change looks very concise and clean. I left a few small comments.

@ioanatia ioanatia marked this pull request as ready for review February 27, 2025 10:31
Contributor

@tteofili tteofili left a comment

very clean impl, LGTM!


int rank = counters.getOrDefault(fork, 1);
counters.put(fork, rank + 1);
scores.appendDouble(1.0 / (60 + rank));
Contributor

minor: this is currently configurable in _search, so we probably need to expose it as an option in the future here too

Contributor Author

You are right, we need to make the rank constant configurable.
This is added as a separate feature in the meta issue #123391
It will require a syntax change for RRF, so I'd like to keep it separate for now.
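As a rough illustration of why the constant matters, here is a minimal sketch of the per-hit score with the rank constant passed in as a parameter. `rrfScore` is a hypothetical helper, not part of this PR, and the lower bound of 1 on both arguments is an assumption made for the sketch.

```java
public class RankConstantSketch {
    // Reciprocal-rank contribution of a single hit; rank is 1-based.
    static double rrfScore(int rank, int rankConstant) {
        // Assumption for this sketch: both values must be at least 1.
        if (rank < 1 || rankConstant < 1) {
            throw new IllegalArgumentException("rank and rankConstant must be >= 1");
        }
        return 1.0 / (rankConstant + rank);
    }

    public static void main(String[] args) {
        // With the hard-coded 60, adjacent ranks score almost the same.
        System.out.println(rrfScore(1, 60) / rrfScore(2, 60));
        // With a small constant, the top rank dominates much more strongly.
        System.out.println(rrfScore(1, 1) / rrfScore(2, 1));
    }
}
```

A larger constant flattens the difference between neighbouring ranks, while a smaller one rewards top positions more, which is the knob users would be tuning.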

Member

@carlosdelest carlosdelest left a comment

Looks really good - I like the separation into individual pieces (dedup, operator, order).

I have some minor questions, and I think some error messages could be better.

from test
| rrf
"""));
assertThat(e.getMessage(), containsString("Unknown column [_score]"));
Member

I think we should provide a better error message for missing metadata attributes, something like "_score is needed for using RRF. Please add METADATA _score to your FROM command".

Contributor Author

I looked into this - it looks simple enough at first glance. We can just modify this to have a custom error message when MetadataAttribute.isSupported(name) is true:

public static String errorMessage(String name, List<String> potentialMatches) {
    String msg = "Unknown column [" + name + "]";
    if (CollectionUtils.isEmpty(potentialMatches) == false) {
        msg += ", did you mean "
            + (potentialMatches.size() == 1 ? "[" + potentialMatches.get(0) + "]" : "any of " + potentialMatches.toString())
            + "?";
    }
    return msg;
}

However we would return an error message like "Please add METADATA _score to your FROM command" even if you use ROW:

ROW a = 1, b = "two", c = null
| WHERE _score > 1

I know this is a very narrow corner case, but it would be an unintended behaviour.
When we call UnresolvedAttribute.errorMessage, it's not straightforward to know whether the source command supports metadata attributes. So I think at most, we can look into this separately and not make the change here.
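To make the idea concrete, a hedged sketch of what that tweak might look like. `METADATA_NAMES` is a hypothetical stand-in for `MetadataAttribute.isSupported(name)`, the hint wording is only a suggestion, and a plain null/empty check replaces `CollectionUtils.isEmpty` so the snippet stands alone. Note that it has exactly the ROW corner case described above, since it never looks at the source command.

```java
import java.util.List;
import java.util.Set;

public class ErrorMessageSketch {
    // Hypothetical stand-in for MetadataAttribute.isSupported(name).
    static final Set<String> METADATA_NAMES = Set.of("_id", "_index", "_score");

    static String errorMessage(String name, List<String> potentialMatches) {
        String msg = "Unknown column [" + name + "]";
        if (METADATA_NAMES.contains(name)) {
            // Suggested wording only; emitted regardless of the source command.
            return msg + ", please add METADATA " + name + " to your FROM command";
        }
        if (potentialMatches != null && potentialMatches.isEmpty() == false) {
            msg += ", did you mean "
                + (potentialMatches.size() == 1 ? "[" + potentialMatches.get(0) + "]" : "any of " + potentialMatches)
                + "?";
        }
        return msg;
    }

    public static void main(String[] args) {
        System.out.println(errorMessage("_score", List.of()));
        System.out.println(errorMessage("titel", List.of("title")));
    }
}
```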

Member

I think it would be OK to error with "_score is needed for using RRF. Use FROM ... METADATA _score".

We can assume that full text search needs FROM, as FTFs need an index attribute to operate on?

We can refine this in a follow up, but it will be very confusing for users to receive "unknown column _score" - being a metadata attribute means users won't understand where's that coming from without referring to docs

Contributor Author

@ioanatia ioanatia Mar 7, 2025

agreed - added as a follow up in #123391

public void testRrfError() {
assumeTrue("requires RRF capability", EsqlCapabilities.Cap.FORK.isEnabled());

var e = expectThrows(VerificationException.class, () -> analyze("""
Member

Shouldn't an explicit message like "FORK is needed before RRF" be added here so users have a clear understanding of the RRF usage?

Contributor Author

Added that in RrfScoreEval by implementing PostAnalysisVerificationAware.
However the check for unresolved attributes is done before the PostAnalysisVerificationAware checks.
I don't want to add a check just for RRF in the Verifier before we do the unresolved attributes check:

Collection<Failure> verify(LogicalPlan plan, BitSet partialMetrics) {
    assert partialMetrics != null;
    Failures failures = new Failures();
    // quick verification for unresolved attributes
    checkUnresolvedAttributes(plan, failures);
    // in case of failures bail-out as all other checks will be redundant
    if (failures.hasFailures()) {
        return failures.failures();
    }
    // collect plan checkers
    var planCheckers = planCheckers(plan);

(planCheckers is looking for plans that implement PostAnalysisVerificationAware).

This deserves a bit more thought, so I am adding it as a follow-up in #123391.
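The ordering problem can be reproduced with a toy two-phase verifier. Everything here is hypothetical and heavily simplified (`Plan` only carries the two facts the checks look at); it is meant to show why the FORK-specific message never reaches the user when `_score` or `_fork` is also unresolved, not how the real Verifier is structured.

```java
import java.util.ArrayList;
import java.util.List;

public class VerifierOrderSketch {
    // Hypothetical, simplified plan: only what the two phases inspect.
    record Plan(List<String> unresolvedColumns, boolean rrfWithoutFork) {}

    static List<String> verify(Plan plan) {
        List<String> failures = new ArrayList<>();
        // Phase 1: unresolved attributes.
        for (String column : plan.unresolvedColumns()) {
            failures.add("Unknown column [" + column + "]");
        }
        // Bail out early: later checks assume a fully resolved plan.
        if (failures.isEmpty() == false) {
            return failures;
        }
        // Phase 2: post-analysis checks, where an RRF-specific check would live.
        if (plan.rrfWithoutFork()) {
            failures.add("FORK is needed before RRF");
        }
        return failures;
    }

    public static void main(String[] args) {
        // RRF without FORK also leaves _score and _fork unresolved, so phase 2 is never reached.
        System.out.println(verify(new Plan(List.of("_score", "_fork"), true)));
        // Only a fully resolved plan gets the clearer message.
        System.out.println(verify(new Plan(List.of(), true)));
    }
}
```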

Member

That's reasonable. Thanks for looking into this.

Member

@fang-xing-esql fang-xing-esql left a comment

Thank you @ioanatia, I added some questions about the usage of RRF that I can think of.


List<Order> order = List.of(
new Order(source, scoreAttr, Order.OrderDirection.DESC, Order.NullsPosition.LAST),
new Order(source, idAttr, Order.OrderDirection.ASC, Order.NullsPosition.LAST),
Member

Is there a specific reason we sort on _id before _index, or does the order of these two fields not matter?

Contributor Author

I don't think it matters - we just need a tiebreaker.

Member

I was wondering if the two extra sort keys (_id and _index) are necessary, as a longer sort key may affect performance. ES|QL does not guarantee the order of the results unless the query contains an explicit sort, similar to SQL. This could be a performance-related follow-up, in case we see performance issues with RRF.
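To weigh that trade-off, it may help to see what the extra keys buy. The sketch below uses hypothetical names (`Doc`, `SCORE_ONLY`, `WITH_TIEBREAK`) and an in-memory list sort rather than the ES|QL TopN operator; it only shows that with tied scores a score-only order depends on arrival order, while the tiebreakers make the result independent of it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TiebreakSketch {
    record Doc(String id, String index, double score) {}

    static final Comparator<Doc> BY_SCORE = Comparator.comparingDouble(Doc::score);
    // Score descending only: tied documents keep whatever order they arrived in.
    static final Comparator<Doc> SCORE_ONLY = BY_SCORE.reversed();
    // Score descending, then _id and _index ascending as tiebreakers.
    static final Comparator<Doc> WITH_TIEBREAK = SCORE_ONLY.thenComparing(Doc::id).thenComparing(Doc::index);

    static List<Doc> sorted(List<Doc> docs, Comparator<Doc> order) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        Doc x = new Doc("1", "books", 0.5);
        Doc y = new Doc("2", "books", 0.5);
        // Same documents, two arrival orders (e.g. fork results merged differently).
        System.out.println(sorted(List.of(x, y), SCORE_ONLY).equals(sorted(List.of(y, x), SCORE_ONLY)));
        System.out.println(sorted(List.of(x, y), WITH_TIEBREAK).equals(sorted(List.of(y, x), WITH_TIEBREAK)));
    }
}
```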

required_capability: match_operator_colon

FROM books METADATA _id, _index, _score
| FORK ( WHERE title:"Tolkien" | SORT _score DESC | LIMIT 3 )
Member

It would be great to have some queries with disjunctions in the WHERE clause of each FORK branch, just to add a bit more complexity and make sure it works as expected. There are some queries with disjunctions in the csvtests for the match function and operator that can be used as a reference.

assertThat(e.getMessage(), containsString("Unknown column [_score]"));
assertThat(e.getMessage(), containsString("Unknown column [_fork]"));

e = expectThrows(VerificationException.class, () -> analyze("""
Member

I wonder if the sequence between FORK and RRF matters? For example, if the sequence of FORK and RRF is reversed, do we recognize it as a valid query?

| RRF
| FORK (WHERE a:"x")
       (WHERE a:"y")

Do we allow multiple FORK or RRF commands, like below? Do they make sense? ES|QL does not prevent multiple occurrences of the same processing command: commands like WHERE, EVAL etc. can be used multiple times in the same query. Is this also true for RRF and FORK?

| FORK (WHERE a:"x")
       (WHERE a:"y")
| RRF
| RRF
or
| FORK (WHERE a:"x")
       (WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
       (WHERE b:"y")
| RRF

Contributor Author

I have put a validation for RrfScoreEval such that we only allow RRF after a FORK command.
It might seem a bit extreme, but it makes sense in practice because while we might be able to execute the following queries, they don't make a lot of sense:

| RRF
| FORK (WHERE a:"x")
       (WHERE a:"y")

or

| FORK (WHERE a:"x")
       (WHERE a:"y")
| RRF
| RRF

Another thing to note is that we currently have a restriction for FORK where it's possible to only have a single FORK command in a query, so the following is not something we can do atm:

| FORK (WHERE a:"x")
       (WHERE a:"y")
| RRF
| FORK (WHERE b:"x")
       (WHERE b:"y")
| RRF

Contributor Author

I did see one thing that was concerning when I tried to do:

| FORK (WHERE a:"x")
       (WHERE a:"y")
| RRF
| RRF

this would lead to an unexecutable query because when we do the RRF planning this expands to:

| FORK (WHERE a:"x")
       (WHERE a:"y")
| RrfScoreEval
| Dedup
| Sort
| RrfScoreEval
| Dedup
| Sort

The first SORT does not have a LIMIT so it cannot be translated to a TOP N.
I need to think more about this, not about supporting the case where we do RRF after RRF, but how to avoid this case of having unexecutable queries - I added it as a follow-up in #123391

@ioanatia ioanatia requested a review from carlosdelest March 7, 2025 13:52
| FORK ( WHERE emp_no:10001 )
( WHERE emp_no:10002 )
| RRF
| EVAL _score = round(_score, 4)
Member

Nice trick! ❤️

Member

@carlosdelest carlosdelest left a comment

LGTM! 💯

Member

@fang-xing-esql fang-xing-esql left a comment

LGTM, thank you!



@ioanatia ioanatia merged commit cda8255 into elastic:main Mar 11, 2025
17 checks passed
@ioanatia ioanatia deleted the esql_rrf branch March 11, 2025 09:18
albertzaharovits pushed a commit to albertzaharovits/elasticsearch that referenced this pull request Mar 13, 2025
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Mar 13, 2025
@stratoula stratoula added the ES|QL-ui label on Mar 19, 2025