Skip to content

Conversation

@carlosdelest
Copy link
Member

@carlosdelest carlosdelest commented Jun 12, 2025

KNN is a different kind of function than other text functions in ESQL - it retrieves the top K results instead of returning all the results dictated by LIMIT.

In order not to return less results than the user expects, and to optimize the underlying KNN search, KNN function will use the LIMIT(s) before it to set the k parameter, in case it is not set already as an option by the user.

FROM test
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 100

will mean that KNN will set k to 100.

A LIMIT that comes later from KNN is taken into account:

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| EVAL t = x + 1
| SORT _score
| LIMIT 100

will use k = 100 as well.

The only commands that stop the LIMIT to KNN propagation is STATS.

If there are multiple LIMIT clauses, the lesser one will be applied:

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 20
| EVAL t = x + 1
| SORT _score
| LIMIT 10

will have k = 10.

If the user has already specified a k value in the KNN function via the k option, the k option will prevail vs using a limit:

FROM test
| WHERE KNN(vector, [0, 1, 2],  {k: 10})
| LIMIT 100

will hava k = 10.

k needs to be specified either implicitly (via LIMIT) or explicitly (via the k option). We have a default LIMIT of 1000 that is applied implicitly, except for example in the case of STATS:

FROM test
| WHERE KNN(vector, [0, 1, 2])
| STATS c = count(*)

the query above will fail as there is no default LIMIT applied and k option is not set.

Open questions:

  • Should we just make K a mandatory param and don't depend on LIMIT?
  • Should KNN use K as an optional param instead of an option?
  • Should the default LIMIT of 1000 be considered an error for KNN, and force the user to either set a LIMIT or set K in KNN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these tests to:

  • Remove the "k" option and replace it by LIMIT
  • Remove STATS use cases, as I've removed the ability to use KNN in STATS for now

* @param failures failures found
*/
private static void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) {
private void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed visibility of these methods to allow overriding in KNN

new EsqlCCSUtils.CssPartialErrorsActionListener(executionInfo, listener) {
@Override
public void onResponse(LogicalPlan analyzedPlan) {
LogicalPlan optimizedPlan = optimizedPlan(analyzedPlan);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed PreMapper to be done after the plan has been optimized. This way, KNN can update k before the QueryBuilder is created, and so it takes the k coming from the optimization

* KNN should not be used in aggregations, as it is a top-N query and not a filtering query
*/
@Override
protected void checkFullTextFunctionsInAggs(Aggregate agg, Failures failures) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the ability to use KNN in aggs. It does not make sense to me - KNN retrieves a top K in terms of similarity, not as a filtering function like other FTFs.

We can undo that decision later, for example forcing it to use a minimum similarity - but still think that it's too much of an edge case as of now, and complicates testing.

@benwtrent
Copy link
Member

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 20
| EVAL t = x + 1
| SORT _score
| LIMIT 10

Having knn take the lesser one doesn't make sense to me. Why shouldn't it take the first one?

As a user, I would assume it takes the first one, then allows me to do whatever I want with the table of k:20, then further limit based on some subsequent actions.

@carlosdelest
Copy link
Member Author

carlosdelest commented Jun 13, 2025

Having knn take the lesser one doesn't make sense to me. Why shouldn't it take the first one?

As a user, I would assume it takes the first one, then allows me to do whatever I want with the table of k:20, then further limit based on some subsequent actions.

@benwtrent In case there's a LIMIT that comes after that reduces the resultset, then whatever the user does before would get lost when applying the lesser LIMIT afterwards.

Given the following:

| WHERE KNN(..)
| LIMIT 20
| EVAL k = x +1
| LIMIT 10

Then it doesn't matter what I calculated in the bottom 10 rows as it will be discarded by the last LIMIT 10.

This comes from an optimization already done for pushing down and combining limits - I wanted to keep the same semantics for KNN, as the returned rows that don't pass a posterior LIMIT would be discarded as well.

The only case I can think of is if we're changing the SORT order between limits - I don't see that taken into account in the push down limits optimization, I'll think about it more.

@benwtrent
Copy link
Member

@carlosdelest yes, sort order changing is exactly my concern

@carlosdelest
Copy link
Member Author

Done as part of #132944

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants