Contributor

@jan-elastic jan-elastic commented Jul 24, 2025

Proof of concept for approximate query execution

This is for gathering early feedback; not for merging!

This is targeting queries of the form

FROM data
  | commands_swappable_with_sample
  | STATS aggs
  | more_commands

Approximating rewrites it to

FROM data
  | SAMPLE probability
  | commands_swappable_with_sample
  | STATS sample_corrected_aggs
  | more_commands

The sample probability is chosen so that the approximated results are based on roughly 1000 docs. It's derived from the total result count, which is obtained by first running:

FROM data
  | commands_swappable_with_sample
  | STATS COUNT(*)
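
As a rough illustration of how a count could be turned into a sample probability, here is a minimal sketch. The class name, the method name, the `TARGET_DOCS` constant and the clamping to 1.0 are assumptions made for this example, not the code in this PR.

```java
// Hypothetical helper, not taken from the PR: maps a total result count
// to a sample probability that yields roughly TARGET_DOCS sampled docs.
final class SampleProbability {
    // Assumed target, based on the "~1000 docs" mentioned in the description.
    static final long TARGET_DOCS = 1000;

    public static double forCount(long totalCount) {
        if (totalCount < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        // With few enough docs, sampling brings no benefit: keep everything.
        if (totalCount <= TARGET_DOCS) {
            return 1.0;
        }
        return (double) TARGET_DOCS / totalCount;
    }

    public static void main(String[] args) {
        // 4675 is the "documents_found" value of the exact query in the description.
        System.out.println(forCount(4675)); // roughly 0.214
        System.out.println(forCount(500));
    }
}
```

With the 4675 documents of the example dataset this gives a probability of about 0.214, which is in line with the 990 documents found by the approximate run.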

You can use this as follows

POST _query
{  
  "query": """
    FROM kibana_sample_data_ecommerce
     | STATS count=COUNT() BY CATEGORIZE(category)
     | SORT count DESC
  """,
  "approximate": true
}

With "approximate": false, the (correct) results are:

     count     |CATEGORIZE(category)
---------------+--------------------
3927           |.*?Clothing.*?      
2080           |.*?Shoes.*?         
1402           |.*?Accessories.*?   

(based on "documents_found": 4675)

With "approximate": true, the (approximate) results are like:

     count     |CATEGORIZE(category)
---------------+--------------------
3791           |.*?Clothing.*?      
2001           |.*?Shoes.*?         
1533           |.*?Accessories.*?   

(based on "documents_found": 990)
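
To make the "sample corrected" part concrete for a COUNT: a count computed on a sample can be scaled back up by dividing by the sample probability. The sketch below assumes that simple scaling; the class and method names are hypothetical, and the sampled count of 811 is an illustrative input, not a value reported by the PR.

```java
// Hypothetical sketch of sample correction for a COUNT aggregation.
final class SampleCorrection {
    // Scales a count observed on a sample back to an estimate for all docs.
    public static long correctedCount(long sampledCount, double probability) {
        if (probability <= 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in (0, 1]");
        }
        return Math.round(sampledCount / probability);
    }

    public static void main(String[] args) {
        double probability = 1000.0 / 4675; // assumed probability for the example dataset
        long sampledCount = 811;            // illustrative sampled count for one category
        System.out.println(correctedCount(sampledCount, probability));
    }
}
```

Aggregations such as an average would presumably need no such scaling, since the mean of a random sample already estimates the mean of the full data.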

@jan-elastic jan-elastic marked this pull request as draft July 24, 2025 13:09
@jan-elastic jan-elastic changed the title [Proof of Concept] ES|QL approximate query execution [DON'T MERGE] Proof of Concept: ES|QL approximate query execution Jul 24, 2025
Contributor

@ivancea ivancea left a comment

Just a shallow check. To me, it makes sense. I would wait for somebody else to have another opinion though, in case this extra query could lead to something bad somewhere.

* off at the leftmost STATS function, followed by "| STATS COUNT(*)".
* This value can be used to pick a good sample probability.
*/
public LogicalPlan countPlan() {
Contributor

This extra query is probably my major "concern". It looks ok, but it's still going to execute evals and wheres, which could end up executing a full query anyway (?). It looks a bit "dangerous".

As an idea, I wonder if we could use some kind of Lucene statistics for this. I don't know if we have them though, or if what we have is enough. Even if they were just approximates, they could let us avoid this extra query, maybe. This would be another block of work though.

Contributor Author

@jan-elastic jan-elastic Jul 25, 2025

I get your concern. That's exactly why I wanted some early feedback.

The extra query is pretty similar to the extra query of the inline join subplan though.

Contributor Author

In the case of

FROM data | STATS COUNT()

I guess we can get the count directly from Lucene.

But for a more complicated

FROM data | WHERE my_function(x) < 1 | STATS COUNT()

that's obviously not possible.

We can use sampling again though to get an approximate count, which is good enough for setting the probability.
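
A rough sketch of that idea, under assumptions of my own: a pilot pass samples docs with a small fixed probability, evaluates the (possibly expensive) filter only on the sampled docs, and scales the number of matches back up. All names here are hypothetical, docs are stood in for by plain ids, and the filter is a toy replacement for `my_function(x) < 1`.

```java
import java.util.Random;
import java.util.function.LongPredicate;

// Hypothetical sketch: estimate how many docs pass a filter by sampling first.
final class PilotCountEstimate {
    public static double estimateMatches(long totalDocs, double pilotProbability,
                                         LongPredicate filter, long seed) {
        if (pilotProbability <= 0.0 || pilotProbability > 1.0) {
            throw new IllegalArgumentException("probability must be in (0, 1]");
        }
        Random random = new Random(seed);
        long sampledMatches = 0;
        for (long doc = 0; doc < totalDocs; doc++) {
            // Short-circuit: the filter only runs on docs that were sampled.
            if (random.nextDouble() < pilotProbability && filter.test(doc)) {
                sampledMatches++;
            }
        }
        // Scale the sampled match count back to an estimate for all docs.
        return sampledMatches / pilotProbability;
    }

    public static void main(String[] args) {
        // Toy filter: every fourth doc matches, so the true count is 250000.
        double estimate = estimateMatches(1_000_000, 0.01, doc -> doc % 4 == 0, 42L);
        System.out.println("estimated matches: " + estimate);
    }
}
```

The estimate is noisy, but since it only feeds the choice of the final sample probability, a coarse value should be sufficient.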

@jan-elastic jan-elastic force-pushed the esql-approximate branch 5 times, most recently from 319d98d to 36a55ec Compare July 30, 2025 13:55
@jan-elastic jan-elastic force-pushed the esql-approximate branch 2 times, most recently from b00550a to 39eb164 Compare August 27, 2025 11:22
@jan-elastic jan-elastic force-pushed the esql-approximate branch 2 times, most recently from 380e7ac to e47f0db Compare October 7, 2025 09:16
@jan-elastic jan-elastic force-pushed the esql-approximate branch 2 times, most recently from 9dd8579 to 790ade4 Compare October 9, 2025 15:37