[DON'T MERGE] Proof of Concept: ES|QL approximate query execution #131828

jan-elastic · 2025-07-24T13:09:37Z

Proof of concept for approximate query execution

This is for gathering early feedback; not for merging!

This targets queries of the form

FROM data
  | commands_swappable_with_sample
  | STATS aggs
  | more_commands

Approximation rewrites such a query to

FROM data
  | SAMPLE probability
  | commands_swappable_with_sample
  | STATS sample_corrected_aggs
  | more_commands
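As a rough illustration of what "sample_corrected_aggs" could mean: when each row is kept with probability p, a COUNT or SUM computed on the sample is scaled by 1/p to estimate the value on the full data, while an AVG needs no scaling. This is a minimal sketch under that assumption; the class and method names are hypothetical and are not taken from the PR.

```java
/** Hypothetical helper; names are illustrative and not part of the PR. */
final class SampleCorrection {
    private SampleCorrection() {}

    /** Scales a COUNT computed on the sample up to an estimate for the full data. */
    static long correctCount(long sampledCount, double probability) {
        checkProbability(probability);
        return Math.round(sampledCount / probability);
    }

    /** Scales a SUM computed on the sample up to an estimate for the full data. */
    static double correctSum(double sampledSum, double probability) {
        checkProbability(probability);
        return sampledSum / probability;
    }

    /** AVG is a ratio of two quantities that scale identically, so it is returned unchanged. */
    static double correctAvg(double sampledAvg, double probability) {
        checkProbability(probability);
        return sampledAvg;
    }

    private static void checkProbability(double probability) {
        if (!(probability > 0.0 && probability <= 1.0)) {
            throw new IllegalArgumentException("probability must be in (0, 1]: " + probability);
        }
    }
}
```

For example, a sampled count of 250 at probability 0.25 is corrected to 1000.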

The sample probability is chosen so that the approximate results are based on roughly 1000 documents. It is derived from the total result count, which is obtained by running:

FROM data
  | commands_swappable_with_sample
  | STATS COUNT(*)
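The step from that count to a probability can be sketched as follows. This assumes the simplest possible rule (target rows divided by total rows, capped at 1); the PR may pick the probability differently, and the names here are hypothetical.

```java
/** Hypothetical helper; names and the exact rule are assumptions, not the PR's code. */
final class SampleProbability {
    private SampleProbability() {}

    /**
     * Returns a probability such that sampling totalRows rows keeps
     * about targetRows of them, or 1.0 if there are too few rows to sample.
     */
    static double forRowCount(long totalRows, long targetRows) {
        if (totalRows < 0 || targetRows <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and targetRows > 0");
        }
        if (totalRows <= targetRows) {
            return 1.0; // small inputs: no sampling, results are exact
        }
        return (double) targetRows / totalRows;
    }
}
```

With the 4675 documents of the example further down and a target of 1000, this gives a probability of about 0.214.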

You can use this as follows:

POST _query
{  
  "query": """
    FROM kibana_sample_data_ecommerce
     | STATS count=COUNT() BY CATEGORIZE(category)
     | SORT count DESC
  """,
  "approximate": true
}

With "approximate": false, the (correct) results are:

     count     |CATEGORIZE(category)
---------------+--------------------
3927           |.*?Clothing.*?      
2080           |.*?Shoes.*?         
1402           |.*?Accessories.*?

(based on "documents_found": 4675)

With "approximate": true, the (approximate) results look like:

     count     |CATEGORIZE(category)
---------------+--------------------
3791           |.*?Clothing.*?      
2001           |.*?Shoes.*?         
1533           |.*?Accessories.*?

(based on "documents_found": 990)
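To put the two result tables side by side, the relative error of each approximate count can be computed as (approximate - exact) / exact. The sketch below is only a worked comparison of the numbers above; the class name is hypothetical.

```java
/** Hypothetical helper for comparing the exact and approximate tables above. */
final class ApproximationError {
    private ApproximationError() {}

    /** Signed relative error of an approximate value against the exact one. */
    static double relativeError(long exact, long approximate) {
        if (exact == 0) {
            throw new IllegalArgumentException("exact value must be non-zero");
        }
        return (double) (approximate - exact) / exact;
    }
}
```

For the run shown above this gives about -3.5% for Clothing (3791 vs 3927), -3.8% for Shoes (2001 vs 2080) and +9.3% for Accessories (1533 vs 1402). Because sampling is random, another run would give different numbers.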

ivancea

Just a shallow check. To me, it makes sense. I would wait for somebody else to give another opinion though, in case this extra query could lead to something bad somewhere.

x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/approximate/Approximate.java

ivancea · 2025-07-24T14:25:53Z

+     * off at the leftmost STATS function, followed by "| STATS COUNT(*)".
+     * This value can be used to pick a good sample probability.
+     */
+    public LogicalPlan countPlan() {


This extra query is probably my main concern. It looks OK, but it will still execute EVALs and WHEREs, which could end up running the full query anyway (?). It looks a bit "dangerous".

As an idea, I wonder if we could use some kind of Lucene statistics for this. I don't know if we have them though, or if what we have is enough. Even if they were only approximations, they might let us avoid this extra query. This would be another block of work though.

jan-elastic · 2025-07-25

I get your concern. That's exactly why I wanted some early feedback.

The extra query is pretty similar to the extra query of the inline join subplan though.

In the case of

FROM data | STATS COUNT()

I guess we can get the count directly from Lucene.

But for a more complicated

FROM data | WHERE my_function(x) < 1 | STATS COUNT()

that's obviously not possible.

We can use sampling again though to get an approximate count, which is good enough for setting the probability.
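One way to read "use sampling again" is an iterative search: run the count query with a very small sample probability, and raise the probability until enough rows come back to trust the estimate. The sketch below assumes that reading. The callback, thresholds and growth factor are all hypothetical and do not reflect the PR's actual implementation.

```java
import java.util.function.DoubleToLongFunction;

/** Hypothetical sketch; the callback, thresholds and growth factor are assumptions. */
final class IterativeSampleProbability {
    private IterativeSampleProbability() {}

    /**
     * @param sampledCount runs the count query with "SAMPLE p" and returns the rows counted
     * @param targetRows   number of rows the final approximate query should be based on
     * @param minRows      sampled rows needed before the count estimate is trusted
     */
    static double estimate(
        DoubleToLongFunction sampledCount,
        long targetRows,
        long minRows,
        double initialProbability,
        double growthFactor
    ) {
        if (!(initialProbability > 0.0 && initialProbability <= 1.0) || growthFactor <= 1.0) {
            throw new IllegalArgumentException("need 0 < initialProbability <= 1 and growthFactor > 1");
        }
        double p = initialProbability;
        while (true) {
            long count = sampledCount.applyAsLong(p);
            if (count >= minRows || p >= 1.0) {
                if (count == 0) {
                    return 1.0; // nothing matched even without sampling
                }
                // count / p estimates the total, so targetRows / (count / p) is the probability.
                return Math.min(1.0, targetRows * p / count);
            }
            // Too few rows to trust the estimate: sample more of the data and retry.
            p = Math.min(1.0, p * growthFactor);
        }
    }
}
```

Each retry costs one more query, but the early ones touch only a small fraction of the data, which is the point of sampling the count as well.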

jan-elastic marked this pull request as draft July 24, 2025 13:09

elasticsearchmachine added the v9.2.0 label Jul 24, 2025

jan-elastic force-pushed the esql-approximate branch from 4345891 to ee5caf5 Compare July 24, 2025 13:22

jan-elastic changed the title ~~[Proof of Concept] ES|QL approximate query execution~~ [DON'T MERGE] Proof of Concept: ES|QL approximate query execution Jul 24, 2025

ivancea reviewed Jul 24, 2025

jan-elastic force-pushed the esql-approximate branch 5 times, most recently from 319d98d to 36a55ec Compare July 30, 2025 13:55

jan-elastic force-pushed the esql-approximate branch from 36a55ec to 893b0f8 Compare August 1, 2025 10:56

jan-elastic force-pushed the esql-approximate branch 2 times, most recently from b00550a to 39eb164 Compare August 27, 2025 11:22

elasticsearchmachine added v9.3.0 and removed v9.2.0 labels Oct 2, 2025

jan-elastic force-pushed the esql-approximate branch 2 times, most recently from 380e7ac to e47f0db Compare October 7, 2025 09:16

ioanatia mentioned this pull request Oct 8, 2025

[CI] GenerativeForkIT test {csv-spec:string.ContainsFail} failing #136112

Closed

jan-elastic and others added 12 commits October 9, 2025 17:34

Approximate ESQL stats execution using 1000 documents

f983fee

refactor a bit

bb37a73

iteratively get sample probability

3f2840a

"Fix" JOIN/FORK/INLINESTATS

bfe0290

CSV tests

56ce4c5

close resources

c12daa8

better verification errors + tests

44d9a31

test row sampling behavior

4faeb04

add capability

8ee4bf6

remove debug

2d0925a

Add CSV test with STATS ... WHERE

f23ef28

[CI] Auto commit changes from spotless

7a73714

jan-elastic and others added 28 commits October 9, 2025 17:34

correct stats for bucketing

25437db

add empty buckets

6a7c619

improve whitelisting plans

cd8d7b4

move sample to front

b87ff6f

rename sampleId -> bucketId

e53fd3d

move final bucketId agg to the end

874ebda

Fix precision issue

ea746fb

seperate confidence interval column + fix to_string/date etc

470eb35

One column per bucket

e0d0594

Filter null buckets

eb9bf35

whitelist agg functions

fd6a227

Move sample correction to approximate class

85d1e61

disallow chained stats

1cfa14d

blacklist function that may output multivalued

b12edfc

Polish code + add documentation

8f65b6f

move ConfidenceInterval class

13fca52

fix CsvTests

3576702

whitelist supported processing commands

1793fd7

fix + extend ApproximateTests

ed6346f

[CI] Auto commit changes from spotless

61f9bf4

remove debug

ce0acd1

fix merge error

4af56e2

move+hide+improve confidence interval computation

4a6d540

Add reliable computation

308c449

more verification tests

29544c7

trials for confidence/reliable

cbee62b

optimize many_numbers test

e9f015b

spotless

790ade4

jan-elastic force-pushed the esql-approximate branch 2 times, most recently from 9dd8579 to 790ade4 Compare October 9, 2025 15:37
