ESQL: `text ==` and `text !=` pushdown #127355

nik9000 · 2025-04-24T20:44:12Z

Reenables text == pushdown and adds support for text != pushdown.

It does so by making TranslationAware#translatable return something we can turn into a tri-valued function. It has these values:

YES
NO
RECHECK

YES means the Expression is entirely pushable into Lucene. It is pushed into Lucene and removed from the plan.

NO means the Expression can't be pushed to Lucene at all and will stay in the plan.

RECHECK means the Expression can push a query that makes candidate matches but must be rechecked. Documents that don't match the query won't match the expression, but documents that match the query might not match the expression.
These are pushed to Lucene and left in the plan.

This is required because txt != "b" can build a candidate query against the txt.keyword subfield but it can't be sure of the match without loading the _source - which we do in the compute engine.
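A minimal sketch of how the three values could decide what goes to Lucene and what stays in the plan. Every name below (`TriStatePushdownSketch`, `Filter`, `Split`, `split`) is a hypothetical stand-in, and the verdicts in `main` are assigned by hand for illustration. The real enum in this PR seems to carry more detail than this: the diff shows `YES(FinishedTranslatable.YES)`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. Every name here is a hypothetical stand-in,
// not the actual Elasticsearch class or method.
final class TriStatePushdownSketch {
    // The three outcomes described above.
    enum Translatable { YES, NO, RECHECK }

    // A filter expression (as text) plus its pushdown verdict.
    record Filter(String expression, Translatable translatable) {}

    // What is sent to Lucene and what the compute engine still evaluates.
    record Split(List<String> pushedToLucene, List<String> keptInPlan) {}

    static Split split(List<Filter> filters) {
        List<String> pushed = new ArrayList<>();
        List<String> kept = new ArrayList<>();
        for (Filter f : filters) {
            switch (f.translatable()) {
                // Fully pushable: goes to Lucene and leaves the plan.
                case YES -> pushed.add(f.expression());
                // Not pushable: stays in the plan only.
                case NO -> kept.add(f.expression());
                // Candidate query goes to Lucene, and the expression
                // stays in the plan so each candidate is rechecked.
                case RECHECK -> {
                    pushed.add(f.expression());
                    kept.add(f.expression());
                }
            }
        }
        return new Split(pushed, kept);
    }

    public static void main(String[] args) {
        // Verdicts are assigned by hand purely for illustration.
        Split s = split(List.of(
            new Filter("a", Translatable.YES),
            new Filter("b", Translatable.NO),
            new Filter("txt != \"b\"", Translatable.RECHECK)
        ));
        System.out.println("pushed to Lucene: " + s.pushedToLucene());
        System.out.println("kept in plan:     " + s.keptInPlan());
    }
}
```

The RECHECK branch is the only one that puts the same expression on both lists, which is what "pushed to Lucene and left in the plan" means here.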

I haven't plugged Rally into this, but here are some basic performance tests:

Before:
not text eq {"took":460,"documents_found":1000000}
    text eq {"took":432,"documents_found":1000000}

After:
    text eq {"took":5,"documents_found":1}
not text eq {"took":351,"documents_found":800000}

This comes from:

rm -f /tmp/bulk*
for a in {1..1000}; do
    echo '{"index":{}}' >> /tmp/bulk
    echo '{"text":"text '$(printf $(($a % 5)))'"}' >> /tmp/bulk
done
ls -l /tmp/bulk*

passwd="redacted"
curl -sk -uelastic:$passwd -HContent-Type:application/json -XDELETE https://localhost:9200/test
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPUT https://localhost:9200/test -d'{
    "settings": {
        "index.codec": "best_compression",
        "index.refresh_interval": -1
    },
    "mappings": {
        "properties": {
            "many": {
                "enabled": false
            }
        }
    }
}'
for a in {1..1000}; do
    printf %04d: $a
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_forcemerge?max_num_segments=1
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_refresh
echo
curl -sk -uelastic:$passwd https://localhost:9200/_cat/indices?v

text_eq() {
    echo -n "    text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}

not_text_eq() {
    echo -n "not text eq "
    curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
        "query": "FROM test | WHERE NOT text == \"text 1\" | STATS COUNT(*)",
        "pragma": {
            "data_partitioning": "shard"
        }
    }' | jq -c '{took, documents_found}'
}


for a in {1..100}; do
    text_eq
    not_text_eq
done
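
For reading the numbers above, it helps to know what the script indexes. The sketch below recomputes the data distribution, assuming every bulk request succeeds: one bulk file of 1000 documents is posted 1000 times, and each document's value is `text ` followed by the loop counter modulo 5. The class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: recomputes the document counts the benchmark script
// should produce, assuming every bulk request succeeds.
final class BenchmarkDataSketch {
    // One bulk file with docsPerBulk documents, posted bulkRequests times.
    static Map<String, Long> valueCounts(int docsPerBulk, int bulkRequests) {
        Map<String, Long> counts = new TreeMap<>();
        for (int a = 1; a <= docsPerBulk; a++) {
            // Mirrors: echo '{"text":"text '$(printf $(($a % 5)))'"}'
            counts.merge("text " + (a % 5), (long) bulkRequests, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = valueCounts(1000, 1000);
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        long equal = counts.get("text 1");
        System.out.println("total documents:        " + total);
        System.out.println("text == \"text 1\":       " + equal);
        System.out.println("NOT text == \"text 1\":   " + (total - equal));
    }
}
```

That gives 1,000,000 documents in total and 200,000 per distinct value, so 800,000 documents do not equal "text 1". This lines up with the 1000000 reported before the change and the 800000 reported for `not text eq` after it.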

elasticsearchmachine · 2025-04-24T20:44:36Z

Hi @nik9000, I've created a changelog YAML for you.

nik9000

This seems to work but wants more docs and an explanation.

nik9000 · 2025-04-25T13:59:38Z

This also wants more digging to be sure it's right. This is the first PR that pushes things to Lucene and rechecks them. The change for that was surprisingly small and I don't trust it to be that easy.

nik9000 · 2025-05-05T21:25:17Z

OK! This is worth a look now!

elasticsearchmachine · 2025-05-05T21:25:32Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2025-05-06T12:53:38Z

I'm going to make some performance numbers for this today.

dnhatn

I left one comment, but this looks great - especially the docs and samples. Thank you!

dnhatn · 2025-05-06T23:35:27Z

.../plugin/esql/src/main/java/org/elasticsearch/xpack/esql/querydsl/query/SingleValueQuery.java

+            if (ft == null) {
+                return new MatchNoDocsQuery("missing field [" + field() + "]");
+            }
+            ft = ((TextFieldMapper.TextFieldType) ft).syntheticSourceDelegate();


I think SemanticTextFieldType and MatchOnlyTextFieldType are text fields, but their base classes are not TextFieldMapper.TextFieldType.

👍. Will look.

Update on this one: we don't push either of those fields. I've expanded the tests to hit that as well. We can pick them up in follow-up changes.
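
To illustrate the concern raised in this thread, here is a minimal sketch of guarding the cast with a type check rather than casting unconditionally. It is not what the PR does (the reply above says those fields are simply not pushed). All types below are hypothetical stand-ins for the Elasticsearch mapper classes, and only the method name `syntheticSourceDelegate` is borrowed from the diff.

```java
import java.util.Optional;

// Illustrative only: stand-in types, not the real Elasticsearch mappers.
final class TextDelegateSketch {
    interface FieldType {}

    // Stand-in for the keyword sub-field a text field may delegate to.
    record KeywordFieldType(String name) implements FieldType {}

    // Stand-in for TextFieldMapper.TextFieldType; the delegate may be absent.
    record TextFieldType(KeywordFieldType delegate) implements FieldType {
        KeywordFieldType syntheticSourceDelegate() {
            return delegate;
        }
    }

    // Stand-in for a text-like type whose base class is NOT TextFieldType.
    record OtherTextLikeFieldType() implements FieldType {}

    // Returns the keyword delegate only when the field really is a
    // TextFieldType and has one; otherwise signals "don't push".
    static Optional<KeywordFieldType> delegateFor(FieldType ft) {
        if (ft instanceof TextFieldType text) {
            return Optional.ofNullable(text.syntheticSourceDelegate());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        KeywordFieldType kw = new KeywordFieldType("txt.keyword");
        System.out.println(delegateFor(new TextFieldType(kw)).isPresent());
        System.out.println(delegateFor(new OtherTextLikeFieldType()).isPresent());
    }
}
```

An unconditional cast would throw `ClassCastException` for the second case. The guard turns that into an empty result, which a caller could treat as not pushable.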

luigidellaquila

Neat! LGTM
The docs are just fantastic.

luigidellaquila · 2025-05-07T08:44:55Z

...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/capabilities/TranslationAware.java

 /**
- * Expressions implementing this interface can get called on data nodes to provide an Elasticsearch/Lucene query.
+ * Expressions implementing this interface are asked provide an
+ * Elasticsearch/Lucene query on the as part of the data node optimizations.


`on the` is probably a leftover

luigidellaquila · 2025-05-07T08:45:50Z

...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/capabilities/TranslationAware.java

+         */
+        YES(FinishedTranslatable.YES),
+        /**
+         * Translation requires a recheck. Calling {@link TranslationAware#asQuery} will


nik9000 · 2025-05-19T18:35:32Z

Backported by #128156

nik9000 added 2 commits April 24, 2025 09:27

ESQL: Reenable text == pushdown

78247a6

Reenables `text ==` pushdown and adds support for `text !=` pushdown.

Merge branch 'main' into half_pushable

34048e0

nik9000 added the >enhancement, :Analytics/ES|QL, and v9.1.0 labels Apr 24, 2025

Update docs/changelog/127355.yaml

22a7c1d

nik9000 and others added 12 commits April 30, 2025 13:03

Merge branch 'main' into half_pushable

4be4318

Fix test

3442f7b

Merge remote-tracking branch 'nik9000/half_pushable' into half_pushable

eeefc98

Merge branch 'main' into half_pushable

ec77a9f

Test and javadoc

8d93564

WIP

bb9b13a

[CI] Auto commit changes from spotless

83f02f8

Merge branch 'main' into half_pushable

2807083

Spotless:

400c00f

Merge remote-tracking branch 'nik9000/half_pushable' into half_pushable

2aea3f9

Fixup

d6146fb

More

db35b76

nik9000 marked this pull request as ready for review May 5, 2025 21:25

elasticsearchmachine added the Team:Analytics label May 5, 2025

nik9000 added 3 commits May 6, 2025 10:01

Merge branch 'main' into half_pushable

156651d

Merge branch 'main' into half_pushable

093f89f

Merge branch 'main' into half_pushable

cef677b

dnhatn approved these changes May 6, 2025

luigidellaquila approved these changes May 7, 2025

nik9000 added 6 commits May 7, 2025 11:21

Merge branch 'main' into half_pushable

82b4e68

Fix test

641f7c8

Merge branch 'main' into half_pushable

9b0f018

semantic_text and match_only_text

3b7384e

Merge remote-tracking branch 'nik9000/half_pushable' into half_pushable

d9575d6

Merge branch 'main' into half_pushable

bd8d437

nik9000 enabled auto-merge (squash) May 8, 2025 12:56

nik9000 merged commit 3551494 into elastic:main May 8, 2025
16 of 17 checks passed

idegtiarenko mentioned this pull request May 19, 2025

[CI] PushQueriesIT testEqualityAndOther {semantic_text} failing #128122

Closed

nik9000 added the v8.19.0 label May 28, 2025

nik9000 mentioned this pull request May 28, 2025

[ES|QL] WHERE Command filtering can be significantly slower than DSL #128529

Open

nik9000 mentioned this pull request Jul 23, 2025

ESQL: document_found isn't always the number of "documents" found #131771

Open
