ESQL: text == and text != pushdown
#127355
Reenables `text ==` pushdown and adds support for `text !=` pushdown.
Hi @nik9000, I've created a changelog YAML for you.
This seems to work but wants more docs and an explanation.
This also wants more digging to be sure it's right. This is the first PR that pushes things to Lucene and rechecks them. The change for that was surprisingly small and I don't trust it to be that easy.
OK! This is worth a look now!
Pinging @elastic/es-analytical-engine (Team:Analytics)
I'm going to make some performance numbers for this today.
I left one comment, but this looks great - especially the docs and samples. Thank you!
if (ft == null) {
    return new MatchNoDocsQuery("missing field [" + field() + "]");
}
ft = ((TextFieldMapper.TextFieldType) ft).syntheticSourceDelegate();
I think SemanticTextFieldType and MatchOnlyTextFieldType are text fields, but their base classes are not TextFieldMapper.TextFieldType.
👍. Will look.
Update on this one: we don't push either of those fields. I've expanded the tests to hit that as well. We can pick them up in follow-up changes.
👍
Neat! LGTM
The docs are just fantastic.
/**
 * Expressions implementing this interface can get called on data nodes to provide an Elasticsearch/Lucene query.
 * Expressions implementing this interface are asked provide an
 * Elasticsearch/Lucene query on the as part of the data node optimizations.
`on the` is probably a leftover.
 */
YES(FinishedTranslatable.YES),
/**
 * Translation requires a recheck. Calling {@link TranslationAware#asQuery} will
❤️
Reenables `text ==` pushdown and adds support for `text !=` pushdown.
It does so by making `TranslationAware#translatable` return something
we can turn into a tri-valued function. It has these values:
* `YES`
* `NO`
* `RECHECK`
`YES` means the `Expression` is entirely pushable into Lucene. These are
pushed into Lucene and removed from the plan.
`NO` means the `Expression` can't be pushed to Lucene at all and will stay
in the plan.
`RECHECK` means the `Expression` can push a query that makes *candidate*
matches but must be rechecked. Documents that don't match the query won't
match the expression, but documents that match the query might not match
the expression. These are pushed to Lucene *and* left in the plan.
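To make the three outcomes concrete, here is a minimal sketch of how a planner could split filters between Lucene and the plan. It is not the code in this PR: `TranslatableSketch`, `Filter`, `Split`, and `split` are hypothetical names made up for illustration, and the enum only mirrors the `YES` / `NO` / `RECHECK` values described above.

```java
import java.util.ArrayList;
import java.util.List;

class TranslatableSketch {
    /** Hypothetical tri-valued result, mirroring the values described above. */
    enum Translatable { YES, NO, RECHECK }

    /** Hypothetical stand-in for a filter expression and its translatability. */
    record Filter(String name, Translatable translatable) {}

    /** Which filters are pushed to Lucene and which stay in the plan. */
    record Split(List<String> pushed, List<String> kept) {}

    static Split split(List<Filter> filters) {
        List<String> pushed = new ArrayList<>();
        List<String> kept = new ArrayList<>();
        for (Filter f : filters) {
            switch (f.translatable()) {
                // Fully pushable: Lucene answers it, the plan no longer needs it.
                case YES -> pushed.add(f.name());
                // Not pushable: only the compute engine can evaluate it.
                case NO -> kept.add(f.name());
                // Candidate query: push it to narrow the documents AND keep it to recheck.
                case RECHECK -> {
                    pushed.add(f.name());
                    kept.add(f.name());
                }
            }
        }
        return new Split(pushed, kept);
    }
}
```

With one filter of each kind, the `RECHECK` filter is the only one that shows up in both lists.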
This is required because `txt != "b"` can build a *candidate* query
against the `txt.keyword` subfield but it can't be sure of the match
without loading the `_source` - which we do in the compute engine.
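A toy model of the candidate-then-recheck flow for `txt != "b"`, as a sketch only. It assumes a tiny in-memory inverted index standing in for the `txt.keyword` subfield, and it assumes a missing value is the reason a candidate can fail the recheck; the real reasons in Elasticsearch may differ. `RecheckSketch` and its methods are hypothetical names, not anything from this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RecheckSketch {
    /** Toy inverted index for the keyword subfield: term -> doc ids. Null values are not indexed. */
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            String value = docs.get(doc);
            if (value != null) {
                postings.computeIfAbsent(value, k -> new HashSet<>()).add(doc);
            }
        }
        return postings;
    }

    /** Candidate query for `txt != target`: every doc that is not in the postings for target. */
    static List<Integer> candidates(List<String> docs, Map<String, Set<Integer>> postings, String target) {
        Set<Integer> excluded = postings.getOrDefault(target, Set.of());
        List<Integer> result = new ArrayList<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            if (!excluded.contains(doc)) {
                result.add(doc);
            }
        }
        return result;
    }

    /** Recheck against the loaded value; a missing value never satisfies the comparison. */
    static List<Integer> recheck(List<String> docs, List<Integer> candidates, String target) {
        List<Integer> result = new ArrayList<>();
        for (int doc : candidates) {
            String value = docs.get(doc);
            if (value != null && !value.equals(target)) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

For documents `"a"`, `"b"`, missing, `"c"`, the candidate query drops only the `"b"` document, and the recheck then drops the document with the missing value. Nothing the candidate query rejected could have matched the expression, which is what makes pushing it safe.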
I haven't plugged Rally into this, but here are some basic
performance tests:
```
Before:
not text eq {"took":460,"documents_found":1000000}
text eq {"took":432,"documents_found":1000000}
After:
text eq {"took":5,"documents_found":1}
not text eq {"took":351,"documents_found":800000}
```
This comes from:
```
rm -f /tmp/bulk*
for a in {1..1000}; do
echo '{"index":{}}' >> /tmp/bulk
echo '{"text":"text '$(printf $(($a % 5)))'"}' >> /tmp/bulk
done
ls -l /tmp/bulk*
passwd="redacted"
curl -sk -uelastic:$passwd -HContent-Type:application/json -XDELETE https://localhost:9200/test
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPUT https://localhost:9200/test -d'{
"settings": {
"index.codec": "best_compression",
"index.refresh_interval": -1
},
"mappings": {
"properties": {
"many": {
"enabled": false
}
}
}
}'
for a in {1..1000}; do
printf %04d: $a
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_forcemerge?max_num_segments=1
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST https://localhost:9200/test/_refresh
echo
curl -sk -uelastic:$passwd https://localhost:9200/_cat/indices?v
text_eq() {
echo -n " text eq "
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
"query": "FROM test | WHERE text == \"text 1\" | STATS COUNT(*)",
"pragma": {
"data_partitioning": "shard"
}
}' | jq -c '{took, documents_found}'
}
not_text_eq() {
echo -n "not text eq "
curl -sk -uelastic:$passwd -HContent-Type:application/json -XPOST 'https://localhost:9200/_query?pretty' -d'{
"query": "FROM test | WHERE NOT text == \"text 1\" | STATS COUNT(*)",
"pragma": {
"data_partitioning": "shard"
}
}' | jq -c '{took, documents_found}'
}
for a in {1..100}; do
text_eq
not_text_eq
done
```
Backported by #128156