ESQL: Pushdown constructs doing case-insensitive regexes #128393

bpintea · 2025-05-23T14:53:40Z

This introduces an optimization that pushes down to Lucene those language constructs that aim at case-insensitive regular expression matching, used with the LIKE and RLIKE operators, such as:

| WHERE TO_LOWER(field) LIKE "abc*"
| WHERE TO_UPPER(field) RLIKE "ABC.*"

These are now pushed down to Lucene as case-insensitive wildcard and regexp queries, respectively.

Closes #127479
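
To make the equivalence behind this rewrite concrete, here is a minimal Java sketch. It uses `java.util.regex` as a stand-in for Lucene's wildcard query, and the names `CaseFoldingLikeSketch`, `lowerThenLike` and `rewritten` are hypothetical, not taken from the PR. It assumes `*` and `?` are the only wildcard characters, ignores escaping, and is only meant for inputs whose length does not change when their case changes.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustration only: models LIKE wildcards with java.util.regex, not with Lucene. */
final class CaseFoldingLikeSketch {

    /** Translates a LIKE pattern ('*' = any run, '?' = any single char) into a Java regex. */
    static Pattern likeToRegex(String like, boolean caseInsensitive) {
        StringBuilder regex = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        int flags = Pattern.DOTALL;
        if (caseInsensitive) {
            flags |= Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        }
        return Pattern.compile(regex.toString(), flags);
    }

    /** Unoptimized form: TO_LOWER(value) LIKE pattern, evaluated value by value. */
    static boolean lowerThenLike(String value, String like) {
        return likeToRegex(like, false).matcher(value.toLowerCase(Locale.ROOT)).matches();
    }

    /** Rewritten form: a single case-insensitive match against the untouched value. */
    static boolean rewritten(String value, String like) {
        if (!like.equals(like.toLowerCase(Locale.ROOT))) {
            // A lowercased value cannot match a pattern that contains uppercase literals,
            // so the whole predicate can be folded to false.
            return false;
        }
        return likeToRegex(like, true).matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(lowerThenLike("Georgi", "geor*") + " / " + rewritten("Georgi", "geor*"));
        System.out.println(lowerThenLike("Georgi", "GEOR*") + " / " + rewritten("Georgi", "GEOR*"));
    }
}
```

Both forms agree on these inputs; the second call shows why a pattern whose casing contradicts the function can be folded to a constant instead of being pushed down.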

elasticsearchmachine · 2025-05-23T14:54:04Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2025-05-23T14:54:05Z

Hi @bpintea, I've created a changelog YAML for you.

bpintea · 2025-05-23T15:04:54Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/RLike.java

+        if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {
+            // The plan has been optimized to run a case-insensitive match, which the remote peer cannot be notified of. Simply avoiding
+            // the serialization of the boolean would result in wrong results.
+            throw new EsqlIllegalArgumentException(
+                NAME + " with case insensitivity is not supported in peer node's version [{}]. Upgrade to version [{}] or newer.",
+                out.getTransportVersion(),
+                TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY
+            );
+        }


Pointing out this fast-failure here // bwc issue.

We haven't been serialising the case-insensitivity of the regexp operators, so far. We now need to.

Not sure we have an alternative to failing the query in this case, without introducing some compensating mechanisms like rerouting queries to old nodes for planning. The planner isn't currently aware of the versions it should plan for.

Introducing new operators (RLIKE2?) would still fail on old nodes; but it would maybe allow the user to opt out of the optimisation, so that the query on mixed-versions won't fail?

Another case for us needing to know if we can enable the optimization based on the minimum node version, or something like it.

We can't make these decisions on the local node only?

In this case we could be safe, see my comment below, but in general I agree that having minimum node version (or node features, or whatever) at planning level would help a lot.

We can't make these decisions on the local node only?

We probably should be able to pull that into a rule that applies locally only, yes. Thanks!
That would allow applying this optimisation, though the serialisation issue itself will remain.

Extracted the logic into a local-optimiser-only rule.
The exception triggering is left in place, though it's now no longer triggerable, as any new functionality to turn the RegexMatch caseInsensitive will follow after introducing the serialisation of the boolean.
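
The fail-fast behaviour discussed in this thread can be sketched as follows. This is a simplified stand-in, not the Elasticsearch stream API: `VersionGatedRegexMatchSketch`, its nested `RegexMatch` class and the integer `CASE_INSENSITIVITY_VERSION` are hypothetical names, and plain `DataOutputStream`/`DataInputStream` replace the real transport streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VersionGatedRegexMatchSketch {
    // Hypothetical wire version from which the case-insensitivity flag is serialized.
    static final int CASE_INSENSITIVITY_VERSION = 2;

    static final class RegexMatch {
        final String pattern;
        final boolean caseInsensitive;

        RegexMatch(String pattern, boolean caseInsensitive) {
            this.pattern = pattern;
            this.caseInsensitive = caseInsensitive;
        }
    }

    static byte[] write(RegexMatch match, int peerVersion) {
        if (match.caseInsensitive && peerVersion < CASE_INSENSITIVITY_VERSION) {
            // Silently dropping the flag would make the old peer run a case-sensitive match.
            throw new IllegalArgumentException(
                "case-insensitive match is not supported by peer version [" + peerVersion + "]"
            );
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(match.pattern);
            if (peerVersion >= CASE_INSENSITIVITY_VERSION) {
                out.writeBoolean(match.caseInsensitive);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static RegexMatch read(byte[] data, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String pattern = in.readUTF();
            // Old peers never wrote the flag, so it defaults to a case-sensitive match.
            boolean caseInsensitive = peerVersion >= CASE_INSENSITIVITY_VERSION && in.readBoolean();
            return new RegexMatch(pattern, caseInsensitive);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RegexMatch back = read(write(new RegexMatch("abc.*", true), 2), 2);
        System.out.println(back.pattern + " caseInsensitive=" + back.caseInsensitive);
    }
}
```

With the rewrite applied only by the local optimiser, the throwing branch is not expected to be reached; it stays as a guard against a flag that an older peer could not read.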

luigidellaquila

I had a quick look and left a first round of comments, the implementation looks good in general.

My real concern is the breaking change: TO_LOWER and TO_UPPER are the most used functions in ES|QL, which means that introducing a breaking change here would impact most of our users.
Probably we have an escape hatch though, see my comment below.

luigidellaquila · 2025-05-23T15:31:09Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/where-like.csv-spec

+FROM employees
+| KEEP emp_no, first_name
+| SORT emp_no
+| WHERE TO_UPPER(first_name) LIKE "GEOR*"


Could you please add a couple of negative tests, e.g. where TO_UPPER tries to match a lowercase pattern:

FROM employees | KEEP emp_no, first_name | SORT emp_no | WHERE TO_UPPER(first_name) LIKE "geor*"

Added more tests (the logical folding is tested as well)

luigidellaquila · 2025-05-23T15:33:31Z

x-pack/plugin/esql/qa/testFixtures/src/main/resources/where-like.csv-spec

+10055          |Georgy
+;
+
+likeWithLower


If we decide to accept the breaking change, you'll need at least a new capability, otherwise these tests will fail on mixed clusters.

No longer needed.

luigidellaquila · 2025-05-23T15:45:13Z

...gin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java

 import org.elasticsearch.xpack.esql.expression.function.scalar.multivalue.MvSum;
 import org.elasticsearch.xpack.esql.expression.function.scalar.nulls.Coalesce;
 import org.elasticsearch.xpack.esql.expression.function.scalar.string.Concat;
+import org.elasticsearch.xpack.esql.expression.function.scalar.string.RLike;


I think we'll also need more unit tests, i.e. a few more cases in WildcardLikeTests and RLikeTests.

Yep, added those now too.

luigidellaquila · 2025-05-23T15:51:29Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/RLike.java

        source().writeTo(out);
        out.writeNamedWriteable(field());
        out.writeString(pattern().asJavaRegex());
+        if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {


Can't we just re-add the to_upper/lower() function on the fly if we are on an old transport version?

We'll have to be careful with layouts and NameIDs, but probably it's not impossible.

This would also give us a bwc advantage if we decide to add case insensitive operators to the grammar.

Clever idea: serialize foo rlike "aAa" as to_lower(foo) rlike to_lower("aAa").
This is best done in the planner but since we don't have the versioning in place, we can do this locally for rlike and like.

Another suggestion that might be cleaner would be to perform the optimization on the data nodes only, not on the coordinator to avoid the difference in serialization.

costin

Thanks for picking this one up! LGTM to me once the serialization/compatibility stuff gets sorted out.

costin · 2025-05-23T21:02:32Z

...src/main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/RLikePattern.java


    @Override
-    public Automaton createAutomaton() {
+    public Automaton createAutomaton(boolean ignoreCase) {


Expose ignoreCase as a property in StringPattern, since it affects both the Automaton and javaRegex. The former can contain the mode but the latter doesn't, so we need a way to bubble it up.

The pattern is independent of how it's used for matching, casing-wise. The Java regex version has its own mechanism to flag case insensitivity, and I'm not sure it'd be trivial, or "safe", or even needed to modify it based on a method parameter.
But even if we updated the StringPattern interface, we'd have to recreate the object if the RegexMatch requires case-insensitive matching, since the StringPattern object is created at parsing time (when it's not known whether the matching will be case insensitive or not).
Furthermore, the matchesAll() and exactMatch() methods of AbstractStringPattern, which also call automaton(), are invariant to casing.
So I'm not sure we'd need any more changes, but if there's a better solution here, I'm happy to apply it.

costin · 2025-05-23T21:03:50Z

.../main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/WildcardPattern.java

Same comment as above - make the parameter a class property.

costin · 2025-05-23T21:04:48Z

x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/util/StringUtils.java

+    public static String luceneWildcardToRegExp(String wildcard) {
+        StringBuilder regex = new StringBuilder();
+
+        for (int i = 0, wcLen = wildcard.length(); i < wcLen; i++) {


costin · 2025-05-23T21:11:49Z

...esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/RLike.java

        source().writeTo(out);
        out.writeNamedWriteable(field());
        out.writeString(pattern().asJavaRegex());
+        if (caseInsensitive() && out.getTransportVersion().before(TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY)) {


Clever idea: serialize foo rlike "aAa" as to_lower(foo) rlike to_lower("aAa").
This is best done in the planner but since we don't have the versioning in place, we can do this locally for rlike and like.

Another suggestion that might be cleaner would be to perform the optimization on the data nodes only, not on the coordinator to avoid the difference in serialization.

bpintea · 2025-05-29T10:35:36Z

...l-core/src/main/java/org/elasticsearch/xpack/esql/core/expression/predicate/regex/RLike.java

Dropped the now useless proxy-class.

bpintea · 2025-05-29T10:37:30Z

...t/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/WildcardLikeTests.java

-        List<Object[]> cases = (List<Object[]>) RLikeTests.parameters(str -> {
-            for (String syntax : new String[] { "\\", "*" }) {
+        final Function<String, String> escapeString = str -> {
+            for (String syntax : new String[] { "\\", "*", "?" }) {


? needs escaping too for the wildcard matching.
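
A minimal sketch of such an escaping helper is shown below. Only the list of syntax characters comes from the diff; the loop body and the names `WildcardEscapeSketch` and `escapeWildcard` are assumptions.

```java
final class WildcardEscapeSketch {

    /** Escapes a literal string so that a wildcard pattern built from it matches it verbatim. */
    static String escapeWildcard(String literal) {
        String escaped = literal;
        // The backslash comes first; otherwise the backslashes added for '*' and '?'
        // would themselves be escaped a second time.
        for (String syntax : new String[] { "\\", "*", "?" }) {
            escaped = escaped.replace(syntax, "\\" + syntax);
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println(escapeWildcard("what*ever?"));
    }
}
```

Without `?` in the list, a generated test string containing a question mark would be read as a single-character wildcard and could match more than the intended literal.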

bpintea · 2025-05-29T10:43:13Z

...in/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/PhysicalPlanOptimizerTests.java

        // System.out.println("Physical\n" + physical);
        if (assertSerialization) {
-            assertSerialization(physical);
+            assertSerialization(physical, config);


Not sure how this didn't trigger earlier. The instances of certain functions (like TO_UPPER/_LOWER) contain references to the config, which contains the instant it was created at. So comparing these function instances before and after serialization compares the configs too, which need to be equal.

luigidellaquila

LGTM, thanks!

luigidellaquila · 2025-05-29T10:57:31Z

...gin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LocalLogicalPlanOptimizer.java

-        return newBatches;
+
+        // add rule that should only apply locally
+        newRules.add(new ReplaceStringCasingWithInsensitiveRegexMatch());


luigidellaquila · 2025-05-29T11:00:12Z

...in/java/org/elasticsearch/xpack/esql/expression/function/scalar/string/regex/RegexMatch.java

+                throw new EsqlIllegalArgumentException(
+                    name() + " with case insensitivity is not supported in peer node's version [{}]. Upgrade to version [{}] or newer.",
+                    out.getTransportVersion(),
+                    TransportVersions.ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY
+                );


This should never happen, but I like the idea of throwing an exception here, as a defense from possible future mistakes.

bpintea · 2025-05-30T08:54:28Z

Thanks folks!

) (#128750) (#128753) (#128919)

* ESQL: Pushdown constructs doing case-insensitive regexes (#128393)

  This introduces an optimization to push down to Lucene those language constructs that aim at case-insensitive regular expression matching, used with `LIKE` and `RLIKE` operators, such as:
  * `| WHERE TO_LOWER(field) LIKE "abc*"`
  * `| WHERE TO_UPPER(field) RLIKE "ABC.*"`

  These are now pushed as case-insensitive `wildcard` and `regexp` respectively queries down to Lucene. Closes #127479

  (cherry picked from commit 0a80916)

* ESQL: Fix conversion of a Lucene wildcard pattern to a regexp (#128750)

  This adds the reserved optional characters to the list that is escaped during conversion. These characters are all enabled by the `RegExp.ALL` flag in our use. Closes #128676, closes #128677.

  (cherry picked from commit 5eb54bf)

* ESQL: Fix case-insensitive test generation with Unicodes (#128753)

  This excludes from testing the strings containing Unicode chars that change length when changing case. Closes #128705 Closes #128706 Closes #128710 Closes #128711 Closes Closes #128717 Closes #128789 Closes #128790 Closes #128791 Closes

  (cherry picked from commit 092d4ba)

* [CI] Auto commit changes from spotless
* Java21 adaptations and automerge fixes
* [CI] Auto commit changes from spotless
* 8.x's Lucene/RegExp doesn't support case-insensitive matching
* [CI] Auto commit changes from spotless
* One more Lucene 9 fix

---------

Co-authored-by: elasticsearchmachine <[email protected]>

bpintea added >enhancement :Analytics/ES|QL AKA ESQL v9.1.0 labels May 23, 2025

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 23, 2025

Update docs/changelog/128393.yaml

3525880

[CI] Auto commit changes from spotless

f9b8e79

bpintea commented May 23, 2025

luigidellaquila self-requested a review May 23, 2025 15:22

luigidellaquila reviewed May 23, 2025

costin reviewed May 23, 2025

costin self-requested a review May 23, 2025 21:12

costin approved these changes May 23, 2025

bpintea and others added 9 commits May 26, 2025 22:33

Merge remote-tracking branch 'upstream/main' into enh/case_insensitivity_regex

ae057c6

Insensivite RegexMatch optimization applies locally

fdcdcb9

Drop useless proxy classes. Add insensitivity tests

07411e1

Merge remote-tracking branch 'upstream/main' into enh/case_insensitivity_regex

7fa6686

spotless

bc25956

[CI] Auto commit changes from spotless

17b6db2

Update test names

f70bc27

remove leftover

cb105fa

Merge remote-tracking branch 'upstream/main' into enh/case_insensitivity_regex

03c9202

bpintea commented May 29, 2025

luigidellaquila approved these changes May 29, 2025

Merge remote-tracking branch 'upstream/main' into enh/case_insensitivity_regex

10cd899

bpintea added the v8.19.0 label May 30, 2025

bpintea merged commit 0a80916 into elastic:main May 30, 2025
18 checks passed

bpintea deleted the enh/case_insensitivity_regex branch May 30, 2025 08:55

bpintea mentioned this pull request May 30, 2025

ESQL: Have CombineProjections propagate references upwards #127264

Merged

bpintea added the backport pending label May 30, 2025

bpintea removed the backport pending label Jun 5, 2025

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jun 5, 2025

Revert "[8.19] ESQL: Pushdown constructs doing case-insensitive regexes (elastic#128393) (elastic#128750) (elastic#128753) (elastic#128919)"

b2e0272

This reverts commit 34e08fb.

alex-spies mentioned this pull request Jun 10, 2025

[CI] MixedClusterEsqlSpecIT test {where-like.RlikeWithUpperTurnedInsensitive SYNC} failing #129103

Closed
