Fixed Character Class and Ranges WildCardQuery Optimizations #126154

john-wagster · 2025-04-02T16:33:12Z

Fixes the character class and range optimizations that were lost after the regexp improvements in Lucene (14193).

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

iverase

I am a bit on the fence here, as we are trying to optimize more than before. Would we need to expand our tests and run some performance tests?

john-wagster · 2025-04-03T11:01:33Z

> I am a bit on the fence here, as we are trying to optimize more than before. Would we need to expand our tests and run some performance tests?

@iverase Good question. I'm not sure we need more tests specifically for this change. I believe this essentially optimizes what was previously optimized. We may be catching some additional cases, but my guess is that we did catch those cases previously and then, as Lucene was gradually optimized away from Union operations, we started missing them here. So I would buy that we just need more test coverage here in general.

And sorry, I should have provided some of this detail in the summary. The reason I think this is pretty close to the same behavior is that previously a character class or range was represented as a Union of the single characters in that range (or in the set of ranges, for a character class). Those single code points were later folded into the from and to fields to limit poor Automata construction (which I happened to be generating when doing some of the related regex work, and which Robert M helped fix a good bit of in that linked Lucene PR). This means iterating over the code points between from and to (inclusive) should be equivalent to what the Union operation was doing previously, which is why the existing test now passes and produces the same optimized outcome as it did before. Specifically this:

            {
                "[Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll]\\.[Ee][Xx][Ee]",
                "+_oo +oow +owe +weq +eqs +qsg +sge +gek +ekk +kk\\/ +k\\/e +\\/ew +ewe +we_ +e__" },

Having said that, I can completely understand the apprehension here. My only counter is that, without this change, this particular use case will be a performance regression from prior versions. I'll defer to your and the group's judgment here, though. I can put up a PR that removes that specific test (^) for now, which should let the test suite pass since that case is no longer being optimized, and then target this PR against main for a subsequent release. Thoughts?
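The equivalence argument in the comment above can be illustrated with a minimal, self-contained sketch. It does not use Lucene's RegExp classes; the class name CharClassExpansion, the method names, and the parallel from/to arrays are all hypothetical stand-ins chosen for illustration. It only shows that walking each inclusive from..to range yields the same set of code points as a union of single-character nodes would, using the [Pp] class from the test case.

```java
import java.util.Set;
import java.util.TreeSet;

final class CharClassExpansion {

    // Hypothetical model of the older shape: a union of single-character nodes,
    // represented here simply as the list of code points in the union.
    static Set<Integer> unionOfSingles(int[] codePoints) {
        Set<Integer> out = new TreeSet<>();
        for (int cp : codePoints) {
            out.add(cp);
        }
        return out;
    }

    // Hypothetical model of the newer shape: parallel arrays of inclusive bounds,
    // one (from[i], to[i]) pair per range in the character class.
    static Set<Integer> expandRanges(int[] from, int[] to) {
        if (from.length != to.length) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        Set<Integer> out = new TreeSet<>();
        for (int i = 0; i < from.length; i++) {
            // Visit every code point in the inclusive range.
            for (int cp = from[i]; cp <= to[i]; cp++) {
                out.add(cp);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [Pp] as two single-code-point ranges.
        int[] from = { 'P', 'p' };
        int[] to = { 'P', 'p' };
        Set<Integer> viaRanges = expandRanges(from, to);
        Set<Integer> viaUnion = unionOfSingles(new int[] { 'P', 'p' });
        System.out.println(viaRanges.equals(viaUnion)); // prints true
    }
}
```

Under these assumptions the two representations agree for any class whose ranges are fully enumerated, which is the property the existing test relies on.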

iverase · 2025-04-03T11:11:13Z

We are now optimising REGEXP_CHAR_RANGE where before we were not. This change makes me wonder whether we might be adding cases where we construct enormous boolean queries, which can cause issues.

I am wondering whether we should only support REGEXP_CHAR_CLASS, and only optimize it iff from == to. I think that covers what we are currently optimizing.
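A minimal sketch of the guard proposed above, under stated assumptions: the class CharClassGuard and its methods are hypothetical, and the character class is again modelled as parallel from/to arrays rather than Lucene's actual RegExp node. The first method accepts a class only when every range is a single code point (from == to); the second counts how many alternatives a naive range expansion would produce, to show why wide ranges are a concern.

```java
final class CharClassGuard {

    // True only when every range in the class covers exactly one code point,
    // i.e. from[i] == to[i] for all i.
    static boolean isOptimizable(int[] from, int[] to) {
        if (from.length != to.length) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        for (int i = 0; i < from.length; i++) {
            if (from[i] != to[i]) {
                return false;
            }
        }
        return true;
    }

    // Number of single-code-point alternatives a naive expansion of the
    // inclusive ranges would produce.
    static long alternatives(int[] from, int[] to) {
        long count = 0;
        for (int i = 0; i < from.length; i++) {
            count += (long) to[i] - from[i] + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] pFrom = { 'P', 'p' }, pTo = { 'P', 'p' }; // [Pp]
        int[] azFrom = { 'a' }, azTo = { 'z' };         // [a-z]

        System.out.println(isOptimizable(pFrom, pTo));   // prints true
        System.out.println(alternatives(pFrom, pTo));    // prints 2
        System.out.println(isOptimizable(azFrom, azTo)); // prints false
        System.out.println(alternatives(azFrom, azTo));  // prints 26
    }
}
```

With this guard, a case-insensitive pattern such as [Pp][Oo][Ww]... stays optimizable, while a class containing a real range is left unoptimized instead of being expanded into many clauses.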

iverase · 2025-04-03T11:20:53Z

By the way, you might want to merge the latest changes in the lucene_snapshot branch to get rid of most of the CI issues.

john-wagster · 2025-04-03T11:25:06Z

> We are now optimising REGEXP_CHAR_RANGE where before we were not. This change makes me wonder whether we might be adding cases where we construct enormous boolean queries, which can cause issues.
>
> I am wondering whether we should only support REGEXP_CHAR_CLASS, and only optimize it iff from == to. I think that covers what we are currently optimizing.

Makes sense and seems like a reasonable compromise. I've updated the code to reflect that. I'll pull this out of draft as it seems like we might be narrowing down to a good solution for now.

elasticsearchmachine · 2025-04-03T11:25:31Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java

iverase

LGTM. Thanks for iterating on it.

john-wagster added 2 commits April 2, 2025 11:30

fixed character class and ranges lacking optimizations after improvem…

5b78223

…ents to regexp in lucene (14193)

iter

bdbb404

john-wagster requested a review from iverase April 2, 2025 16:33

john-wagster added 2 commits April 2, 2025 11:35

spotless

1903f44

iter

c8e4834

john-wagster added the >bug, :Search Relevance/Search, and v9.1.0 labels Apr 3, 2025

iverase reviewed Apr 3, 2025

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java (outdated)

iverase reviewed Apr 3, 2025

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java (outdated)

iverase reviewed Apr 3, 2025

iter

a958631

john-wagster added 2 commits April 3, 2025 06:20

iter

e622ec1

Merge branch 'lucene_snapshot' into wildcarfieldmapper_optimizations

2ff7aa9

john-wagster marked this pull request as ready for review April 3, 2025 11:25

elasticsearchmachine added the Team:Search Relevance label Apr 3, 2025

iverase reviewed Apr 3, 2025

...ugin/wildcard/src/main/java/org/elasticsearch/xpack/wildcard/mapper/WildcardFieldMapper.java (outdated)

iter

1567fd5

iverase approved these changes Apr 3, 2025

john-wagster merged commit 3b8a4e6 into elastic:lucene_snapshot Apr 3, 2025
16 checks passed
