Skip to content

Conversation

john-wagster
Copy link
Contributor

fixed character class and ranges lacking optimizations after improvements to regexp in lucene (14193)

@john-wagster john-wagster requested a review from iverase April 2, 2025 16:33
@john-wagster john-wagster added >bug :Search Relevance/Search Catch all for Search Relevance v9.1.0 labels Apr 3, 2025
Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit on the fence here as we are trying to optimize more than before so we would need to expand our test and perform some performance test?

@john-wagster
Copy link
Contributor Author

I am a bit on the fence here as we are trying to optimize more than before so we would need to expand our test and perform some performance test?

@iverase Good question I'm not sure if we do need more tests? I believe this essentially is optimizing what was previously optimized? Maybe we are catching some additional cases but my guess is that those cases we probably did previously catch and then slowly as Lucene has been optimized away from Union operations we've been missing them here. So I would buy we just need more test coverage here in general?

And sorry I should have provided some of this detail in the summary. But why I think this is pretty close to the same is that previously a Union for character class and range was a combination of single characters that were part of that range (or a set of ranges as a character class). Those single code points were optimized into the from and to fields to limit poor Automata construction (which I happened to be generating when doing some of the related regex work and then Robert M helped fix a good bit in that linked Lucene PR). This means iterating over the set of code points from from to to should be equivalent to what the Union operation was doing previously, which is why the existing test now passes and produces the same optimized outcome as it did previously. Specifically this:

            {
                "[Pp][Oo][Ww][Ee][Rr][Ss][Hh][Ee][Ll][Ll]\\.[Ee][Xx][Ee]",
                "+_oo +oow +owe +weq +eqs +qsg +sge +gek +ekk +kk\\/ +k\\/e +\\/ew +ewe +we_ +e__" },

Having said that, I can completely understand the apprehension here. My only counter to that is that this particular use case will be a regression from prior versions in terms of performance. I'll defer to your / groups wisdom here though. I can put a PR that removes that ^ specific test instead for now, which should cause the test to pass as it's no longer being optimized. And target this PR against main for a subsequent release. Thoughts?

@iverase
Copy link
Contributor

iverase commented Apr 3, 2025

We are now optimising REGEXP_CHAR_RANGE where before we were not. This change makes me wonder if we might be adding cases where we will construct enormous boolean queries which can cause issues.

I am thinking if we might only support REGEXP_CHAR_CLASS and we only optimize it iff from == to. I think that covers what we are currently optimizing.

@iverase
Copy link
Contributor

iverase commented Apr 3, 2025

By the way, you might want to merge the latest changes in the lucene_snapshot branch to get rid of most of the CI issues.

@john-wagster
Copy link
Contributor Author

We are now optimising REGEXP_CHAR_RANGE where before we were not. This change makes me wonder if we might be adding cases where we will construct enormous boolean queries which can cause issues.

I am thinking if we might only support REGEXP_CHAR_CLASS and we only optimize it iff from == to. I think that covers what we are currently optimizing.

Makes sense and seems like a reasonable compromise. I've updated the code to reflect that. I'll pull this out of draft as it seems like we might be narrowing down to a good solution for now.

@john-wagster john-wagster marked this pull request as ready for review April 3, 2025 11:25
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Apr 3, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Copy link
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for iterating on it.

@john-wagster john-wagster merged commit 3b8a4e6 into elastic:lucene_snapshot Apr 3, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants