
Remove duplicated scan substring captures #710

Merged
zherczeg merged 1 commit into PCRE2Project:master from zherczeg:optimize_scs_lists
Mar 4, 2025

Conversation

@zherczeg
Collaborator

This patch optimizes the argument list of scan substring by removing duplicate entries. Multinames are only removed if all of their corresponding captures are removed as well.

Not a simple patch unfortunately.

It would be good if equality could somehow be checked here:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L10854

The greater-than case is only needed for a few patterns, and I don't know how to verify this. Maybe the tests should have a # option that enforces equality, which could be disabled for specific patterns.

@zherczeg
Collaborator Author

For example, test 1 has 69 patterns where the greater-than case occurs. Some of them might be wrong.

@NWilson
Member

NWilson commented Feb 27, 2025

Do we really need to remove duplicates? If a user (strangely) writes a scan_substring or (?N:(...)) list with duplicated entries, can't we just ignore that, and process them left-to-right without caring?

We don't need to "optimise" things that are unnatural. We only need to handle this case if the code somehow breaks when there are duplicates.

@zherczeg
Collaborator Author

It's not necessary, but it is useful in a few cases (e.g. machine-generated patterns). The code is not that big either.

@zherczeg
Collaborator Author

I can remove the duplication check if you insist on it. I think the patch still improves the code without it.

Member

@NWilson left a comment


You're right, the code is better overall. I've left some comments on a couple of things though.

I don't want to block you if you disagree. They're not big things, and I don't feel too strongly.

SKIPOFFSET(pptr);
continue;

case META_CAPTURE_NAME:
Member


Err, I don't know how I feel about this: you've moved the META_CAPTURE_NAME/NUMBER code so that it now hangs under the META_OFFSET handling. I'd named and implemented META_OFFSET in an intentionally generic way, so that it could be used outside of scan-substring capture lists.

Do we have to make this change? Maybe we should just merge META_OFFSET and META_SCS now, if you do want to go in this direction.

Collaborator Author


I don't think they can be merged. META_SCS starts a recursive call to process its block, and another opcode is needed to show that we are processing scan substring. We could change the code generator to support optional arguments (it currently expects fixed-size byte code), but that is complex enough.

One option: the 16-bit argument of META_OFFSET is currently unused; it could represent the different block types if needed.

Comment on lines -6595 to -6600
if (meta == META_CAPTURE_NAME)
  {
  code += 1 + IMM2_SIZE;
  break;
  }

Member


On the other hand I do agree that all these META_CAPTURE_NAME/NUMBER special cases were really ugly down here, and it's definitely better to be able to move them somewhere else where they can be handled together.

if (PRIV(compile_is_capture_checked)(pptr[-1], captures, captures_end))
  {
  pptr[-1] = 0;
  continue;
  }
Member


I see what you mean. It really isn't much code to check for duplicates.

At the moment, regex compiling is O(n^2) because we do several linear searches in various places. I'm planning to hunt them down and eliminate them, to get rid of denial-of-service attacks.

(There was a post about this on the pcre2-dev mailing list a while back. Some Erlang users wanted to be able to bound the amount of time that PCRE2 blocks the main loop. Making the parser worst-case O(n log n) would be a nice promise to make to users about how bad the pathological cases can get.)

Would it be really pedantic and annoying to ask you to drop the duplicate search, or else to make a temporary sorted copy and search that instead?

No users will ever enter duplicates anyway.

Collaborator Author


I can use a bitset (one bit per capture) as an alternative. The memory needed is (max_capture + 7) / 8 bytes.

@zherczeg zherczeg force-pushed the optimize_scs_lists branch from c5c74da to 979782c on March 1, 2025 07:31
@zherczeg
Collaborator Author

zherczeg commented Mar 1, 2025

The algorithm has been changed to be linear.

@zherczeg
Collaborator Author

zherczeg commented Mar 3, 2025

I plan to land this patch soon

@NWilson
Member

NWilson commented Mar 3, 2025

> I plan to land this patch soon

Terrific! Thank you for the reminder to review it, Zoltan, and for making the changes.

@zherczeg zherczeg merged commit e1737b5 into PCRE2Project:master Mar 4, 2025
34 checks passed
@zherczeg zherczeg deleted the optimize_scs_lists branch March 4, 2025 05:25
