
Remove duplicated scan substring captures #710

Merged
zherczeg merged 1 commit into PCRE2Project:master from zherczeg:optimize_scs_lists
Mar 4, 2025

Conversation

@zherczeg
Collaborator

This patch optimizes the argument list of scan substring by removing duplicate entries. Multinames are only removed if all of their corresponding captures are removed as well.

Not a simple patch unfortunately.

It would be good if equality could somehow be checked here:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L10854

The greater-than case is only needed for a few patterns, and I don't know how to verify this. Maybe the tests should have a # option that enforces equality, which could be disabled for specific patterns.

@zherczeg
Collaborator Author

For example, test 1 has 69 patterns where the greater-than case occurs. Some of them might be wrong.

@NWilson
Member

NWilson commented Feb 27, 2025

Do we really need to remove duplicates? If a user (strangely) writes a scan_substring or (?N:(...)) list with duplicated entries, can't we just ignore that, and process them left-to-right without caring?

We don't need to "optimise" things that are unnatural. We only need to handle this case if the code somehow breaks when there are duplicates.

@zherczeg
Collaborator Author

It's not necessary, but it is useful in a few cases (e.g. machine-generated patterns). The code is not that big either.

@zherczeg
Collaborator Author

I can remove the duplication check if you insist on it. I think the patch still improves the code without it.

Member

@NWilson left a comment


You're right, the code is better overall. I've left some comments on a couple of things though.

I don't want to block you if you disagree. They're not big things, and I don't feel too strongly.

SKIPOFFSET(pptr);
continue;

case META_CAPTURE_NAME:
Member


Err, I don't know how I feel about this: you've moved the META_CAPTURE_NAME/NUMBER code so that it now hangs under the META_OFFSET handling. I'd named and implemented META_OFFSET in an intentionally generic way, so that it could be used outside of scan-substring capture lists.

Do we have to make this change? Maybe we should just merge META_OFFSET and META_SCS now, if you do want to go in this direction.

Collaborator Author


I don't think they can be merged. META_SCS starts a recursive call to process its block, and another opcode is needed to show that we are processing scan substring. We could change the code generator to support optional arguments (it currently expects fixed-size byte code), but that is complex enough.

One option: the 16-bit argument of META_OFFSET is currently unused; it could represent the different block types if needed.

Comment on lines -6595 to -6600
if (meta == META_CAPTURE_NAME)
  {
  code += 1 + IMM2_SIZE;
  break;
  }

Member


On the other hand I do agree that all these META_CAPTURE_NAME/NUMBER special cases were really ugly down here, and it's definitely better to be able to move them somewhere else where they can be handled together.

if (PRIV(compile_is_capture_checked)(pptr[-1], captures, captures_end))
  {
  pptr[-1] = 0;
  continue;
  }
Member


I see what you mean. It really isn't much code to check for duplicates.

At the moment, regex compiling is O(n^2) because we do several linear searches in various places. I'm planning to hunt them down and eliminate them, to get rid of denial-of-service attacks.

(There was a post about this on the pcre2-dev mailing list a while back. Some Erlang users wanted to be able to bound the amount of time that PCRE2 blocks the main loop. Making the parser worst-case O(n log n) would be a nice promise to make to users about how bad the pathological cases can get.)

Would it be really pedantic and annoying to ask you to drop the duplicate search, or else to make a temporary sorted copy and search that instead?

No users will ever enter duplicates anyway.

Collaborator Author


I can use a bitset (one bit per capture) as an alternative. The memory needed is (max_capture + 7) / 8 bytes.

@zherczeg zherczeg force-pushed the optimize_scs_lists branch from c5c74da to 979782c on March 1, 2025 07:31
@zherczeg
Collaborator Author

zherczeg commented Mar 1, 2025

The algorithm has been changed to be linear.

@zherczeg
Collaborator Author

zherczeg commented Mar 3, 2025

I plan to land this patch soon

@NWilson
Member

NWilson commented Mar 3, 2025

> I plan to land this patch soon

Terrific! Thank you for the reminder to review it, Zoltan, and for making the changes.

@zherczeg zherczeg merged commit e1737b5 into PCRE2Project:master Mar 4, 2025
34 checks passed
@zherczeg zherczeg deleted the optimize_scs_lists branch March 4, 2025 05:25
