Remove duplicated scan substring captures#710
Conversation
|
For example test1 has 69 patterns where greater happens. Some of them might be wrong. |
9f37e19 to
dc1a5e7
Compare
|
Do we really need to remove duplicates? If a user (strangely) writes a scan_substring or We don't need to "optimise" things that are unnatural. Only handle the case if the code somehow breaks when there are duplicates. |
|
Not necessary, but useful in a few cases (e.g. machine generated patterns). The code is not that big either. |
dc1a5e7 to
c5c74da
Compare
|
I can remove the duplication check if you insist on it. I think the patch still improves the code without it. |
NWilson
left a comment
There was a problem hiding this comment.
You're right, the code is better overall. I've left some comments on a couple of things though.
I don't want to block you if you disagree. They're not big things, and I don't feel too strongly.
| SKIPOFFSET(pptr); | ||
| continue; | ||
|
|
||
| case META_CAPTURE_NAME: |
There was a problem hiding this comment.
Err. I don't know how I feel about this, how you've moved the META_CAPTURE_NAME/NUMBER code so it's now hanging under the META_OFFSET handling. I'd named & implemented META_OFFSET in an intentionally generic way, so it could be used outside of scan-substring capture lists.
Do we have to make this change? Maybe we should just merge META_OFFSET and META_SCS now, if you do want to go in this direction.
There was a problem hiding this comment.
I don't think they can be merged. META_SCS starts a recursive call to process its block, and another opcode is needed to show that we are processing scan substring. We could change the code generator to support optional arguments (it currently expects a fixed size byte code), but that is complex enough.
Option: the 16 bit arg of META_OFFSET is not used, it could represent the different types of blocks if needed.
| if (meta == META_CAPTURE_NAME) | ||
| { | ||
| code += 1 + IMM2_SIZE; | ||
| break; | ||
| } | ||
|
|
There was a problem hiding this comment.
On the other hand I do agree that all these META_CAPTURE_NAME/NUMBER special cases were really ugly down here, and it's definitely better to be able to move them somewhere else where they can be handled together.
src/pcre2_compile_cgroup.c
Outdated
| if (PRIV(compile_is_capture_checked)(pptr[-1], captures, captures_end)) | ||
| { | ||
| pptr[-1] = 0; | ||
| continue; | ||
| } |
There was a problem hiding this comment.
I see what you mean. It really isn't much code to check for duplicates.
At the moment, regex compiling is O(n^2) because we do several linear searches in various places. I'm planning to hunt down and kill them, to get rid of denial-of-service attacks.
(There was a post about this on the pcre2-dev mailing list a while back. Some Erlang users wanted to be able to bound the amount of time that PCRE2 blocks the main loop. Making the parser worst-case O(n logn) would be a nice promise to make to users about how bad the pathological cases can get.)
Would it be really pedantic and annoying to drop the duplicate search, or else make a temporary sorted copy and do it that way instead?
No users will ever enter duplicates anyway.
There was a problem hiding this comment.
I can use a bitset (each capture is one bit) as an alternative. The memory needed is (max_capture+7/8) bytes.
c5c74da to
979782c
Compare
|
The algorithm is changed to linear. |
|
I plan to land this patch soon |
Terrific, thank you for the reminder to review it Zoltan and for making the changes. |
This patch optimizes the argument list of scan substring by removing duplications. Multinames are only removed if all corresponding captures are removed as well.
Not a simple patch unfortunately.
It would not be bad, if equality here could be checked somehow:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L10854
The greater case is only needed for a few patterns. Don't know how to do this. Maybe tests should contain a
#option to enforce equality, and that could be disabled for specific patterns.