-
Notifications
You must be signed in to change notification settings - Fork 242
Add runtime checks for invalid uses of \K in lookaround #812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
59ef11f to
1621949
Compare
| } | ||
|
|
||
| /* Fail if we detect that the start position was moved to be either after | ||
| the end position (\K in lookahead) or before the start offset (\K in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the operation.
Why start offset is a problem? I think the only issue is when ovector(0) > ovector(1), which confuses some simple implementations. Nobody complained about startoffset before.
Do this happens when \K is executed, or as a post check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Start offset is a problem, because when you are doing "search all matches" you don't want matches to overlap or go "backwards". The canonical list of matches should be ordered, non-overlapping, and without duplicates. I believe that many clients will expect this. For example, pcre2_substitute itself fails if the list of matches is overlapping.
- The checks are done as a post check, after a match is accepted (much, much later than when \K is encountered). These checks do not cause backtracking: it simply turns an accepted match into an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule is strange to me. As long as the end of the matches are <= ordered, the beginning should not matter when you do a global match. This is true now regardless of \K. Substitute (string insert) should not care where the match starts as long as it is <= than the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does matter for substitute.
Here is the check that Philip added, long before my involvement:
Line 454 in d29e729
| if (ovector[1] < ovector[0] || ovector[0] < start_offset) |
Here's one way to think about it: a match that ends before it starts is weird. But also, if the unmatched text (text in between matches) that is backwards is equally weird.
Any application that does something with the text in between matches, rather than just throwing it away, wants the matches to be non overlapping.
Example:
Subject "abcde", and matches "a", "cd", "de".
If you wanted to split the string on matches (very common), you'd get:
Empty unmatched text at start
a, match
b, unmatched
cd, match
!!! Unmatched d backwards
de, match
Empty unmatched text at end
Find and replace is based around the concept of string split. If you wanted to replace all those three matches with "zzz", what would the result even be? After replacing "cd" with "zzz", how do you even begin to then replace "de"?
The invariant must be that for both matches and the unmatched text in between, the substring start is before the end (or "not after").
Finally, even if applications only want a single match, it is surprising (and possibly broken) to get a match starting at offset 3, if you ask for searching to start from offset 4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using max(start_offset, ovector[0]) is ok for replacing. Throwing errors is not necessary. Especially that the pattern cannot detect this case.
c580431 to
60a83f3
Compare
| OP1(SLJIT_MOV, SLJIT_RETURN_REG, 0, SLJIT_IMM, PCRE2_ERROR_BAD_BACKSLASH_K); | ||
| add_jump(compiler, &common->abort, JUMP(SLJIT_LESS)); | ||
| /* (ovector[0] > STR_PTR)? NB. ovector[1] hasn't yet been set to STR_PTR. */ | ||
| add_jump(compiler, &common->abort, CMP(SLJIT_GREATER, TMP2, 0, STR_PTR, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good is good. I am still unsure what would be the best way to handle \K. Throwing an error is one thing, but how this error should be handled by users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected way to handle error is by... fixing the regex? Or at least, stopping iteration if you're searching for matches. This is an incredibly obscure use-case: the user has deliberately or accidentally used subroutines to escape out of the protection offered by Perl's rule disallowing \K within lookaround.
If users write these regexes still... well, I suppose the user can always enable the PCRE flag to allow it again, and handle it in their application. But 99% of applications won't be able to handle this gracefully, so it's good to forbid it by default.
|
And this one, do you object to me merging? |
|
Nope. Lets merge it. |
Fixes #736