-
Notifications
You must be signed in to change notification settings - Fork 243
Add runtime checks for invalid uses of \K in lookaround #812
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
81b08f3
1621949
60a83f3
d5e178a
11afce6
ee12ec5
355c69c
ffc32d5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -1010,11 +1010,28 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode, | |||
| } | ||||
|
|
||||
| #ifdef DEBUG_SHOW_OPS | ||||
| fprintf(stderr, "++ Failed ACCEPT not at end (endanchnored set)\n"); | ||||
| fprintf(stderr, "++ Failed ACCEPT not at end (endanchored set)\n"); | ||||
| #endif | ||||
| return MATCH_NOMATCH; /* (*ACCEPT) */ | ||||
| } | ||||
|
|
||||
| /* Fail if we detect that the start position was moved to be either after | ||||
| the end position (\K in lookahead) or before the start offset (\K in | ||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure I understand the operation. Why start offset is a problem? I think the only issue is when ovector(0) > ovector(1), which confuses some simple implementations. Nobody complained about startoffset before. Do this happens when \K is executed, or as a post check?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This rule is strange to me. As long as the end of the matches are <= ordered, the beginning should not matter when you do a global match. This is true now regardless of \K. Substitute (string insert) should not care where the match starts as long as it is <= than the end.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It does matter for substitute. Here is the check that Philip added, long before my involvement: Line 454 in d29e729
Here's one way to think about it: a match that ends before it starts is weird. But also, if the unmatched text (text in between matches) that is backwards is equally weird. Any application that does something with the text in between matches, rather than just throwing it away, wants the matches to be non overlapping. Example: Empty unmatched text at start Find and replace is based around the concept of string split. If you wanted to replace all those three matches with "zzz", what would the result even be? After replacing "cd" with "zzz", how do you even begin to then replace "de"? The invariant must be that for both matches and the unmatched text in between, the substring start is before the end (or "not after"). Finally, even if applications only want a single match, it is surprising (and possibly broken) to get a match starting at offset 3, if you ask for searching to start from offset 4.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using |
||||
| lookbehind). If this occurs, the pattern must have used \K in a somewhat | ||||
| sneaky way (e.g. by pattern recursion), because if the \K is actually | ||||
| syntactically inside the lookaround, it's blocked at compile-time. */ | ||||
|
|
||||
| if (Fstart_match < mb->start_subject + mb->start_offset || | ||||
| Fstart_match > Feptr) | ||||
| { | ||||
| /* The \K expression is fairly rare. We assert it was used so that we | ||||
| catch any unexpected invalid data in start_match. */ | ||||
| PCRE2_ASSERT(mb->hasbsk); | ||||
|
|
||||
| if (!mb->allowlookaroundbsk) | ||||
| return PCRE2_ERROR_BAD_BACKSLASH_K; | ||||
| } | ||||
|
|
||||
| /* We have a successful match of the whole pattern. Record the result and | ||||
| then do a direct return from the function. If there is space in the offset | ||||
| vector, set any pairs that follow the highest-numbered captured string but | ||||
|
|
@@ -7393,8 +7410,11 @@ mb->start_offset = start_offset; | |||
| mb->end_subject = end_subject; | ||||
| mb->true_end_subject = true_end_subject; | ||||
| mb->hasthen = (re->flags & PCRE2_HASTHEN) != 0; | ||||
| mb->hasbsk = (re->flags & PCRE2_HASBSK) != 0; | ||||
| mb->allowemptypartial = (re->max_lookbehind > 0) || | ||||
| (re->flags & PCRE2_MATCH_EMPTY) != 0; | ||||
| mb->allowlookaroundbsk = | ||||
| (re->extra_options & PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK) != 0; | ||||
| mb->poptions = re->overall_options; /* Pattern options */ | ||||
| mb->ignore_skip_arg = 0; | ||||
| mb->mark = mb->nomatch_mark = NULL; /* In case never set */ | ||||
|
|
||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good is good. I am still unsure what would be the best way to handle \K. Throwing an error is one thing, but how this error should be handled by users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected way to handle error is by... fixing the regex? Or at least, stopping iteration if you're searching for matches. This is an incredibly obscure use-case: the user has deliberately or accidentally used subroutines to escape out of the protection offered by Perl's rule disallowing \K within lookaround.
If users write these regexes still... well, I suppose the user can always enable the PCRE flag to allow it again, and handle it in their application. But 99% of applications won't be able to handle this gracefully, so it's good to forbid it by default.