Skip to content

Conversation

@NWilson
Copy link
Member

@NWilson NWilson commented Oct 6, 2025

Fixes #736

@NWilson NWilson force-pushed the user/niwilson/block-bad-bsk branch from 59ef11f to 1621949 Compare October 7, 2025 11:14
}

/* Fail if we detect that the start position was moved to be either after
the end position (\K in lookahead) or before the start offset (\K in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the operation.

Why start offset is a problem? I think the only issue is when ovector(0) > ovector(1), which confuses some simple implementations. Nobody complained about startoffset before.

Do this happens when \K is executed, or as a post check?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Start offset is a problem, because when you are doing "search all matches" you don't want matches to overlap or go "backwards". The canonical list of matches should be ordered, non-overlapping, and without duplicates. I believe that many clients will expect this. For example, pcre2_substitute itself fails if the list of matches is overlapping.
  • The checks are done as a post check, after a match is accepted (much, much later than when \K is encountered). These checks do not cause backtracking: it simply turns an accepted match into an error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule is strange to me. As long as the end of the matches are <= ordered, the beginning should not matter when you do a global match. This is true now regardless of \K. Substitute (string insert) should not care where the match starts as long as it is <= than the end.

Copy link
Member Author

@NWilson NWilson Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter for substitute.

Here is the check that Philip added, long before my involvement:

if (ovector[1] < ovector[0] || ovector[0] < start_offset)

Here's one way to think about it: a match that ends before it starts is weird. But also, if the unmatched text (text in between matches) that is backwards is equally weird.

Any application that does something with the text in between matches, rather than just throwing it away, wants the matches to be non overlapping.

Example:
Subject "abcde", and matches "a", "cd", "de".
If you wanted to split the string on matches (very common), you'd get:

Empty unmatched text at start
a, match
b, unmatched
cd, match
!!! Unmatched d backwards
de, match
Empty unmatched text at end

Find and replace is based around the concept of string split. If you wanted to replace all those three matches with "zzz", what would the result even be? After replacing "cd" with "zzz", how do you even begin to then replace "de"?

The invariant must be that for both matches and the unmatched text in between, the substring start is before the end (or "not after").

Finally, even if applications only want a single match, it is surprising (and possibly broken) to get a match starting at offset 3, if you ask for searching to start from offset 4.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using max(start_offset, ovector[0]) is ok for replacing. Throwing errors is not necessary. Especially that the pattern cannot detect this case.

@NWilson NWilson force-pushed the user/niwilson/block-bad-bsk branch from c580431 to 60a83f3 Compare October 8, 2025 11:19
OP1(SLJIT_MOV, SLJIT_RETURN_REG, 0, SLJIT_IMM, PCRE2_ERROR_BAD_BACKSLASH_K);
add_jump(compiler, &common->abort, JUMP(SLJIT_LESS));
/* (ovector[0] > STR_PTR)? NB. ovector[1] hasn't yet been set to STR_PTR. */
add_jump(compiler, &common->abort, CMP(SLJIT_GREATER, TMP2, 0, STR_PTR, 0));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good is good. I am still unsure what would be the best way to handle \K. Throwing an error is one thing, but how this error should be handled by users?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected way to handle error is by... fixing the regex? Or at least, stopping iteration if you're searching for matches. This is an incredibly obscure use-case: the user has deliberately or accidentally used subroutines to escape out of the protection offered by Perl's rule disallowing \K within lookaround.

If users write these regexes still... well, I suppose the user can always enable the PCRE flag to allow it again, and handle it in their application. But 99% of applications won't be able to handle this gracefully, so it's good to forbid it by default.

@NWilson NWilson requested a review from zherczeg October 9, 2025 20:42
@NWilson
Copy link
Member Author

NWilson commented Oct 14, 2025

And this one, do you object to me merging?

@zherczeg
Copy link
Collaborator

Nope. Lets merge it.

@NWilson NWilson merged commit c742166 into master Oct 14, 2025
35 checks passed
@NWilson NWilson deleted the user/niwilson/block-bad-bsk branch October 14, 2025 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

\K in lookbehind/lookahead should be always invalid

3 participants