Add runtime checks for invalid uses of \K in lookaround #812

NWilson · 2025-10-06T09:30:52Z

Fixes #736

src/pcre2_jit_match_inc.h

src/pcre2_jit_compile.c

zherczeg · 2025-10-07T14:01:56Z

src/pcre2_match.c

      }

+    /* Fail if we detect that the start position was moved to be either after
+    the end position (\K in lookahead) or before the start offset (\K in


I am not sure I understand the operation.

Why start offset is a problem? I think the only issue is when ovector(0) > ovector(1), which confuses some simple implementations. Nobody complained about startoffset before.

Do this happens when \K is executed, or as a post check?

Start offset is a problem, because when you are doing "search all matches" you don't want matches to overlap or go "backwards". The canonical list of matches should be ordered, non-overlapping, and without duplicates. I believe that many clients will expect this. For example, pcre2_substitute itself fails if the list of matches is overlapping.

The checks are done as a post check, after a match is accepted (much, much later than when \K is encountered). These checks do not cause backtracking: it simply turns an accepted match into an error.

This rule is strange to me. As long as the end of the matches are <= ordered, the beginning should not matter when you do a global match. This is true now regardless of \K. Substitute (string insert) should not care where the match starts as long as it is <= than the end.

It does matter for substitute.

Here is the check that Philip added, long before my involvement:

pcre2/src/pcre2_substitute.c

Line 454 in d29e729

if (ovector[1] < ovector[0] || ovector[0] < start_offset)

Here's one way to think about it: a match that ends before it starts is weird. But also, if the unmatched text (text in between matches) that is backwards is equally weird.

Any application that does something with the text in between matches, rather than just throwing it away, wants the matches to be non overlapping.

Example:
Subject "abcde", and matches "a", "cd", "de".
If you wanted to split the string on matches (very common), you'd get:

Empty unmatched text at start
a, match
b, unmatched
cd, match
!!! Unmatched d backwards
de, match
Empty unmatched text at end

Find and replace is based around the concept of string split. If you wanted to replace all those three matches with "zzz", what would the result even be? After replacing "cd" with "zzz", how do you even begin to then replace "de"?

The invariant must be that for both matches and the unmatched text in between, the substring start is before the end (or "not after").

Finally, even if applications only want a single match, it is surprising (and possibly broken) to get a match starting at offset 3, if you ask for searching to start from offset 4.

Using max(start_offset, ovector[0]) is ok for replacing. Throwing errors is not necessary. Especially that the pattern cannot detect this case.

src/pcre2_jit_compile.c

zherczeg · 2025-10-09T18:11:57Z

src/pcre2_jit_compile.c

+  OP1(SLJIT_MOV, SLJIT_RETURN_REG, 0, SLJIT_IMM, PCRE2_ERROR_BAD_BACKSLASH_K);
+  add_jump(compiler, &common->abort, JUMP(SLJIT_LESS));
+  /* (ovector[0] > STR_PTR)?  NB. ovector[1] hasn't yet been set to STR_PTR. */
+  add_jump(compiler, &common->abort, CMP(SLJIT_GREATER, TMP2, 0, STR_PTR, 0));


The good is good. I am still unsure what would be the best way to handle \K. Throwing an error is one thing, but how this error should be handled by users?

The expected way to handle error is by... fixing the regex? Or at least, stopping iteration if you're searching for matches. This is an incredibly obscure use-case: the user has deliberately or accidentally used subroutines to escape out of the protection offered by Perl's rule disallowing \K within lookaround.

If users write these regexes still... well, I suppose the user can always enable the PCRE flag to allow it again, and handle it in their application. But 99% of applications won't be able to handle this gracefully, so it's good to forbid it by default.

NWilson · 2025-10-14T15:05:38Z

And this one, do you object to me merging?

zherczeg · 2025-10-14T15:20:46Z

Nope. Lets merge it.

Add runtime checks for invalid uses of \K in lookaround

81b08f3

NWilson commented Oct 6, 2025

View reviewed changes

src/pcre2_jit_match_inc.h Outdated Show resolved Hide resolved

NWilson commented Oct 7, 2025

View reviewed changes

src/pcre2_jit_compile.c Show resolved Hide resolved

Attempts at JIT code

1621949

NWilson force-pushed the user/niwilson/block-bad-bsk branch from 59ef11f to 1621949 Compare October 7, 2025 11:14

zherczeg reviewed Oct 7, 2025

View reviewed changes

Fix my JIT bug

60a83f3

NWilson force-pushed the user/niwilson/block-bad-bsk branch from c580431 to 60a83f3 Compare October 8, 2025 11:19

Fix x86 (32-bit) build; TMP3 can't be used as a source register

d5e178a

zherczeg requested changes Oct 8, 2025

View reviewed changes

src/pcre2_jit_compile.c Outdated Show resolved Hide resolved

src/pcre2_jit_compile.c Outdated Show resolved Hide resolved

NWilson added 2 commits October 8, 2025 17:34

Remove unnecessary registers

11afce6

Remove one of the jumps

ee12ec5

zherczeg reviewed Oct 8, 2025

View reviewed changes

src/pcre2_jit_compile.c Outdated Show resolved Hide resolved

Yet another attempt at JIT code

355c69c

zherczeg reviewed Oct 9, 2025

View reviewed changes

src/pcre2_jit_compile.c Outdated Show resolved Hide resolved

Yet another attempt... removes use of TMP3

ffc32d5

zherczeg reviewed Oct 9, 2025

View reviewed changes

NWilson requested a review from zherczeg October 9, 2025 20:42

NWilson merged commit c742166 into master Oct 14, 2025
35 checks passed

NWilson deleted the user/niwilson/block-bad-bsk branch October 14, 2025 17:01

Add runtime checks for invalid uses of \K in lookaround #812

Add runtime checks for invalid uses of \K in lookaround #812

Conversation

NWilson commented Oct 6, 2025

Uh oh!

Uh oh!

Uh oh!

zherczeg Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

NWilson Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

zherczeg Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

NWilson Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zherczeg Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zherczeg Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

NWilson Oct 9, 2025

Choose a reason for hiding this comment

Uh oh!

NWilson commented Oct 14, 2025

Uh oh!

zherczeg commented Oct 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NWilson Oct 8, 2025 •

edited

Loading