Commit 8e83acc

Upgrade interpreter to match JIT in handling of nested pattern recursions
1 parent 86919c9 commit 8e83acc

13 files changed, 87 additions and 62 deletions

ChangeLog

Lines changed: 6 additions & 2 deletions
@@ -172,8 +172,12 @@ undefined behaviour.

 47. Refactor the handling of whole-pattern recursion (?0) in pcre2_match() so
 that its end is handled similarly to other recursions. This has altered the
-behaviour of /|(?0)./endanchored which was previously not right. However,
-it still differs from JIT.
+behaviour of /|(?0)./endanchored which was previously not right.
+
+48. Improved the test for looping recursion by checking the last referenced
+character as well as the current character. This allows some patterns that
+previously triggered the check to run to completion instead of giving the loop
+error.


 Version 10.42 11-December-2022

doc/html/pcre2compat.html

Lines changed: 7 additions & 1 deletion
@@ -246,6 +246,12 @@ <h1>pcre2compat man page</h1>
 PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
 to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
 </P>
+<P>
+22. Some recursive patterns that Perl diagnoses as infinite recursions can be
+handled by PCRE2, either by the interpreter or the JIT. An example is
+/(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number of repeated
+"abcd" substrings at the end of the subject.
+</P>
 <br><b>
 AUTHOR
 </b><br>
@@ -261,7 +267,7 @@ <h1>pcre2compat man page</h1>
 REVISION
 </b><br>
 <P>
-Last updated: 12 October 2023
+Last updated: 30 November 2023
 <br>
 Copyright &copy; 1997-2023 University of Cambridge.
 <br>

doc/pcre2.txt

Lines changed: 7 additions & 2 deletions
@@ -5256,6 +5256,11 @@ DIFFERENCES BETWEEN PCRE2 AND PERL
 to Perl's "/u"; something closer to "/a" could be selected by adding
 other PCRE2_EXTRA_ASCII* options on top.

+22. Some recursive patterns that Perl diagnoses as infinite recursions
+can be handled by PCRE2, either by the interpreter or the JIT. An exam-
+ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number
+of repeated "abcd" substrings at the end of the subject.
+

 AUTHOR

@@ -5266,11 +5271,11 @@ AUTHOR

 REVISION

-Last updated: 12 October 2023
+Last updated: 30 November 2023
 Copyright (c) 1997-2023 University of Cambridge.


-PCRE2 10.43 19 September 2023 PCRE2COMPAT(3)
+PCRE2 10.43 30 November 2023 PCRE2COMPAT(3)
 ------------------------------------------------------------------------------


doc/pcre2compat.3

Lines changed: 7 additions & 2 deletions
@@ -1,4 +1,4 @@
-.TH PCRE2COMPAT 3 "19 September 2023" "PCRE2 10.43"
+.TH PCRE2COMPAT 3 "30 November 2023" "PCRE2 10.43"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
@@ -210,6 +210,11 @@ fall into any stack-overflow limit. PCRE2 made a similar change at release
 to set characters by context just like Perl's "/d". A regular expression using
 PCRE2_UTF and PCRE2_UCP will use similar rules to Perl's "/u"; something closer
 to "/a" could be selected by adding other PCRE2_EXTRA_ASCII* options on top.
+.P
+22. Some recursive patterns that Perl diagnoses as infinite recursions can be
+handled by PCRE2, either by the interpreter or the JIT. An example is
+/(?:|(?0)abcd)(?(R)|\ez)/, which matches a sequence of any number of repeated
+"abcd" substrings at the end of the subject.
 .
 .
 .SH AUTHOR
@@ -226,6 +231,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 12 October 2023
+Last updated: 30 November 2023
 Copyright (c) 1997-2023 University of Cambridge.
 .fi

doc/pcre2demo.3

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-.TH PCRE2DEMO 3 "24 November 2023" "PCRE2 10.43-DEV"
+.TH PCRE2DEMO 3 "30 November 2023" "PCRE2 10.43-DEV"
 .\"AUTOMATICALLY GENERATED BY PrepareRelease - do not EDIT!
 .SH NAME
 // - A demonstration C program for PCRE2 - //

maint/README

Lines changed: 1 addition & 4 deletions
@@ -448,12 +448,9 @@ years.
 gigabyte memory, but perhaps another implementation might be considered.
 Needs coordination between the interpreters and JIT.

-. There are regular requests for variable-length lookbehinds. An implementation
-exists but is missing JIT support.
-
 . See also any suggestions in the GitHub issues.

 Philip Hazel
 Email local part: Philip.Hazel
 Email domain: gmail.com
-Last updated: 30 September 2023
+Last updated: 30 November 2023

src/pcre2_dfa_match.c

Lines changed: 11 additions & 6 deletions
@@ -2913,7 +2913,6 @@ for (;;)
         int *local_workspace;
         PCRE2_SIZE *local_offsets;
         RWS_anchor *rws = (RWS_anchor *)RWS;
-        dfa_recursion_info *ri;
         PCRE2_SPTR callpat = start_code + GET(code, 1);
         uint32_t recno = (callpat == mb->start_code)? 0 :
           GET2(callpat, 1 + LINK_SIZE);
@@ -2930,18 +2929,24 @@ for (;;)
         rws->free -= RWS_RSIZE + RWS_OVEC_RSIZE;

         /* Check for repeating a recursion without advancing the subject
-        pointer. This should catch convoluted mutual recursions. (Some simple
-        cases are caught at compile time.) */
+        pointer or last used character. This should catch convoluted mutual
+        recursions. (Some simple cases are caught at compile time.) */

-        for (ri = mb->recursive; ri != NULL; ri = ri->prevrec)
-          if (recno == ri->group_num && ptr == ri->subject_position)
+        for (dfa_recursion_info *ri = mb->recursive;
+             ri != NULL;
+             ri = ri->prevrec)
+          {
+          if (recno == ri->group_num && ptr == ri->subject_position &&
+              mb->last_used_ptr == ri->last_used_ptr)
             return PCRE2_ERROR_RECURSELOOP;
+          }

         /* Remember this recursion and where we started it so as to
         catch infinite loops. */

         new_recursive.group_num = recno;
         new_recursive.subject_position = ptr;
+        new_recursive.last_used_ptr = mb->last_used_ptr;
         new_recursive.prevrec = mb->recursive;
         mb->recursive = &new_recursive;

@@ -4015,7 +4020,7 @@ for (;;)
     }
   match_data->subject_length = length;
   match_data->leftchar = (PCRE2_SIZE)(mb->start_used_ptr - subject);
-  match_data->rightchar = (PCRE2_SIZE)( mb->last_used_ptr - subject);
+  match_data->rightchar = (PCRE2_SIZE)(mb->last_used_ptr - subject);
   match_data->startchar = (PCRE2_SIZE)(start_match - subject);
   match_data->rc = rc;
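
The revised DFA check above can be read in isolation as a walk over a linked list of active recursions. The following is a minimal sketch, not the PCRE2 source: `rec_info` and `is_recursion_loop` are hypothetical names, and `const char *` stands in for `PCRE2_SPTR`. It only illustrates that a repeat is reported when the group number, the subject pointer, and the last used character are all unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for dfa_recursion_info. */
typedef struct rec_info {
  struct rec_info *prevrec;      /* enclosing active recursion, or NULL */
  const char *subject_position;  /* subject pointer when the recursion began */
  const char *last_used_ptr;     /* last used character recorded at that time */
  uint32_t group_num;            /* group recursed into (0 = whole pattern) */
} rec_info;

/* Returns 1 if starting recursion `recno` now would repeat an active
   recursion of the same group with neither the subject pointer nor the
   last used character having changed; returns 0 otherwise. */
int is_recursion_loop(const rec_info *active, uint32_t recno,
                      const char *ptr, const char *last_used)
{
  for (const rec_info *ri = active; ri != NULL; ri = ri->prevrec)
    {
    if (recno == ri->group_num && ptr == ri->subject_position &&
        last_used == ri->last_used_ptr)
      return 1;
    }
  return 0;
}
```

Under this sketch, a second recursion into the same group at the same subject position is no longer treated as a loop if the last used character has changed in between, which corresponds to the behaviour described in ChangeLog item 48.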

src/pcre2_intmodedep.h

Lines changed: 18 additions & 16 deletions
@@ -677,8 +677,8 @@ typedef struct pcre2_real_match_data {

 #ifndef PCRE2_PCRE2TEST

-/* Structures for checking for mutual recursion when scanning compiled or
-parsed code. */
+/* Structures for checking for mutual function recursion when scanning compiled
+or parsed code. */

 typedef struct recurse_check {
   struct recurse_check *prev;
@@ -690,7 +690,7 @@ typedef struct parsed_recurse_check {
   uint32_t *groupptr;
 } parsed_recurse_check;

-/* Structure for building a cache when filling in recursion offsets. */
+/* Structure for building a cache when filling in pattern recursion offsets. */

 typedef struct recurse_cache {
   PCRE2_SPTR group;
@@ -757,7 +757,7 @@ typedef struct compile_block {
   int max_lookbehind;      /* Maximum lookbehind encountered (characters) */
   BOOL had_accept;         /* (*ACCEPT) encountered */
   BOOL had_pruneorskip;    /* (*PRUNE) or (*SKIP) encountered */
-  BOOL had_recurse;        /* Had a recursion or subroutine call */
+  BOOL had_recurse;        /* Had a pattern recursion or subroutine call */
   BOOL dupnames;           /* Duplicate names exist */
 } compile_block;

@@ -775,6 +775,7 @@ call within the pattern when running pcre2_dfa_match(). */
 typedef struct dfa_recursion_info {
   struct dfa_recursion_info *prevrec;
   PCRE2_SPTR subject_position;
+  PCRE2_SPTR last_used_ptr;
   uint32_t group_num;
 } dfa_recursion_info;

@@ -795,7 +796,7 @@ typedef struct heapframe {
   PCRE2_SIZE length;          /* Used for character, string, or code lengths */
   PCRE2_SIZE back_frame;      /* Amount to subtract on RRETURN */
   PCRE2_SIZE temp_size;       /* Used for short-term PCRE2_SIZE values */
-  uint32_t rdepth;            /* "Recursion" depth */
+  uint32_t rdepth;            /* Function "recursion" depth within pcre2_match() */
   uint32_t group_frame_type;  /* Type information for group frames */
   uint32_t temp_32[4];        /* Used for short-term 32-bit or BOOL values */
   uint8_t return_id;          /* Where to go on in internal "return" */
@@ -828,14 +829,15 @@ typedef struct heapframe {
   allows for exactly the right size ovector for the number of capturing
   parentheses. (See also the comment for pcre2_real_match_data above.) */

-  PCRE2_SPTR eptr;               /* MUST BE FIRST */
-  PCRE2_SPTR start_match;        /* Can be adjusted by \K */
-  PCRE2_SPTR mark;               /* Most recent mark on the success path */
-  uint32_t current_recurse;      /* Current (deepest) recursion number */
-  uint32_t capture_last;         /* Most recent capture */
-  PCRE2_SIZE last_group_offset;  /* Saved offset to most recent group frame */
-  PCRE2_SIZE offset_top;         /* Offset after highest capture */
-  PCRE2_SIZE ovector[131072];    /* Must be last in the structure */
+  PCRE2_SPTR eptr;                 /* MUST BE FIRST */
+  PCRE2_SPTR start_match;          /* Can be adjusted by \K */
+  PCRE2_SPTR mark;                 /* Most recent mark on the success path */
+  PCRE2_SPTR recurse_last_used;    /* Last character used at time of pattern recursion */
+  uint32_t current_recurse;        /* Group number of current (deepest) pattern recursion */
+  uint32_t capture_last;           /* Most recent capture */
+  PCRE2_SIZE last_group_offset;    /* Saved offset to most recent group frame */
+  PCRE2_SIZE offset_top;           /* Offset after highest capture */
+  PCRE2_SIZE ovector[131072];      /* Must be last in the structure */
 } heapframe;

 /* This typedef is a check that the size of the heapframe structure is a
@@ -877,7 +879,7 @@ typedef struct match_block {
   uint16_t name_count;         /* Number of names in name table */
   uint16_t name_entry_size;    /* Size of entry in names table */
   PCRE2_SPTR name_table;       /* Table of group names */
-  PCRE2_SPTR start_code;       /* For use when recursing */
+  PCRE2_SPTR start_code;       /* For use in pattern recursion */
   PCRE2_SPTR start_subject;    /* Start of the subject string */
   PCRE2_SPTR check_subject;    /* Where UTF-checked from */
   PCRE2_SPTR end_subject;      /* Usable end of the subject string */
@@ -889,7 +891,7 @@ typedef struct match_block {
   PCRE2_SPTR nomatch_mark;        /* Mark pointer to pass back on failure */
   PCRE2_SPTR verb_ecode_ptr;      /* For passing back info */
   PCRE2_SPTR verb_skip_ptr;       /* For passing back a (*SKIP) name */
-  uint32_t verb_current_recurse;  /* Current recurse when (*VERB) happens */
+  uint32_t verb_current_recurse;  /* Current recursion group when (*VERB) happens */
   uint32_t moptions;              /* Match options */
   uint32_t poptions;              /* Pattern options */
   uint32_t skip_arg_count;        /* For counting SKIP_ARGs */
@@ -929,7 +931,7 @@ typedef struct dfa_match_block {
   pcre2_callout_block *cb;        /* Points to a callout block */
   void *callout_data;             /* To pass back to callouts */
   int (*callout)(pcre2_callout_block *,void *);  /* Callout function or NULL */
-  dfa_recursion_info *recursive;  /* Linked list of recursion data */
+  dfa_recursion_info *recursive;  /* Linked list of pattern recursion data */
 } dfa_match_block;

 #endif /* PCRE2_PCRE2TEST */

src/pcre2_match.c

Lines changed: 14 additions & 8 deletions
@@ -5383,9 +5383,11 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,


     /* ===================================================================== */
-    /* Recursion either matches the current regex, or some subexpression. The
-    offset data is the offset to the starting bracket from the start of the
-    whole pattern. (This is so that it works from duplicated subpatterns.) */
+    /* Pattern recursion either matches the current regex, or some
+    subexpression. The offset data is the offset to the starting bracket from
+    the start of the whole pattern. This is so that it works from duplicated
+    subpatterns. For a whole-pattern recursion, we have to infer the number
+    zero. */

 #define Lframe_type F->temp_32[0]
 #define Lstart_branch F->temp_sptr[0]
@@ -5394,9 +5396,10 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,
     bracode = mb->start_code + GET(Fecode, 1);
     number = (bracode == mb->start_code)? 0 : GET2(bracode, 1 + LINK_SIZE);

-    /* If we are already in a recursion, check for repeating the same one
-    without advancing the subject pointer. This should catch convoluted mutual
-    recursions. (Some simple cases are caught at compile time.) */
+    /* If we are already in a pattern recursion, check for repeating the same
+    one without changing the subject pointer or the last referenced character
+    in the subject. This should catch convoluted mutual recursions. (Some
+    simple cases are caught at compile time.) */

     if (Fcurrent_recurse != RECURSE_UNSET)
       {
@@ -5407,15 +5410,18 @@ fprintf(stderr, "++ %2ld op=%3d %s\n", Fecode - mb->start_code, *Fecode,
         P = (heapframe *)((char *)N - frame_size);
         if (N->group_frame_type == (GF_RECURSE | number))
           {
-          if (Feptr == P->eptr) return PCRE2_ERROR_RECURSELOOP;
+          if (Feptr == P->eptr && mb->last_used_ptr == P->recurse_last_used)
+            return PCRE2_ERROR_RECURSELOOP;
           break;
           }
         offset = P->last_group_offset;
         }
       }

-    /* Now run the recursion, branch by branch. */
+    /* Remember the current last referenced character and then run the
+    recursion branch by branch. */

+    F->recurse_last_used = mb->last_used_ptr;
     Lstart_branch = bracode;
     Lframe_type = GF_RECURSE | number;
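
The interpreter's check walks back through saved group frames, not a linked list. The sketch below is a simplified model under stated assumptions, not the pcre2_match() code: `frame`, `recursion_repeats`, and `NO_FRAME` are hypothetical, frames are addressed by array index instead of byte offset, the value given to `GF_RECURSE` is illustrative only, and every group frame is assumed to have a predecessor at `offset - 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_FRAME   ((size_t)-1)   /* hypothetical: plays the role of an unset offset */
#define GF_RECURSE 0x00040000u    /* illustrative flag value, assumed for this sketch */

/* Hypothetical, reduced model of a heap frame. */
typedef struct frame {
  const char *eptr;               /* subject pointer */
  const char *recurse_last_used;  /* last used character saved when a recursion starts */
  size_t last_group_offset;       /* index of the most recent group frame, or NO_FRAME */
  uint32_t group_frame_type;      /* GF_RECURSE | group number for recursion frames */
} frame;

/* Walks back from frames[current] through the chain of group frames. At the
   nearest frame that is a recursion into group `number`, reports 1 if both
   the subject pointer and the last used character are unchanged since that
   recursion started, else 0. Returns 0 if no such frame is found. */
int recursion_repeats(const frame *frames, size_t current, uint32_t number,
                      const char *last_used)
{
  const frame *F = &frames[current];
  size_t offset = F->last_group_offset;

  while (offset != NO_FRAME)
    {
    const frame *N = &frames[offset];      /* the group frame itself */
    const frame *P = &frames[offset - 1];  /* the frame that started the group */
    if (N->group_frame_type == (GF_RECURSE | number))
      return F->eptr == P->eptr && last_used == P->recurse_last_used;
    offset = P->last_group_offset;
    }
  return 0;
}
```

As in the hunk above, only the nearest enclosing recursion of the same group is compared; the search stops there whether or not a repeat is detected.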

testdata/testinput1

Lines changed: 3 additions & 0 deletions
@@ -6648,4 +6648,7 @@ $/x
 a[]b
 (a)(?(1)a|b|c)

+/^..A(*SKIP)B|C/
+12ADC
+
 # End of testinput1
