Commit 62a6498

Improve logic for setting starting code units by ignoring certain assertions.
1 parent 9de4d53 commit 62a6498

4 files changed: +99 -13 lines changed

ChangeLog

Lines changed: 12 additions & 7 deletions
@@ -184,13 +184,18 @@ PCRE2_CASELESS and PCRE2_UCP (but not PCRE2_UTF) were set. Fixed by not trying
 to look for other cases for characters above the Unicode range.
 
 50. In caseless 32-bit mode with UCP (but not UTF) set, the character
-0xffffffff incorrectly matched any character that has more than one other case,
+0xffffffff incorrectly matched any character that has more than one other case,
 in particular k and s.
 
 51. Fix accept and endanchored interaction in JIT.
 
 52. Fix backreferences with unset backref and non-greedy iterators in JIT.
 
+53. Improve the logic that checks for a list of starting code units -- positive
+lookahead assertions are now ignored if the immediately following item is one
+that sets a mandatory starting character. For example, /a?(?=bc|)d/ used to set
+all of a, b, and d as possible starting code units; now it sets only a and d.
+
 
 Version 10.42 11-December-2022
 ------------------------------
@@ -214,12 +219,12 @@ maximum of 65535 is now silently applied.
 
 5. Merged @carenas patch #175 which fixes #86 - segfault on aarch64 (ARM),
 
-6. The prototype for pcre2_substring_list_free() specified its argument as
-PCRE2_SPTR * which is a const data type, whereas the yield from
-pcre2_substring_list() is not const. This caused compiler warnings. I have
-changed the argument of pcre2_substring_list_free() to be PCRE2_UCHAR ** to
-remove this anomaly. This might cause new warnings in existing code where a
-cast has been used to avoid previous ones.
+6. The prototype for pcre2_substring_list_free() specified its argument as
+PCRE2_SPTR * which is a const data type, whereas the yield from
+pcre2_substring_list() is not const. This caused compiler warnings. I have
+changed the argument of pcre2_substring_list_free() to be PCRE2_UCHAR ** to
+remove this anomaly. This might cause new warnings in existing code where a
+cast has been used to avoid previous ones.
 
 
 Version 10.41 06-December-2022
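
ChangeLog item 53 can be illustrated with a small toy model. The sketch below is not PCRE2 code: the item kinds, the `item` struct, and the functions `collect_start_units` and `set_to_string` are all hypothetical names invented for this illustration. It assumes a pattern is flattened into a list of items and collects possible starting characters, with a flag that switches the "skip a lookahead that is directly followed by a mandatory item" rule on or off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy item kinds; these are NOT PCRE2 opcodes. */
typedef enum { ITEM_CHAR, ITEM_OPT_CHAR, ITEM_LOOKAHEAD } item_kind;

typedef struct {
  item_kind kind;
  const char *first;  /* CHAR/OPT_CHAR: the character itself;
                         LOOKAHEAD: first characters of its non-empty branches */
  bool can_be_empty;  /* LOOKAHEAD only: true if some branch is empty */
} item;

/* Collect possible starting characters for a toy pattern. When
   skip_lookahead is true, a lookahead directly followed by a mandatory
   character is ignored (the rule described in item 53). */
void collect_start_units(const item *items, size_t n, bool skip_lookahead,
                         bool set[256])
{
  assert(items != NULL && set != NULL);
  memset(set, 0, 256 * sizeof(bool));
  for (size_t i = 0; i < n; i++)
    {
    switch (items[i].kind)
      {
      case ITEM_CHAR:         /* Mandatory: record it and stop scanning */
      set[(unsigned char)items[i].first[0]] = true;
      return;

      case ITEM_OPT_CHAR:     /* Optional: record it and carry on */
      set[(unsigned char)items[i].first[0]] = true;
      break;

      case ITEM_LOOKAHEAD:
      if (skip_lookahead && i + 1 < n && items[i + 1].kind == ITEM_CHAR)
        break;                /* The next item decides; ignore the assertion */
      for (const char *p = items[i].first; *p != '\0'; p++)
        set[(unsigned char)*p] = true;
      if (!items[i].can_be_empty) return;
      break;
      }
    }
}

/* Render the set as a string of characters in ascending order. */
void set_to_string(const bool set[256], char buf[257])
{
  size_t k = 0;
  for (int c = 1; c < 256; c++) if (set[c]) buf[k++] = (char)c;
  buf[k] = '\0';
}
```

For the toy encoding of /a?(?=bc|)d/ (an optional `a`, a lookahead whose non-empty branch starts with `b` and which has an empty branch, then a mandatory `d`), the old behaviour corresponds to `skip_lookahead = false` and the new behaviour to `skip_lookahead = true`.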

src/pcre2_study.c

Lines changed: 48 additions & 6 deletions
@@ -979,6 +979,7 @@ do
   while (try_next)    /* Loop for items in this branch */
     {
     int rc;
+    PCRE2_SPTR ncode;
     uint8_t *classmap = NULL;
 #ifdef SUPPORT_WIDE_CHARS
     PCRE2_UCHAR xclassflags;
@@ -1110,10 +1111,53 @@ do
       tcode++;
       break;
 
-      /* If we hit a bracket or a positive lookahead assertion, recurse to set
-      bits from within the subpattern. If it can't find anything, we have to
-      give up. If it finds some mandatory character(s), we are done for this
-      branch. Otherwise, carry on scanning after the subpattern. */
+      /* For a positive lookahead assertion, inspect what immediately follows.
+      If the next item is one that sets a mandatory character, skip this
+      assertion. Otherwise, treat it the same as other bracket groups. */
+
+      case OP_ASSERT:
+      case OP_ASSERT_NA:
+      ncode = tcode + GET(tcode, 1);
+      while (*ncode == OP_ALT) ncode += GET(ncode, 1);
+      ncode += 1 + LINK_SIZE;
+      switch(*ncode)
+        {
+        default:
+        break;
+
+        case OP_PROP:
+        if (ncode[1] != PT_CLIST) break;
+        /* Fall through */
+        case OP_ANYNL:
+        case OP_CHAR:
+        case OP_CHARI:
+        case OP_EXACT:
+        case OP_EXACTI:
+        case OP_HSPACE:
+        case OP_MINPLUS:
+        case OP_MINPLUSI:
+        case OP_PLUS:
+        case OP_PLUSI:
+        case OP_POSPLUS:
+        case OP_POSPLUSI:
+        case OP_VSPACE:
+        /* Note that these types will only be present in non-UCP mode. */
+        case OP_DIGIT:
+        case OP_NOT_DIGIT:
+        case OP_WORDCHAR:
+        case OP_NOT_WORDCHAR:
+        case OP_WHITESPACE:
+        case OP_NOT_WHITESPACE:
+        tcode = ncode;
+        continue;   /* With the following opcode */
+        }
+      /* Fall through */
+
+      /* For a group bracket or a positive assertion without an immediately
+      following mandatory setting, recurse to set bits from within the
+      subpattern. If it can't find anything, we have to give up. If it finds
+      some mandatory character(s), we are done for this branch. Otherwise,
+      carry on scanning after the subpattern. */
 
       case OP_BRA:
       case OP_SBRA:
@@ -1125,8 +1169,6 @@ do
       case OP_SCBRAPOS:
       case OP_ONCE:
       case OP_SCRIPT_RUN:
-      case OP_ASSERT:
-      case OP_ASSERT_NA:
       rc = set_start_bits(re, tcode, utf, ucp, depthptr);
       if (rc == SSB_DONE)
         {
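
The three lines that compute `ncode` walk from the start of the assertion over each alternative to the closing bracket, then step past it to reach the item that follows. The sketch below shows the same walk on a toy byte code. It is an illustration only: the opcode values, `TOY_LINK_SIZE`, `toy_get`, `skip_assertion`, and the `toy_code` layout are assumptions made for this example (a two-byte big-endian link after each group opcode), not PCRE2's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcodes and link size: illustrative values, NOT PCRE2's real ones. */
enum { T_END, T_CHAR, T_ASSERT, T_ALT, T_KET };
#define TOY_LINK_SIZE 2

/* Read the two-byte, big-endian link stored at code[offset]. */
unsigned toy_get(const uint8_t *code, size_t offset)
{
  return ((unsigned)code[offset] << 8) | code[offset + 1];
}

/* Given a pointer to a toy assertion, return a pointer to the item that
   follows its closing bracket, mirroring the ncode computation in the diff. */
const uint8_t *skip_assertion(const uint8_t *tcode)
{
  assert(*tcode == T_ASSERT);
  const uint8_t *ncode = tcode + toy_get(tcode, 1);    /* First ALT or KET */
  while (*ncode == T_ALT) ncode += toy_get(ncode, 1);  /* Skip each branch */
  return ncode + 1 + TOY_LINK_SIZE;                    /* Step past the KET */
}

/* Toy encoding of (?=bc|)d */
const uint8_t toy_code[] = {
  T_ASSERT, 0, 7,   /*  0: start of lookahead, link to the T_ALT at 7 */
  T_CHAR, 'b',      /*  3 */
  T_CHAR, 'c',      /*  5 */
  T_ALT, 0, 3,      /*  7: empty second branch, link to the T_KET at 10 */
  T_KET, 0, 10,     /* 10: end of group, link back to its start */
  T_CHAR, 'd',      /* 13: the item that follows the assertion */
  T_END             /* 15 */
};
```

Calling `skip_assertion(toy_code)` lands on the `T_CHAR 'd'` item at offset 13. In the real code, the inner `switch(*ncode)` then decides whether that following item sets a mandatory starting character.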

testdata/testinput2

Lines changed: 12 additions & 0 deletions
@@ -6075,4 +6075,16 @@ a)"xI
 /(?:|(?0).)(?(R)|\z)/
     abcd
 
+/a?(?=b(*COMMIT)c|)d/I
+    bd
+
+/(?=b(*COMMIT)c|)d/I
+    bd
+
+/a?(?=b(*COMMIT)c|)d/I,no_start_optimize
+    bd
+
+/(?=b(*COMMIT)c|)d/I,no_start_optimize
+    bd
+
 # End of testinput2

testdata/testoutput2

Lines changed: 27 additions & 0 deletions
@@ -17977,6 +17977,33 @@ No match
     abcd
  0: abcd
 
+/a?(?=b(*COMMIT)c|)d/I
+Capture group count = 0
+Starting code units: a d
+Last code unit = 'd'
+Subject length lower bound = 1
+    bd
+ 0: d
+
+/(?=b(*COMMIT)c|)d/I
+Capture group count = 0
+First code unit = 'd'
+Subject length lower bound = 1
+    bd
+ 0: d
+
+/a?(?=b(*COMMIT)c|)d/I,no_start_optimize
+Capture group count = 0
+Options: no_start_optimize
+    bd
+No match
+
+/(?=b(*COMMIT)c|)d/I,no_start_optimize
+Capture group count = 0
+Options: no_start_optimize
+    bd
+No match
+
 # End of testinput2
 Error -70: PCRE2_ERROR_BADDATA (unknown error number)
 Error -62: bad serialized data
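
The expected output differs with and without no_start_optimize because the starting code unit check decides which subject positions are tried at all. The sketch below is one way to reproduce that difference in a toy model. It is not PCRE2's matcher: `try_at` is a hand-coded attempt of /(?=b(*COMMIT)c|)d/ at a single position, and `search`, `try_result`, and the `start_units` parameter are hypothetical names for this illustration. It assumes that once (*COMMIT) has been passed, any later failure ends the whole search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { TRY_MATCH, TRY_FAIL, TRY_COMMIT_FAIL } try_result;

/* Hand-coded attempt of /(?=b(*COMMIT)c|)d/ at one subject position. */
try_result try_at(const char *s, size_t pos)
{
  bool committed = false;
  if (s[pos] == 'b')              /* First branch of the lookahead */
    {
    committed = true;             /* (*COMMIT) has now been passed */
    if (s[pos + 1] != 'c') return TRY_COMMIT_FAIL;
    }
  /* Otherwise the empty second branch lets the lookahead succeed. */
  if (s[pos] == 'd') return TRY_MATCH;
  return committed ? TRY_COMMIT_FAIL : TRY_FAIL;
}

/* Return the match offset, or -1 for no match. If start_units is not NULL,
   positions whose character is not listed are skipped without being tried. */
long search(const char *s, const char *start_units)
{
  assert(s != NULL);
  size_t len = strlen(s);
  for (size_t pos = 0; pos < len; pos++)
    {
    if (start_units != NULL && strchr(start_units, s[pos]) == NULL)
      continue;                   /* Start optimization: skip this position */
    try_result r = try_at(s, pos);
    if (r == TRY_MATCH) return (long)pos;
    if (r == TRY_COMMIT_FAIL) return -1;   /* No further positions are tried */
    }
  return -1;
}
```

With the subject `bd`, `search("bd", "d")` skips the `b` and matches `d` at offset 1, while `search("bd", NULL)` tries offset 0, passes (*COMMIT), fails, and reports no match. This mirrors the `0: d` and `No match` results in the test output.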
