Non-recursive scan prefix in JIT #560

zherczeg · 2024-11-12T11:29:04Z

The algorithm has been rewritten. It should cover more cases, and it is non-recursive.

Testing is difficult, though. When it covers cases that it should not, matching should fail, so that can be tested to some extent. The opposite, when it does not cover cases that it should, only makes matching slower.

Fixes #558 since it is non-recursive.

NWilson · 2024-11-13T08:02:21Z

I didn't have time to look at this yesterday; but I will today.

NWilson

Sorry, I've spent quite a lot of time on this, and typed a lot of silly questions. I'm just going to pause reviewing for the day.

src/pcre2_jit_compile.c

NWilson · 2024-11-13T13:47:05Z

src/pcre2_jit_compile.c

 }

-static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count)
+#define SCAN_PREFIX_STACK_END 32


Do we need a comment explaining the choice of "32"? Just, "big enough that we think we'll find what we want"? As I understand it, when you have parentheses and quantifiers it'll push to the stack, so we're bargaining here that very few patterns will have 12 characters surrounded by more than 32 metacharacters!

It is an arbitrary value. It should be big enough in practice. The stack usage is quite low in most cases, but:
/(a|){33}b/ or /(a?){33}/ can exhaust it. In this case no prefix is computed.
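A minimal sketch of the bounded-stack idea described above, not the code from this PR: saved positions live in a fixed-size array, and a push that would overflow makes the scan give up instead of growing. The names toy_stack, toy_push, toy_scan and TOY_STACK_END are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_STACK_END 32  /* hypothetical limit, in the spirit of SCAN_PREFIX_STACK_END */

typedef struct {
  const char *items[TOY_STACK_END];
  int ptr;  /* number of used slots */
} toy_stack;

/* Returns false when the stack is full; the caller then abandons the scan. */
static bool toy_push(toy_stack *s, const char *pos)
{
  assert(s != NULL);
  if (s->ptr >= TOY_STACK_END)
    return false;
  s->items[s->ptr++] = pos;
  return true;
}

/* Saves one position per pending branch. Returns -1 ("no prefix computed")
   on overflow, otherwise the number of saved positions. */
static int toy_scan(int pending_branches)
{
  static const char dummy[] = "x";
  toy_stack s = { { NULL }, 0 };
  int i;

  for (i = 0; i < pending_branches; i++)
    if (!toy_push(&s, dummy))
      return -1;
  return s.ptr;
}
```

With this shape, 32 pending branches fit and the 33rd makes the scan bail out, which costs only the optimization, not correctness.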

NWilson · 2024-11-13T13:52:49Z

src/pcre2_jit_compile.c

+      continue;
+      }
+
+    cc_stack[stack_ptr] = cc + len;


I don't fully understand what the old code was doing here. The old code did its scan_prefix recursion before continuing with last = FALSE; break; which then processes the character at position cc. So the recursion will process the following OPs before the caller finishes processing the OP_QUERY.

But then, eventually, the caller will process the following data, again? I don't understand. What happens for x?yz? The OP_QUERY for x? will recurse with scan_prefix to process the yz, but when that returns, it will then go ahead and process yz again? That would be a quadratic cost, if I'm reading the code correctly.

I don't really understand the way the recursion has been converted into a stack either. The old code would process the data at cc + len first, then when that frame returns, continue processing from cc. But the new code continues processing from cc, and then when the stack is popped, it will move on (or back?) to cc + len.

If the old order of operations was occurring, I'd have expected the code to do something like this instead:

// Jump into cc + len, to implement the recursive call into scan_prefix(cc + len)
cc_stack[stack_ptr] = cc;
cc = cc + len;
continue;

NWilson · 2024-11-13T14:45:37Z

src/pcre2_jit_compile.c

-static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count)
+#define SCAN_PREFIX_STACK_END 32
+
+static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars)


The signature could have a comment explaining that chars must have length MAX_N_CHARS, or it could have type fast_forward_char_data chars[MAX_N_CHARS].
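A small sketch of the second option, using hypothetical stand-ins (toy_char_data, TOY_MAX_N_CHARS, toy_reset) rather than the real types. In C, an array parameter is adjusted to a pointer, so the size in the declarator acts as documentation for the reader and is not enforced by the language.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_N_CHARS 12  /* hypothetical stand-in for MAX_N_CHARS */

typedef struct {
  int count;
} toy_char_data;  /* hypothetical stand-in for fast_forward_char_data */

/* The declarator documents that callers must pass TOY_MAX_N_CHARS entries.
   The parameter is still a plain pointer inside the function. */
static void toy_reset(toy_char_data chars[TOY_MAX_N_CHARS])
{
  int i;

  assert(chars != NULL);
  for (i = 0; i < TOY_MAX_N_CHARS; i++)
    chars[i].count = 0;
}
```

C99 also allows writing the size as `static TOY_MAX_N_CHARS` inside the brackets to state a minimum length, though whether every compiler the project supports accepts that form is an assumption that would need checking.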

NWilson · 2024-11-13T15:29:26Z

src/pcre2_jit_compile.c

+
+static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars)
 {
-/* Recursive function, which scans prefix literals. */


It took me an embarrassingly long time to work out what a "prefix literal" meant. I think I get it now: you just want to determine the set of characters which could start the pattern, or return an empty set if there's an unknowable set?

So for example, (x|yz) the "prefixes" are x, y (not z). And in (x...)?z the prefixes are x, z.

The old code is completely messing with my head - the flow of control is mega-convoluted to follow. So I should perhaps stop worrying about proving whether the behaviour is unchanged, and instead just review whether the new code implements the prefix concept accurately.

NWilson · 2024-11-13T15:42:20Z

src/pcre2_jit_compile.c


+  SLJIT_ASSERT(chars < chars_end);
+
  if (any)


Here's a suggestion: as far as I can see, two out of three of the uses of recursion in the original code were completely unnecessary.

There was OP_QUERY, OP_CRSTAR/OP_CRQUERY, and OP_BRA which do recursion. The first two don't need it!

For example, OP_QUERY could be:

switch (*cc) {
  case OP_QUERY:
    add_prefix_chars(*cc, caseless, repeat=1);
    cc += len;
    continue;

Something like that. Basically, take the blocks of code underneath the switch (if (any) ... and if (class) ... and the code for handling a single character). Lift those out into functions. Then you can process the OP_QUERY immediately, and just carry on to the next opcode, no need for any funky recursion tricks.

zherczeg · 2024-11-13T18:04:55Z

I will try to explain the algorithm using examples. The prefix is a simplified pattern, which has a fixed length, and it is a string of small character sets and dots in regexp terms.

Example:
/(abc|xbyd)/ prefix is [ax]b[cy] (3 characters long)
/a[a-z]b+c/ prefix is a.b (3 characters long) (I decided that + may introduce too many combinations, so it is not worth continuing past it. I never proved whether that is true or false; it is just my feeling.)
/ab?cd/ prefix is a[bc][cd] (3 characters long)
/(ab|cd)|(ef|gh)/ prefix is [aceg][bdfh] (2 characters long)

Why this prefix is collected: JIT can generate SIMD code, which can search for ...[ab]...[cd] patterns very fast (currently disabled on Windows, though). So in the prefix we search for at most two character sets (preferring single characters if possible), where the two sets have no common character. In my experience the chance of such pairs occurring is very low in real text, so an ab is quite rare, but a "  " (double space) is frequent. We even prefer " [ax]" over "  ", although the latter is "simpler".

scan_prefix scans the pattern in a simple loop. However, if a | or character? is encountered, we save its location on the stack, and continue to the closing ')'. This way we can parse /a(bc|cb|xy)e/ as a[bcx][cby]e.
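To make the examples concrete, here is a toy model of the merging step only. It is a sketch under strong assumptions: each branch is given as an already expanded literal string (so /ab?cd/ becomes "abcd" and "acd"), whereas the real function walks compiled opcodes with a stack. The names toy_set, toy_set_add, toy_prefix and the two caps are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_PREFIX 12  /* hypothetical cap on the prefix length */
#define TOY_MAX_SET 8      /* hypothetical cap on characters per position */

typedef struct {
  char chars[TOY_MAX_SET + 1];  /* NUL-terminated set of candidate characters */
} toy_set;

/* Adds c to the set unless it is already present. */
static void toy_set_add(toy_set *set, char c)
{
  size_t n = strlen(set->chars);

  if (strchr(set->chars, c) != NULL)
    return;
  assert(n < TOY_MAX_SET);  /* the toy does not model falling back to "any" */
  set->chars[n] = c;
  set->chars[n + 1] = '\0';
}

/* Merges the branches position by position. The prefix length is the length
   of the shortest branch (capped), because later positions are not fixed for
   every branch. Returns that length. */
static int toy_prefix(const char *const *alts, int alt_count, toy_set *out)
{
  size_t len = TOY_MAX_PREFIX;
  size_t pos;
  int i;

  assert(alt_count > 0);
  for (i = 0; i < alt_count; i++)
    if (strlen(alts[i]) < len)
      len = strlen(alts[i]);

  for (pos = 0; pos < len; pos++)
    {
    out[pos].chars[0] = '\0';
    for (i = 0; i < alt_count; i++)
      toy_set_add(&out[pos], alts[i][pos]);
    }
  return (int)len;
}
```

Feeding it "abc" and "xbyd" gives the sets ax, b, cy, matching the [ax]b[cy] example; "abcd" and "acd" give a, bc, cd, matching a[bc][cd].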

I hope this explains how the algorithm works. Unfortunately, everything in JIT is rather complex. That is what gives it its speed, but it makes the code quite challenging to understand.

Let me know if you need more explanation.

NWilson · 2024-11-13T22:26:48Z

Ouch! I knew my understanding of the code's intentions didn't match what the code was doing, but I couldn't work out its purpose from just reading the loop.

That's a really helpful explanation. I should be able to match that behaviour to the code now.

zherczeg · 2024-11-14T04:37:00Z

I have added the examples as a comment to the function. Let me know if more changes are needed.

zherczeg · 2024-11-15T03:52:36Z

I plan to land this patch soon, then #559, then the rest. Please let me know if this needs more changes.

NWilson

Sweet, I've spent another hour or two mentally stepping through the old code, and the new code, and I understand them both fully now.

They both appear correct, and should have the same behaviour.

Sorry to take a while. This was a good learning opportunity for me, since it made me learn more about how some of the OP codes worked, for some that I hadn't needed to look at yet.

NWilson · 2024-11-15T10:23:31Z

src/pcre2_jit_compile.c

+  if (chars >= chars_end)
+    {
+    if (stack_ptr == 0)
+      return chars_end - chars_start;


This chars_end - chars_start probably needs an explicit cast to int to silence warnings (even though we know it's always < 12, the compiler won't always work that out).
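A short sketch of the cast being suggested, with a hypothetical helper toy_count and a hypothetical bound. Pointer subtraction yields ptrdiff_t, which is wider than int on common 64-bit targets, so returning it as int without a cast can trigger narrowing warnings on some compilers and warning levels.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_N_CHARS 12  /* hypothetical bound on the element count */

/* Returns the number of elements between start and end as an int.
   The explicit cast documents that the narrowing is intentional. */
static int toy_count(const char *start, const char *end)
{
  ptrdiff_t diff = end - start;

  assert(diff >= 0 && diff <= TOY_MAX_N_CHARS);
  return (int)diff;
}
```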

NWilson · 2024-11-15T10:30:59Z

src/pcre2_jit_compile.c

-    while (--repeat > 0);
+    while (--repeat > 0 && chars < chars_end);

    repeat = 1;


There are several places that do repeat = 1, and I think you need to remember to do that in all the places that do a continue. That seems error-prone. Couldn't we move all the repeat = 1 assignments up to the start of the loop, where last/caseless/etc are initialized?

Unfortunately not.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_compile.c#L5883

For combined opcodes, we just record the repeat, and continue the main loop. Then we process the actual opcode with the recorded repeat.

E.g.: OP_TYPEEXACT followed by OP_ANY.

Maybe a goto could be used here, but I try not to use goto when possible.
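A toy illustration of that control flow, with made-up opcodes (TOY_EXACT, TOY_CHAR, TOY_END) that are not PCRE2's real encoding: the combined opcode only records the count and continues the main loop, and the following opcode consumes it. This is why repeat cannot simply be reset at the top of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy opcodes. TOY_EXACT is followed by a count,
   TOY_CHAR is followed by a character. */
enum { TOY_END, TOY_EXACT, TOY_CHAR };

/* Expands the toy program into out as a NUL-terminated string.
   Returns the number of characters written, excluding the NUL. */
static size_t toy_expand(const unsigned char *cc, char *out, size_t out_size)
{
  size_t n = 0;
  int repeat = 1;

  assert(out_size > 0);
  for (;;)
    {
    switch (*cc)
      {
      case TOY_EXACT:
      /* Combined opcode: record the count and let the main loop
         process the opcode that follows with it. */
      repeat = cc[1];
      cc += 2;
      continue;

      case TOY_CHAR:
      while (repeat-- > 0 && n + 1 < out_size)
        out[n++] = (char)cc[1];
      repeat = 1;  /* reset only after the count has been consumed */
      cc += 2;
      continue;

      default:
      out[n] = '\0';
      return n;
      }
    }
}
```

Resetting repeat at the start of every iteration would discard the count recorded by TOY_EXACT before TOY_CHAR could use it.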

Ouch! I wasn't aware of that. OK.

I wonder why OP_TYPEEXACT works differently to OP_CRSTAR (precedes the character matcher, rather than follows it). Oh well. Perhaps those opcodes could have been implemented as a common "one character suffix repetition", so that \d+ and [0-9]+ had more similar compiled representation.

Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.

NWilson · 2024-11-15T10:45:52Z

src/pcre2_jit_compile.c

+    while (--repeat > 0 && chars < chars_end);

+    repeat = 1;
    switch (*cc)


You could merge this switch with the one just above. The switch statement could then set last, and do the if (last) chars_end = chars assignment at the end. Then the code to handle CRSTAR/CRQUERY/CRRANGE wouldn't be split across two switches, and it would look more similar to the OP_STAR/OP_QUERY/OP_RANGE code in the main switch.

Totally optional though, and not needed to merge.

This is also a smart suggestion. I will do it.

NWilson · 2024-11-15T10:54:40Z

src/pcre2_jit_compile.c

+        SLJIT_ASSERT(stack_ptr < SCAN_PREFIX_STACK_END);
+        cc_stack[stack_ptr] = alternative;
+        chars_stack[stack_ptr] = chars;
+        next_alternative_stack[stack_ptr] = 1;


The assignments to chars_stack and next_alternative_stack should be setting the same values that we just read.

That's just an observation. I don't have a preference between keeping the redundant assignments as they are here, or converting them to a debug assertion.

I hadn't noticed that. This is quite a good observation.

PhilipHazel · 2024-11-15T16:15:10Z

Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.

When I first wrote PCRE in 1997, Perl regular expressions were a lot simpler than they are today. I'm actually quite gratified to know that we've managed to keep on upgrading the code over nearly 30 years. The original set of opcodes was invented "off the top of my head" but has been changed over the years. For example, at first I had a "string" opcode rather than OP_CHAR. That had its problems; this is from the ChangeLog for 2.07:

Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was
causing the entire string to be ignored, instead of just the last character.

It was not until release 5.0 (2004) that the string op was abolished in favour of OP_CHAR. (Reading the PCRE1 ChangeLog can be quite instructive.) Note that JIT support didn't come along until 2011 (release 8.20), a little bit after I got rid of tracking options at runtime (there used to be OP_OPT) and introduced OP_CHARI and other caseful/caseless pairs in 8.13.

So yes, it was the interpreter that influenced all these early decisions.

@zherczeg is also correct in saying that not much attention has been paid to the exact values of error offsets.

zherczeg force-pushed the scan_prefix branch 3 times, most recently from 0cc80ce to 494c24f Compare November 12, 2024 13:11

zherczeg mentioned this pull request Nov 13, 2024

Implement Perl extended character classes #553

Merged

NWilson reviewed Nov 13, 2024


zherczeg force-pushed the scan_prefix branch from 494c24f to e5eecef Compare November 13, 2024 18:09

zherczeg force-pushed the scan_prefix branch from e5eecef to 028ab04 Compare November 14, 2024 04:36

zherczeg force-pushed the scan_prefix branch from 028ab04 to e729fc0 Compare November 14, 2024 04:55

NWilson approved these changes Nov 15, 2024


zherczeg force-pushed the scan_prefix branch 2 times, most recently from 6ac748e to 486adb5 Compare November 15, 2024 11:31

Non-recursive scan prefix in JIT

3511587

zherczeg force-pushed the scan_prefix branch from 486adb5 to 3511587 Compare November 15, 2024 11:56

zherczeg merged commit 6f2da25 into PCRE2Project:master Nov 15, 2024

zherczeg deleted the scan_prefix branch November 15, 2024 12:21
