Skip to content

Conversation

@zherczeg
Copy link
Collaborator

The algorithm is rewritten. It should cover more cases and non-recursive.

Testing is difficult though. When it covers cases that it should not, the matching should fail, so that can be tested at some level. However, the opposite, when it does not cover cases, that it should, the effect is just slower matching.

Fixes #558 since it is non-recursive.

@zherczeg zherczeg force-pushed the scan_prefix branch 3 times, most recently from 0cc80ce to 494c24f Compare November 12, 2024 13:11
@NWilson
Copy link
Member

NWilson commented Nov 13, 2024

I didn't have time to look at this yesterday; but I will today.

Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've spent quite a lot of time on this, and typed a lot of silly questions. I'm just going to pause reviewing for the day.

}

static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count)
#define SCAN_PREFIX_STACK_END 32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a comment explaining the choice of "32"? Just, "big enough that we think we'll find what we want"? As I understand it, when you have parentheses and quantifiers it'll push to the stack, so we're bargaining here that very few patterns will have 12 characters surrounded by more 32 metacharacters!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a random value. Should be big enough in practice. The stack usage is quite low in most cases, but:
/(a|){33}b/ or /(a?){33}/can exhaust it. In this case no prefix is computed.

continue;
}

cc_stack[stack_ptr] = cc + len;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what the old code was doing here. The old code did its scan_prefix recursion before continuing with last = FALSE; break; which that processes the character at position cc. So the recursion will process the following OPs before it the caller finishes processing the OP_QUERY.

But then, eventually, the caller will process the following data, again? I don't understand. What happens for x?yz? The OP_QUERY for x? will recurse with scan_prefix to process the yz, but when that returns, it will then go ahead and process yz again? That would be a quadratic cost, if I'm reading the code correctly.

I don't really understand the way the recursion has been converted into a stack either. The old code would process the data at cc + len first, then when that frame returns, continue processing from cc. But the new code continues processing from cc, and then when the stack is popped, it will move on (or back?) to cc + len.

If the old order of operations was occurring, I'd have expected the code to do something like this instead:

// Jump into cc + len, to implement the recursive call into scan_prefix(cc + len)
cc_stack[stack_ptr] = cc;
cc = cc + len;
continue;

static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count)
#define SCAN_PREFIX_STACK_END 32

static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signature could have a comment explaining that chars must have length MAX_N_CHARS, or it could have type fast_forward_char_data chars[MAX_N_CHARS].


static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars)
{
/* Recursive function, which scans prefix literals. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me an embarrassingly long time to work out what a "prefix literal" meant. I think I get it now: you just want to determine the set of characters which could start the pattern, or return an empty set if there's an unknowable set?

So for example, (x|yz) the "prefixes" are x, y (not z). And in (x...)?z the prefixes are x, z.

The old code is completely messing with my head - the flow of control is mega-convoluted to follow. So I should perhaps stop worrying about proving whether the behaviour is unchanged, and instead just review whether the new code implements the prefix concept accurately.


SLJIT_ASSERT(chars < chars_end);

if (any)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a suggestion: as far as I can see, two out of three of the uses of recursion in the original code were completely unnecessary.

There was OP_QUERY, OP_CRSTAR/OP_CRQUERY, and OP_BRA which do recursion. The first two don't need it!

For example, OP_QUERY could be:

switch (*cc) {

case OP_QUERY:
add_prefix_chars(*cc, caseless, repeat=1);
cc += len;
continue;

Something like that. Basically, take the blocks of code underneath the switch (if (any) ... and if (class) ... and the code for handling a single character). Lift those out into functions. Then you can process the OP_QUERY immediately, and just carry on to the next opcode, no need for any funky recursion tricks.

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 13, 2024

I will try to explain the algorithm using examples. The prefix is a simplified pattern, which has a fixed length, and it is a string of small character sets and dots in regexp terms.

Example:
/(abc|xbyd)/ prefix is [ax]b[cy] (3 characters long)
/a[a-z]b+c/ prefix is a.b (3 characters long) (I decided that + may introduce too many combinations, and not worth to continue. Never proved that it is true or false, just my feelings.)
/ab?cd/ prefix is a[bc][cd] (3 characters long)
/(ab|cd)|(ef|gh)/ prefix is [aceg][bdfh] (2 characters long)

Why this prefix is collected: jit can generate a simd code, which can search ...[ab]...[cd] patterns very fast (currently disabled on Windows though). So in the prefix we search for maximum two character sets (prefer single characters if possible), where the two sets has no common character. In my experiences the chance of such pairs are very low in a real text, so an ab is quite rare, but a " " (double space) is frequent. We even prefer an " [ax]" over " ", although the latter is "simpler".

The scan_prefix scans the pattern in a simple loop. However, if an | or character? is encountered, we save its location on the stack, and continue to the ending ')'. This way we can parse /a(bc|cb|xy)e/ as a[bcx][cby]e.

I hope this explains how the algorithm works. In jit, everything is rather complex unfortunately. This gives its speed, but quite challenging for understanding.

Let me know if you need more explanation.

@NWilson
Copy link
Member

NWilson commented Nov 13, 2024

Ouch! I knew my understanding of the code's intentions didn't match what the code was doing, but I couldn't work out its purpose from just reading the loop.

That's really helpful explanation. I should be able to match that behaviour to the code now.

@zherczeg
Copy link
Collaborator Author

I have added the examples as a comment to the function. Let me know if more changes are needed.

@zherczeg
Copy link
Collaborator Author

I plan to land this patch soon, then #559, then the rest. Please let me know if this needs more changes.

Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, I've spent another hour or two mentally stepping through the old code, and the new code, and I understand them both fully now.

They both appear correct, and should have the same behaviour.

Sorry to take a while. This was a good learning opportunity for me, since it made me learn more about how some of the OP codes worked, for some that I hadn't needed to look at yet.

if (chars >= chars_end)
{
if (stack_ptr == 0)
return chars_end - chars_start;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chars_end - chars_start probably needs an explicit cast to int to silence warnings (even though we know it's always < 12 the compiler won't work it out always).

while (--repeat > 0);
while (--repeat > 0 && chars < chars_end);

repeat = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several places that do repeat = 1, and I think you need to remember to do that in all the places that do a continue. That seems error-prone. Couldn't we move all the repeat = 1 assignments up to the start of the loop, where last/caseless/etc are initialized?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_compile.c#L5883

For combined opcodes, we just record the repeat, and continue the main loop. Then we process the actual opcode with the recorded repeat.

E.g.: OP_TYPEEXACT followed by OP_ANY.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a goto could be used here, but I try to not use goto when possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch! I wasn't aware of that. OK.

I wonder why OP_TYPEEXACT works differently to OP_CRSTAR (precedes the character matcher, rather than follows it). Oh well. Perhaps those opcodes could have been implemented as a common "one character suffix repetition", so that \d+ and [0-9]+ had more similar compiled representation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.

while (--repeat > 0 && chars < chars_end);

repeat = 1;
switch (*cc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could merge this switch with the one just above. The switch statement could then set last, and do the if (last) chars_end = chars assignment at the end. Then the code to handle CRSTAR/CRQUERY/CRRANGE wouldn't be split across two switches, and it would look more similar to the OP_START/OP_QUERY/OP_RANGE code in the main switch.

Totally optional though, and not needed to merge.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a smart suggestion. I do it.

SLJIT_ASSERT(stack_ptr < SCAN_PREFIX_STACK_END);
cc_stack[stack_ptr] = alternative;
chars_stack[stack_ptr] = chars;
next_alternative_stack[stack_ptr] = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assignments to chars_stack and next_alternative_stack should be setting the same values that we just read.

That's just an observation. I don't have a preference between keeping the redundant assignments as they are here, or converting them to a debug assertion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't noticed that. This is a quite good observation.

@zherczeg zherczeg force-pushed the scan_prefix branch 2 times, most recently from 6ac748e to 486adb5 Compare November 15, 2024 11:31
@zherczeg zherczeg merged commit 6f2da25 into PCRE2Project:master Nov 15, 2024
@zherczeg zherczeg deleted the scan_prefix branch November 15, 2024 12:21
@PhilipHazel
Copy link
Collaborator

Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.

When I first wrote PCRE in 1997, Perl regular expressions were a lot simpler than they are today. I'm actually quite gratified to know that we've managed to keep on upgrading the code over nearly 30 years. The original set of opcodes was invented "off the top of my head" but has been changed over the years. For example, at first I had a "string" opcode rather than OP_CHAR. That had its problems; this is from the ChangeLog for 2.07:

  1. Fixed bug: a zero repetition after a literal string (e.g. /abcde{0}/) was
    causing the entire string to be ignored, instead of just the last character.

It was not until release 5.0 (2004) that the string op was abolished in favour of OP_CHAR. (Reading the PCRE1 ChangeLog can be quite instructive.) Note that JIT support didn't come along until 2011 (release 8.20), a little bit after I got rid of tracking options at runtime (there used to be OP_OPT) and introduced OP_CHARI and other caseful/caseless pairs in 8.13.

So yes, it was the interpreter that influenced all these early decisions.

@zherczeg is also correct in saying that not much attention has been paid to the exact values of error offsets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scan_prefix has no effective recursion limit

3 participants