-
Notifications
You must be signed in to change notification settings - Fork 243
Non-recursive scan prefix in JIT #560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0cc80ce to
494c24f
Compare
|
I didn't have time to look at this yesterday; but I will today. |
NWilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've spent quite a lot of time on this, and typed a lot of silly questions. I'm just going to pause reviewing for the day.
| } | ||
|
|
||
| static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count) | ||
| #define SCAN_PREFIX_STACK_END 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a comment explaining the choice of "32"? Just, "big enough that we think we'll find what we want"? As I understand it, when you have parentheses and quantifiers it'll push to the stack, so we're bargaining here that very few patterns will have 12 characters surrounded by more 32 metacharacters!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a random value. Should be big enough in practice. The stack usage is quite low in most cases, but:
/(a|){33}b/ or /(a?){33}/can exhaust it. In this case no prefix is computed.
| continue; | ||
| } | ||
|
|
||
| cc_stack[stack_ptr] = cc + len; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand what the old code was doing here. The old code did its scan_prefix recursion before continuing with last = FALSE; break; which that processes the character at position cc. So the recursion will process the following OPs before it the caller finishes processing the OP_QUERY.
But then, eventually, the caller will process the following data, again? I don't understand. What happens for x?yz? The OP_QUERY for x? will recurse with scan_prefix to process the yz, but when that returns, it will then go ahead and process yz again? That would be a quadratic cost, if I'm reading the code correctly.
I don't really understand the way the recursion has been converted into a stack either. The old code would process the data at cc + len first, then when that frame returns, continue processing from cc. But the new code continues processing from cc, and then when the stack is popped, it will move on (or back?) to cc + len.
If the old order of operations was occurring, I'd have expected the code to do something like this instead:
// Jump into cc + len, to implement the recursive call into scan_prefix(cc + len)
cc_stack[stack_ptr] = cc;
cc = cc + len;
continue;
| static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars, int max_chars, sljit_u32 *rec_count) | ||
| #define SCAN_PREFIX_STACK_END 32 | ||
|
|
||
| static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The signature could have a comment explaining that chars must have length MAX_N_CHARS, or it could have type fast_forward_char_data chars[MAX_N_CHARS].
|
|
||
| static int scan_prefix(compiler_common *common, PCRE2_SPTR cc, fast_forward_char_data *chars) | ||
| { | ||
| /* Recursive function, which scans prefix literals. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me an embarrassingly long time to work out what a "prefix literal" meant. I think I get it now: you just want to determine the set of characters which could start the pattern, or return an empty set if there's an unknowable set?
So for example, (x|yz) the "prefixes" are x, y (not z). And in (x...)?z the prefixes are x, z.
The old code is completely messing with my head - the flow of control is mega-convoluted to follow. So I should perhaps stop worrying about proving whether the behaviour is unchanged, and instead just review whether the new code implements the prefix concept accurately.
|
|
||
| SLJIT_ASSERT(chars < chars_end); | ||
|
|
||
| if (any) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a suggestion: as far as I can see, two out of three of the uses of recursion in the original code were completely unnecessary.
There was OP_QUERY, OP_CRSTAR/OP_CRQUERY, and OP_BRA which do recursion. The first two don't need it!
For example, OP_QUERY could be:
switch (*cc) {
case OP_QUERY:
add_prefix_chars(*cc, caseless, repeat=1);
cc += len;
continue;
Something like that. Basically, take the blocks of code underneath the switch (if (any) ... and if (class) ... and the code for handling a single character). Lift those out into functions. Then you can process the OP_QUERY immediately, and just carry on to the next opcode, no need for any funky recursion tricks.
|
I will try to explain the algorithm using examples. The prefix is a simplified pattern, which has a fixed length, and it is a string of small character sets and dots in regexp terms. Example: Why this prefix is collected: jit can generate a simd code, which can search The scan_prefix scans the pattern in a simple loop. However, if an I hope this explains how the algorithm works. In jit, everything is rather complex unfortunately. This gives its speed, but quite challenging for understanding. Let me know if you need more explanation. |
494c24f to
e5eecef
Compare
|
Ouch! I knew my understanding of the code's intentions didn't match what the code was doing, but I couldn't work out its purpose from just reading the loop. That's really helpful explanation. I should be able to match that behaviour to the code now. |
e5eecef to
028ab04
Compare
|
I have added the examples as a comment to the function. Let me know if more changes are needed. |
028ab04 to
e729fc0
Compare
|
I plan to land this patch soon, then #559, then the rest. Please let me know if this needs more changes. |
NWilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, I've spent another hour or two mentally stepping through the old code, and the new code, and I understand them both fully now.
They both appear correct, and should have the same behaviour.
Sorry to take a while. This was a good learning opportunity for me, since it made me learn more about how some of the OP codes worked, for some that I hadn't needed to look at yet.
src/pcre2_jit_compile.c
Outdated
| if (chars >= chars_end) | ||
| { | ||
| if (stack_ptr == 0) | ||
| return chars_end - chars_start; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This chars_end - chars_start probably needs an explicit cast to int to silence warnings (even though we know it's always < 12 the compiler won't work it out always).
| while (--repeat > 0); | ||
| while (--repeat > 0 && chars < chars_end); | ||
|
|
||
| repeat = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several places that do repeat = 1, and I think you need to remember to do that in all the places that do a continue. That seems error-prone. Couldn't we move all the repeat = 1 assignments up to the start of the loop, where last/caseless/etc are initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not.
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_compile.c#L5883
For combined opcodes, we just record the repeat, and continue the main loop. Then we process the actual opcode with the recorded repeat.
E.g.: OP_TYPEEXACT followed by OP_ANY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a goto could be used here, but I try to not use goto when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch! I wasn't aware of that. OK.
I wonder why OP_TYPEEXACT works differently to OP_CRSTAR (precedes the character matcher, rather than follows it). Oh well. Perhaps those opcodes could have been implemented as a common "one character suffix repetition", so that \d+ and [0-9]+ had more similar compiled representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these choices are made based on how it is easier / faster to process them in the interpreter. Philip probably could tell more about it.
src/pcre2_jit_compile.c
Outdated
| while (--repeat > 0 && chars < chars_end); | ||
|
|
||
| repeat = 1; | ||
| switch (*cc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could merge this switch with the one just above. The switch statement could then set last, and do the if (last) chars_end = chars assignment at the end. Then the code to handle CRSTAR/CRQUERY/CRRANGE wouldn't be split across two switches, and it would look more similar to the OP_START/OP_QUERY/OP_RANGE code in the main switch.
Totally optional though, and not needed to merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a smart suggestion. I do it.
src/pcre2_jit_compile.c
Outdated
| SLJIT_ASSERT(stack_ptr < SCAN_PREFIX_STACK_END); | ||
| cc_stack[stack_ptr] = alternative; | ||
| chars_stack[stack_ptr] = chars; | ||
| next_alternative_stack[stack_ptr] = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assignments to chars_stack and next_alternative_stack should be setting the same values that we just read.
That's just an observation. I don't have a preference between keeping the redundant assignments as they are here, or converting them to a debug assertion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't noticed that. This is a quite good observation.
6ac748e to
486adb5
Compare
486adb5 to
3511587
Compare
When I first wrote PCRE in 1997, Perl regular expressions were a lot simpler than they are today. I'm actually quite gratified to know that we've managed to keep on upgrading the code over nearly 30 years. The original set of opcodes was invented "off the top of my head" but has been changed over the years. For example, at first I had a "string" opcode rather than OP_CHAR. That had its problems; this is from the ChangeLog for 2.07:
It was not until release 5.0 (2004) that the string op was abolished in favour of OP_CHAR. (Reading the PCRE1 ChangeLog can be quite instructive.) Note that JIT support didn't come along until 2011 (release 8.20), a little bit after I got rid of tracking options at runtime (there used to be OP_OPT) and introduced OP_CHARI and other caseful/caseless pairs in 8.13. So yes, it was the interpreter that influenced all these early decisions. @zherczeg is also correct in saying that not much attention has been paid to the exact values of error offsets. |
The algorithm is rewritten. It should cover more cases and non-recursive.
Testing is difficult though. When it covers cases that it should not, the matching should fail, so that can be tested at some level. However, the opposite, when it does not cover cases, that it should, the effect is just slower matching.
Fixes #558 since it is non-recursive.