Merged
2 changes: 1 addition & 1 deletion .github/workflows/dev.yml
@@ -68,7 +68,7 @@ jobs:

- name: Test (main test script)
run: |
-ulimit -S -s 32768 # Raise stack limit; ASAN with -O0 is very stack-hungry
+ulimit -S -s 49152 # Raise stack limit; ASAN with -O0 is very stack-hungry
Collaborator

Do we know why?

Member Author

Yes. The fuzz tests generate very deep stacks. Stack usage in ordinary builds is quite sensible (15 * 32 bytes), but in the ASAN builds the stack frames and stack allocations carry an absurdly large amount of guard space, so I had to bump this higher.

The eclass code (with a 15-deep parenthesis limit) uses similar or less stack than the main regex parser, which is also recursive-descent with a 255-deep limit.
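
The depth limits mentioned above bound the recursion, and therefore the number of stack frames. A minimal sketch of a depth-limited recursive-descent walk over nested brackets follows. It is not the PCRE2 parser: `sketch_parse_class` and `SKETCH_MAX_DEPTH` are hypothetical names, only the figure of 15 comes from the comment, and what exactly counts as one nesting level is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DEPTH 15   /* assumed to mirror the 15-deep limit */

/* Hypothetical helper: walks one bracketed class starting at p.
   Returns a pointer just past the closing ']', or NULL if the input is
   unbalanced or nested more deeply than SKETCH_MAX_DEPTH. */
static const char *sketch_parse_class(const char *p, int depth)
{
  assert(p != NULL);
  if (*p != '[' || depth > SKETCH_MAX_DEPTH) return NULL;
  p++;
  while (*p != '\0' && *p != ']')
    {
    if (*p == '[')
      {
      /* One extra stack frame per nesting level; the depth check above
         is what keeps the total stack usage bounded. */
      p = sketch_parse_class(p, depth + 1);
      if (p == NULL) return NULL;
      }
    else p++;
    }
  return (*p == ']') ? p + 1 : NULL;
}
```

Calling `sketch_parse_class(s, 1)` accepts up to 15 nested brackets and rejects the 16th, so the worst-case frame count is fixed no matter how large each frame becomes under ASAN.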

./RunTest

- name: Test (JIT test program)
37 changes: 37 additions & 0 deletions HACKING
@@ -633,6 +633,43 @@ When XCL_NOT is set, the bit map, if present, contains bits for characters that
are allowed (exactly as for OP_NCLASS), but the list of items that follow it
specifies characters and properties that are not allowed.

The meaning of the bitmap indicated by XCL_MAP is that, if one is present, then
it fully describes which code points < 256 match the class (without needing to
invert the check according to XCL_NOT); the other items in the OP_XCLASS need
not be consulted. However, if a bitmap is not present, then code points < 256
may still match, so the other items in the OP_XCLASS must be consulted.
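
The rule in the paragraph above can be sketched as follows. This is an illustration only, not the PCRE2 source: `sketch_xclass`, `sketch_xclass_match` and the `other_items` callback are hypothetical stand-ins for the encoded OP_XCLASS data, and the only layout assumption is a 32-byte map holding one bit per code point below 256.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an OP_XCLASS body. */
typedef struct {
  bool has_map;                     /* stands for XCL_MAP being set */
  bool negated;                     /* stands for XCL_NOT being set */
  uint8_t map[32];                  /* one bit per code point 0..255 */
  bool (*other_items)(uint32_t c);  /* hypothetical: true if c is listed */
} sketch_xclass;

static bool sketch_xclass_match(const sketch_xclass *xc, uint32_t c)
{
  assert(xc != NULL);
  if (c < 256 && xc->has_map)
    {
    /* The map is authoritative for c < 256; XCL_NOT is not applied here. */
    return (xc->map[c / 8] & (1u << (c & 7))) != 0;
    }
  /* No map, or c >= 256: the listed items decide, inverted by XCL_NOT. */
  bool listed = xc->other_items != NULL && xc->other_items(c);
  return xc->negated ? !listed : listed;
}
```

The early return is the point of the paragraph: once a map is present, nothing else needs to be consulted for code points below 256.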

For classes containing logical expressions, such as "[\p{Greek} && \p{Lu}]" for
"uppercase Greek letters", OP_ECLASS is used. The expression is encoded as a
stack-based series of operands and operators, in Reverse Polish Notation. Like
an OP_XCLASS, the OP_ECLASS is first followed by a LINK_SIZE value containing
the total length of the opcode and its data. That is followed by a code unit
containing flags: currently just ECL_MAP indicating that a bit map is present.
There follows the bit map, if ECL_MAP is set. Finally, there is a sequence of
items that are either an operand or operator. Each item starts with a single
code unit containing its type:

  ECL_AND     AND; no additional data
  ECL_OR      OR; no additional data
  ECL_XOR     XOR; no additional data
  ECL_NOT     NOT; no additional data
Collaborator

I think you mentioned that NOT was removed. Is there a special type (sequence) for all/nothing?

Member Author

I hadn't mentioned them in the HACKING document, because they are not part of a fully-formed OP_ECLASS, but yes, there are two values ECL_ANY and ECL_NONE.

When one of these appears as the LHS/RHS of a binary operator, it can be constant-folded away. Therefore, in the final return value from compile_class_nested(), the result either has no instances of ECL_ANY or ECL_NONE, or the result is a single ECL_ANY or ECL_NONE item. In this case, we don't need to wrap it in an OP_ECLASS, so at match-time, an OP_ECLASS will only ever contain ECL_XCLASS items plus operators.

Member Author

Also, very unfortunately... I had to bring back the NOT opcode. There's one case I discovered that I couldn't fold, and I reworked the code a few times, but in the end, it was better with the NOT.

This is the case: [[\p{Thai} & \p{Digits}] ~~ [^z]]. If you have a compound expression, on the LHS of an XOR (~~), then we need to be able to fold the RHS into the LHS. However, because the RHS is all ones, we need to flip (invert) the LHS. At this point, it's too late to go back and do that - I really want the whole optimisation pass to be a single, left-to-right pass over the META codes, without needing backtracking or fixups.

By far the easiest solution is just to place a NOT at the end. In theory... it could be folded away, with more effort, but the code just ends up so ugly I decided not to do that.
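
The folding rules discussed in this thread can be sketched as a small truth-table helper. This is a hedged illustration, not the code in compile_class_nested(): the names (`sketch_kind`, `sketch_fold_binary`, and so on) are hypothetical, and the sketch models only the logical identities, not how the real single left-to-right pass tracks its state.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_NONE, SK_ANY, SK_EXPR } sketch_kind;  /* SK_EXPR: a real operand */
typedef enum { SK_AND, SK_OR, SK_XOR } sketch_op;

typedef struct {
  sketch_kind kind;    /* what remains after folding */
  bool append_not;     /* a trailing NOT must be emitted after the operand */
  bool keep_operator;  /* nothing could be folded; emit the operator as-is */
} sketch_fold;

static sketch_fold sketch_fold_binary(sketch_op op, sketch_kind lhs, sketch_kind rhs)
{
  sketch_fold r = { SK_EXPR, false, false };
  assert(op == SK_AND || op == SK_OR || op == SK_XOR);

  /* All three operators are commutative, so move a lone constant to the right. */
  if (lhs != SK_EXPR && rhs == SK_EXPR)
    { sketch_kind t = lhs; lhs = rhs; rhs = t; }
  if (rhs == SK_EXPR) { r.keep_operator = true; return r; }

  switch (op)
    {
    case SK_AND: r.kind = (rhs == SK_ANY) ? lhs : SK_NONE; break;
    case SK_OR:  r.kind = (rhs == SK_ANY) ? SK_ANY : lhs;  break;
    case SK_XOR:
      if (rhs == SK_NONE) r.kind = lhs;                   /* X ^ NONE == X */
      else if (lhs == SK_EXPR) r.append_not = true;       /* X ^ ANY == NOT X */
      else r.kind = (lhs == SK_ANY) ? SK_NONE : SK_ANY;   /* constant ^ ANY */
      break;
    }
  return r;
}
```

The XOR-with-ANY branch is the case described above: the already-emitted left-hand operand cannot be rewritten in place, so the sketch records that a NOT has to follow it.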

  ECL_XCLASS  The additional data which follows ECL_XCLASS is the same as for
              an OP_XCLASS, except that this data is preceded by ECL_XCLASS
              rather than OP_XCLASS.
              Because the OP_ECLASS has its own bitmap (if required), an
              ECL_XCLASS should not contain a bitmap.

Additionally, there are two intermediate values used during compilation, but
these are folded away during generation of the opcode, and so never appear
inside an OP_ECLASS at match time. They are:

  ECL_ANY     match all characters; no additional data
  ECL_NONE    match no characters; no additional data

The meaning of the bitmap indicated by ECL_MAP is different to that of XCL_MAP
for OP_XCLASS, in one way. The ECL_MAP bitmap is present whenever any code
points < 256 match the class, so if it is absent, no code point < 256 can match.
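
As an illustration of the stack-based Reverse Polish encoding described above, here is a minimal evaluator sketch. It is not the PCRE2 matcher: the item struct, the `sketch_` names and the ASCII predicates are hypothetical stand-ins (a real OP_ECLASS stores code units and ECL_XCLASS data, not function pointers), and the ECL_MAP bitmap is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { IT_AND, IT_OR, IT_XOR, IT_NOT, IT_OPERAND } sketch_item_type;

typedef struct {
  sketch_item_type type;
  bool (*matches)(uint32_t c);   /* used only when type == IT_OPERAND */
} sketch_item;

#define SKETCH_STACK_MAX 32      /* arbitrary bound chosen for the sketch */

/* Stand-in operands; a real class would test Unicode properties. */
static bool sketch_is_ascii_alpha(uint32_t c)
{ return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
static bool sketch_is_ascii_upper(uint32_t c)
{ return c >= 'A' && c <= 'Z'; }

/* Evaluate n items in RPN order against code point c. */
static bool sketch_eclass_eval(const sketch_item *items, size_t n, uint32_t c)
{
  bool stack[SKETCH_STACK_MAX];
  size_t sp = 0;
  for (size_t i = 0; i < n; i++)
    {
    switch (items[i].type)
      {
      case IT_OPERAND:
        assert(sp < SKETCH_STACK_MAX);
        stack[sp++] = items[i].matches(c);   /* push the operand's result */
        break;
      case IT_NOT:
        assert(sp >= 1);
        stack[sp - 1] = !stack[sp - 1];      /* unary: rewrite the top */
        break;
      case IT_AND: case IT_OR: case IT_XOR:
        {
        assert(sp >= 2);
        bool rhs = stack[--sp];              /* pop two, push one */
        bool lhs = stack[sp - 1];
        stack[sp - 1] = (items[i].type == IT_AND) ? (lhs && rhs)
                      : (items[i].type == IT_OR)  ? (lhs || rhs)
                      : (lhs != rhs);
        break;
        }
      }
    }
  assert(sp == 1);                           /* well-formed RPN leaves one value */
  return stack[0];
}
```

In this model, a class like "[\p{Greek} && \p{Lu}]" would correspond to the sequence operand, operand, IT_AND, and a trailing IT_NOT simply inverts whatever is on top of the stack.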


Back references
---------------
2 changes: 1 addition & 1 deletion doc/pcre2test.1
@@ -524,7 +524,7 @@ it is preferred to use \eN{U+hh...} when describing characters. When testing
the 8-bit library not in UTF-8 mode, \ex{hh} generates one byte for values
that could fit on it, and causes an error for greater values.
.P
-When testing te 16-bit library, not in UTF-16 mode, all 4-digit \ex{hhhh}
+When testing the 16-bit library, not in UTF-16 mode, all 4-digit \ex{hhhh}
values are accepted. This makes it possible to construct invalid UTF-16
sequences for testing purposes.
.P
24 changes: 13 additions & 11 deletions src/pcre2_auto_possess.c
@@ -480,13 +480,13 @@ switch(c)

case OP_NCLASS:
case OP_CLASS:
+#ifdef SUPPORT_WIDE_CHARS
case OP_XCLASS:
case OP_ECLASS:
-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
if (c == OP_XCLASS || c == OP_ECLASS)
end = code + GET(code, 0) - 1;
else
+#endif
end = code + 32 / sizeof(PCRE2_UCHAR);
class_end = end;

@@ -1118,17 +1118,15 @@ for(;;)
list_ptr[2] + LINK_SIZE, (const uint8_t*)cb->start_code, utf))
return FALSE;
break;
-#endif

-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Enclose in "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
case OP_ECLASS:
if (PRIV(eclass)(chr,
(list_ptr == list ? code : base_end) - list_ptr[2] + LINK_SIZE,
(list_ptr == list ? code : base_end) - list_ptr[3],
(const uint8_t*)cb->start_code, utf))
return FALSE;
break;
+#endif /* SUPPORT_WIDE_CHARS */

default:
return FALSE;
@@ -1236,13 +1234,17 @@ for (;;)
}
c = *code;
}
-else if (c == OP_CLASS || c == OP_NCLASS || c == OP_XCLASS || c == OP_ECLASS)
+else if (c == OP_CLASS || c == OP_NCLASS
+#ifdef SUPPORT_WIDE_CHARS
+|| c == OP_XCLASS || c == OP_ECLASS
+#endif
+)
{
-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
+#ifdef SUPPORT_WIDE_CHARS
if (c == OP_XCLASS || c == OP_ECLASS)
repeat_opcode = code + GET(code, 1);
else
+#endif
repeat_opcode = code + 1 + (32 / sizeof(PCRE2_UCHAR));

c = *repeat_opcode;
@@ -1315,12 +1317,12 @@ for (;;)
code += GET(code, 1 + 2*LINK_SIZE);
break;

-/* TODO: [EC] https://github.com/PCRE2Project/pcre2/issues/537
-Add back the "ifdef SUPPORT_WIDE_CHARS" once we stop emitting ECLASS for this case. */
-case OP_ECLASS:
+#ifdef SUPPORT_WIDE_CHARS
case OP_XCLASS:
+case OP_ECLASS:
code += GET(code, 1);
break;
+#endif

case OP_MARK:
case OP_COMMIT_ARG: