Add folding and simplification for OP_ECLASS #586
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -633,6 +633,43 @@ When XCL_NOT is set, the bit map, if present, contains bits for characters that | |
| are allowed (exactly as for OP_NCLASS), but the list of items that follow it | ||
| specifies characters and properties that are not allowed. | ||
|
|
||
| The meaning of the bitmap indicated by XCL_MAP is that, if one is present, then | ||
| it fully describes which code points < 256 match the class (without needing to | ||
| invert the check according to XCL_NOT); the other items in the OP_XCLASS need | ||
| not be consulted. However, if a bitmap is not present, then code points < 256 | ||
| may still match, so the other items in the OP_XCLASS must be consulted. | ||
NWilson marked this conversation as resolved.
|
||
|
|
||
| For classes containing logical expressions, such as "[\p{Greek} && \p{Lu}]" for | ||
| "uppercase Greek letters", OP_ECLASS is used. The expression is encoded as a | ||
| stack-based series of operands and operators, in Reverse Polish Notation. Like | ||
| an OP_XCLASS, the OP_ECLASS is first followed by a LINK_SIZE value containing | ||
| the total length of the opcode and its data. That is followed by a code unit | ||
| containing flags: currently just ECL_MAP indicating that a bit map is present. | ||
| There follows the bit map, if ECL_MAP is set. Finally, there is a sequence of | ||
| items, each either an operand or an operator. Each item starts with a single | ||
| code unit containing its type: | ||
|
|
||
| ECL_AND AND; no additional data | ||
| ECL_OR OR; no additional data | ||
| ECL_XOR XOR; no additional data | ||
| ECL_NOT NOT; no additional data | ||
|
Collaborator
I think you mentioned that NOT was removed. Is there a special type (sequence) for all/nothing?
Member
Author
I hadn't mentioned them in the HACKING document, because they are not part of a fully-formed OP_ECLASS, but yes, there are two values ECL_ANY and ECL_NONE. When one of these appears as the LHS/RHS of a binary operator, it can be constant-folded away. Therefore, in the final return value from
Member
Author
Also, very unfortunately... I had to bring back the NOT opcode. There's one case I discovered that I couldn't fold, and I reworked the code a few times, but in the end, it was better with the NOT. This is the case: By far the easiest solution is just to place a NOT at the end. In theory... it could be folded away, with more effort, but the code just ends up so ugly I decided not to do that.
||
| ECL_XCLASS The additional data which follows ECL_XCLASS is the same as for | ||
| an OP_XCLASS, except that this data is preceded by ECL_XCLASS | ||
| rather than OP_XCLASS. | ||
| Because the OP_ECLASS has its own bitmap (if required), an | ||
| ECL_XCLASS should not contain a bitmap. | ||
|
|
||
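To make the stack-based encoding concrete, here is a minimal C sketch of evaluating such an RPN item sequence for one code point. It is an illustration only, not PCRE2's matcher: the `EX_*` names, the `ex_item` struct, and the function-pointer operands are hypothetical stand-ins (in the real encoding the operands are ECL_XCLASS items in the code-unit stream), and the 32-entry stack is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical item types; the real ECL_* values and layout live in PCRE2. */
enum { EX_AND, EX_OR, EX_XOR, EX_NOT, EX_OPERAND };

typedef struct {
  int type;                       /* one of the EX_* values above */
  bool (*match)(unsigned int c);  /* used only when type == EX_OPERAND */
} ex_item;

/* Two example operands, standing in for ECL_XCLASS items. */
static bool is_ascii_upper(unsigned int c) { return c >= 'A' && c <= 'Z'; }

static bool is_ascii_vowel(unsigned int c)
{
  switch (c) {
  case 'A': case 'E': case 'I': case 'O': case 'U':
  case 'a': case 'e': case 'i': case 'o': case 'u':
    return true;
  default:
    return false;
  }
}

/* Evaluate an RPN item sequence for code point c.
   Returns 1 (match), 0 (no match), or -1 if the sequence is malformed. */
static int ex_eval(const ex_item *items, size_t n, unsigned int c)
{
  bool stack[32];
  size_t sp = 0;

  assert(items != NULL || n == 0);

  for (size_t i = 0; i < n; i++) {
    switch (items[i].type) {
    case EX_OPERAND:              /* push the operand's result */
      if (sp == 32) return -1;
      stack[sp++] = items[i].match(c);
      break;
    case EX_NOT:                  /* unary: invert the top of the stack */
      if (sp < 1) return -1;
      stack[sp - 1] = !stack[sp - 1];
      break;
    case EX_AND:
    case EX_OR:
    case EX_XOR: {                /* binary: pop two, push one */
      if (sp < 2) return -1;
      bool rhs = stack[--sp];
      bool lhs = stack[sp - 1];
      stack[sp - 1] = (items[i].type == EX_AND) ? (lhs && rhs)
                    : (items[i].type == EX_OR)  ? (lhs || rhs)
                                                : (lhs != rhs);
      break;
    }
    default:
      return -1;
    }
  }
  /* A well-formed expression leaves exactly one value behind. */
  return (sp == 1) ? (int)stack[0] : -1;
}
```

With the items `is_ascii_upper`, `is_ascii_vowel`, `EX_AND` (the RPN form of "upper AND vowel"), `ex_eval` matches 'A' but not 'B' or 'a'.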
| Additionally, there are two intermediate values used during compilation, but | ||
| these are folded away during generation of the opcode, and so never appear | ||
| inside an OP_ECLASS at match time. They are: | ||
|
|
||
| ECL_ANY match all characters; no additional data | ||
| ECL_NONE match no characters; no additional data | ||
|
|
||
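Folding ECL_ANY and ECL_NONE away follows from ordinary Boolean identities. A minimal sketch, assuming a hypothetical three-way classification of each operand; the `K_*`, `F_*`, and `R_*` names and the `fold` function are invented for this illustration and are not PCRE2's compiler code:

```c
#include <stdbool.h>

/* Hypothetical operand kinds: the two constants, or anything else. */
typedef enum { K_NONE, K_ANY, K_OTHER } fold_kind;
typedef enum { F_AND, F_OR, F_XOR } fold_op;

/* What a binary operation reduces to after folding. */
typedef enum {
  R_NONE,     /* result is "match nothing" */
  R_ANY,      /* result is "match everything" */
  R_LHS,      /* result is the left operand, unchanged */
  R_RHS,      /* result is the right operand, unchanged */
  R_NOT_LHS,  /* result is the complement of the left operand */
  R_NOT_RHS,  /* result is the complement of the right operand */
  R_KEEP      /* neither side is constant: keep the operator */
} fold_result;

static fold_result fold(fold_op op, fold_kind lhs, fold_kind rhs)
{
  if (lhs == K_OTHER && rhs == K_OTHER) return R_KEEP;

  if (lhs != K_OTHER && rhs != K_OTHER) {
    /* Both sides constant: evaluate directly. */
    bool a = (lhs == K_ANY), b = (rhs == K_ANY);
    bool r = (op == F_AND) ? (a && b) : (op == F_OR) ? (a || b) : (a != b);
    return r ? R_ANY : R_NONE;
  }

  /* Exactly one side is constant; all three operators are commutative. */
  bool const_is_lhs = (lhs != K_OTHER);
  fold_kind k = const_is_lhs ? lhs : rhs;
  fold_result same = const_is_lhs ? R_RHS : R_LHS;
  fold_result inverted = const_is_lhs ? R_NOT_RHS : R_NOT_LHS;

  switch (op) {
  case F_AND: return (k == K_ANY) ? same : R_NONE;     /* ANY&x=x, NONE&x=NONE */
  case F_OR:  return (k == K_ANY) ? R_ANY : same;      /* ANY|x=ANY, NONE|x=x  */
  case F_XOR: return (k == K_ANY) ? inverted : same;   /* ANY^x=!x, NONE^x=x   */
  }
  return R_KEEP;  /* not reached for a valid op */
}
```

XOR with "any" is the one identity in this table whose result is the complement of the other operand rather than a constant or the operand itself.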
| The meaning of the bitmap indicated by ECL_MAP differs from that of XCL_MAP | ||
| for OP_XCLASS in one way: the ECL_MAP bitmap is present whenever any code | ||
| points < 256 match the class, so if it is absent, no code point < 256 matches. | ||
|
|
||
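The two bitmap rules can be contrasted in a short C sketch for a code point below 256. This assumes a 32-byte (256-bit) map with one bit per code point, least significant bit first within each byte; the function and enum names are hypothetical, and a NULL pointer stands in for "no bitmap present".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
  LOW_MATCH,        /* the code point matches the class */
  LOW_NO_MATCH,     /* the code point does not match */
  LOW_CHECK_ITEMS   /* undecided: the other items must be consulted */
} low_result;

/* Test bit c of a 256-bit map; the caller guarantees c < 256. */
static bool bitmap_test(const uint8_t *map, unsigned int c)
{
  return (map[c / 8] & (1u << (c % 8))) != 0;
}

/* XCL_MAP rule: a present map is decisive; an absent map decides nothing. */
static low_result xclass_low(const uint8_t *map, unsigned int c)
{
  if (map == NULL) return LOW_CHECK_ITEMS;
  return bitmap_test(map, c) ? LOW_MATCH : LOW_NO_MATCH;
}

/* ECL_MAP rule: a present map is decisive; an absent map means that
   no code point below 256 matches. */
static low_result eclass_low(const uint8_t *map, unsigned int c)
{
  if (map == NULL) return LOW_NO_MATCH;
  return bitmap_test(map, c) ? LOW_MATCH : LOW_NO_MATCH;
}
```

The only difference is the NULL branch: with this reading of the ECL_MAP rule, an OP_ECLASS without a bitmap can reject every code point below 256 without evaluating the expression.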
|
|
||
| Back references | ||
| --------------- | ||
|
|
||
Do we know why?
Yes. The fuzz tests generate very deep stacks. The stack usage for ordinary builds is quite sensible (15 * 32 bytes), but the stack frames and stack allocations have absurdly-large amounts of guard space in the ASAN builds, so I had to bump this higher.
The eclass code (with a 15-deep parenthesis limit) uses similar or less stack than the main regex parser, which is also recursive-descent with a 255-deep limit.