-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathregex-internals.h
More file actions
279 lines (242 loc) · 10.5 KB
/
regex-internals.h
File metadata and controls
279 lines (242 loc) · 10.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
/*
===============================================================================
librex-ast - A PCRE2-Compatible Regex Engine
Author: Mounir IDRASSI <mounir.idrassi@amcrypto.jp>
Date: July 19, 2025
License: MIT
Description:
------------
This file is part of a high-performance, feature-rich, and PCRE2-compatible
regular expression engine written in C. The library implements both a
sophisticated parser and a bytecode execution engine (virtual machine) to
provide a complete compile-and-match solution. It is designed for
portability, performance, and API clarity, with extensive support for
modern regex features including Unicode properties, advanced grouping,
and recursive patterns.
Key Architectural Features:
---------------------------
- Two-Stage Compilation:
1. A recursive descent parser builds a detailed Abstract Syntax Tree (AST)
from the regex pattern.
2. An AST-to-bytecode compiler translates the tree into a linear, compact
instruction set for the VM.
- NFA-based Virtual Machine (VM):
* A custom VM executes the compiled bytecode to perform the match.
* Implements a non-recursive, stack-based backtracking NFA algorithm.
Alternative execution paths (NFA states) are managed on an explicit stack.
* Uses a "visited" set for memoization to prevent redundant work and handle
complex patterns with overlapping subproblems efficiently.
- Pluggable Memory Management:
* Core API supports custom allocators ('malloc', 'realloc', 'free'),
allowing integration into projects with specific memory strategies.
* The parser uses an internal arena allocator for efficient AST node
management during compilation.
- Comprehensive PCRE2 Compatibility:
* Supports a wide array of advanced constructs found in PCRE2 and Perl.
* The implementation is validated by a test suite covering syntax, matching,
edge cases, and error conditions.
- Detailed Error Reporting:
* Provides structured error objects with error codes, messages, and the
exact line/column number of the error in the pattern.
- Unicode-Awareness:
* Full UTF-8 support in both the parser and the matching engine.
* Built-in support for Unicode property matching (\p, \P) using a
partial, internal Unicode character database to generate efficient bitmaps.
Implementation Details:
-----------------------
- Parser:
* Recursive descent with two-phase fixup for resolving forward references
(e.g., '\k<name>' before '(?<name>...)').
* Detailed tracking of parser state, including capture counts, named groups,
and inline flag modifiers.
* Semantic validation, including fixed-width checks for lookbehind assertions.
- AST-to-Bytecode Compiler:
* Translates the AST into a simple and efficient instruction set (e.g.,
CHAR, ANY, SPLIT, JMP, SAVE, CALL).
* Capturing groups are compiled into self-contained, callable subroutines
invoked via dedicated I_CALL and I_RETURN instructions.
- NFA Virtual Machine (VM):
* The core matching logic is a loop processing VM instructions.
* Backtracking is managed by pushing alternative execution paths (threads)
onto a stack.
* Instructions for advanced features like atomic groups ('I_MARK_ATOMIC',
'I_CUT_TO_MARK'), conditionals ('I_GCOND'), and assertions ('I_ACOND', 'I_LBCOND').
- Unicode:
* Safe, single-pass UTF-8 decoding.
* Unicode property matching uses a built-in table of character ranges
to build bitmaps. These bitmaps are allocated in the AST's memory arena
for efficient cleanup.
* Unified character class builder handles standard classes ('[a-z]'),
shorthands ('\d', '\w'), and POSIX classes ('[[:digit:]]') in a
Unicode-aware manner.
- API:
* Clean, two-stage API ('regex_compile', 'regex_match', 'regex_free').
* Opaque 'regex_compiled*' handle encapsulates the compiled pattern.
* Match results are returned in a structured, easy-to-use format.
Supported Regex Constructs:
---------------------------
Basic Elements:
- Literal characters (full Unicode support)
- Character classes '[abc]', '[^abc]', '[a-z]'
- Predefined classes: '\d', '\D', '\w', '\W', '\s', '\S'
- Dot metacharacter '.' (respects single-line mode)
- Anchors: '^', '$', '\A', '\z', '\b', '\B'
Quantifiers:
- Greedy: '*', '+', '?', '{n}', '{n,}', '{n,m}'
- Lazy (Non-greedy): '*?', '+?', '??', '{n,m}?'
- Possessive: '*+', '++', '?+', '{n,m}+'
Groups:
- Capturing groups: '(...)'
- Non-capturing groups: '(?:...)'
- Named groups: '(?<name>...)', '(?'name'...)'
- Atomic groups: '(?>...)'
- Branch-reset groups: '(?|...)'
Assertions:
- Positive lookahead: '(?=...)'
- Negative lookahead: '(?!...)'
- Positive lookbehind: '(?<=...)'
- Negative lookbehind: '(?<!...)'
Backreferences:
- Numbered: '\1', '\2', etc.
- Named: '\k<name>', '\k'name''
Conditionals:
- By group number: '(?(1)yes|no)'
- By group name: '(?(<name>)yes|no)'
- By assertion: '(?(?=...)yes|no)'
Subroutines:
- Full pattern recursion: '(?R)'
- By group number: '(?1)', '(?2)', etc.
- By group name: '(?&name)'
Modifiers & Comments:
- Inline flags: '(?i)', '(?-m)', etc.
- Scoped flags: '(?i:...)'
- Comments: '(?#...)'
Unicode & Escapes:
- UTF-8 input processing and validation.
- Unicode properties: '\p{L}', '\P{Sc}', etc.
- Hex escapes: '\x20', '\x{1F600}'
- Quoted sequences: '\Q...\E'
- Partial POSIX support (common classes like [[:alpha:]], Unicode-aware where database allows).
Current Limitations:
--------------------
- No AST or bytecode optimization passes are currently performed.
- Lookbehind assertions must be fixed-length (variable-length lookbehind is not supported).
- Maximum lookbehind length is 255 characters (PCRE2 compatible).
- The built-in Unicode property support is based on a partial character database and does not cover all scripts or categories.
- Recursion/subroutine depth limited to 32 (MAX_CALL_DEPTH).
- POSIX classes partially supported and Unicode-aware only for covered properties.
- No full grapheme matching or script runs.
- No support for '\g{...}' backreference/subroutine syntax (use '\k<>', '(?n)' instead).
- No support for script runs or grapheme clusters ('\X').
- No support for generic newline sequences ('\R').
- No support for control verbs like '(*SKIP)', '(*FAIL)', '(*ACCEPT)'.
- No support for callouts.
===============================================================================
*/
#ifndef REGEX_INTERNALS_H
#define REGEX_INTERNALS_H
#include "regex-parser.h"
//==============================================================================
//
// PRIVATE INTERNALS - DO NOT USE DIRECTLY
//
//==============================================================================
#define MAX_CACHED_PROPERTIES 32
typedef struct Block {
void *data;
size_t used;
size_t cap;
struct Block *next;
} Block;
typedef struct AstArena {
Block *blocks;
size_t total_allocated;
regex_allocator allocator;
struct {
char* name; // Or use a dynamic array if needed
uint32_t* bitmap;
} property_cache[MAX_CACHED_PROPERTIES];
int property_cache_count;
} AstArena;
// --- Bytecode Instructions ---
typedef enum {
I_END, I_CHAR, I_ANY, I_SPLIT, I_JMP, I_SAVE,
I_RANGE, I_UNIPROP,
I_BOUND, /* word‑boundary (\b) */
I_NBOUND, /* non‑word‑boundary (\B) */
I_SBOL, I_SEOL, /* begin/end of subject (\A, \z) */
I_BOL, I_EOL, I_MATCH,
I_BACKREF, I_GCOND,
I_ACOND, I_ASUCCESS, I_CALL, I_RETURN,
I_MARK_ATOMIC, I_CUT_TO_MARK,
I_LBCOND,
I_MBOL, I_MEOL
} IType;
typedef struct {
uint8_t op;
uintptr_t val; /* char / index / bit pattern */
int32_t x; /* addr1 for JMP / SPLIT / sub_pattern_pc */
int32_t y; /* addr2 for SPLIT / no_branch_pc */
} Instr;
typedef struct {
int group_index;
char *group_name;
size_t start_pc;
} SubroutineDef;
typedef struct {
SubroutineDef *defs;
int count;
int capacity;
} SubroutineTable;
// --- Internal data structures ---
typedef enum {
REGEX_NODE_CHAR, REGEX_NODE_DOT, REGEX_NODE_ANCHOR, REGEX_NODE_CHAR_CLASS, REGEX_NODE_CONCAT,
REGEX_NODE_ALTERNATION, REGEX_NODE_QUANTIFIER, REGEX_NODE_GROUP, REGEX_NODE_BACKREF, REGEX_NODE_ASSERTION,
REGEX_NODE_COMMENT, REGEX_NODE_UNI_PROP, REGEX_NODE_BRESET_GROUP, REGEX_NODE_CONDITIONAL, REGEX_NODE_SUBROUTINE
} RegexNodeType;
typedef enum {
ASSERT_LOOKAHEAD_POS, ASSERT_LOOKAHEAD_NEG,
ASSERT_LOOKBEHIND_POS, ASSERT_LOOKBEHIND_NEG
} AssertionType;
typedef enum { QUANT_GREEDY, QUANT_LAZY, QUANT_POSSESSIVE } QuantifierType;
typedef enum { COND_INVALID = 0, COND_NUMERIC, COND_NAMED, COND_ASSERTION } ConditionType;
typedef struct Condition {
ConditionType type;
union {
int group_index;
char *group_name;
struct RegexNode *assertion;
} data;
} Condition;
typedef struct RegexNode {
RegexNodeType type;
int token_start;
int token_end;
union {
uint32_t codepoint;
char anchor_type;
struct { char *set; bool negated; bool is_posix; } char_class;
struct { struct RegexNode *left; struct RegexNode *right; } children;
struct { struct RegexNode *child; int min; int max; QuantifierType quant_type; } quantifier;
struct { struct RegexNode *child; int capture_index; char *name; bool is_atomic; uint32_t enter_flags; uint32_t exit_flags; } group;
struct { int ref_index; char *ref_name; } backref;
struct { struct RegexNode *child; AssertionType assert_type; } assertion;
struct { bool negated; char *prop_name; } uni_prop;
struct { Condition cond; struct RegexNode *if_true; struct RegexNode *if_false; } conditional;
struct { bool is_recursion; int target_index; char *target_name; } subroutine;
} data;
} RegexNode;
// The full, internal definition of the compiled regex object.
struct regex_compiled {
RegexNode* ast; // The AST (kept for debugging/inspection)
AstArena* arena; // Arena for AST and compile-time data
uint32_t flags;
int capture_count;
regex_allocator allocator;
// --- Pre-compiled Bytecode ---
Instr* code; // The compiled bytecode
size_t pc; // Number of instructions in the bytecode
};
// Internal function to perform the AST -> Bytecode compilation step.
int compile_regex_to_bytecode(struct regex_compiled* rx, regex_err* error);
#endif // REGEX_INTERNALS_H