libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.

+> Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
+> This means the same pattern can have different execution costs depending on the input.
+>
+> libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
+> At runtime, matching is a single linear pass over the input with no alternative paths to explore.
+>
+> As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
+
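To make the "single linear pass" idea concrete, here is a minimal hand-written sketch of a DFA-style matcher in Go for the pattern `^item-[A-Z]{3}\z`. It is an illustration only and not libfsm's generated output; the function name `matchItem` and the state numbering are assumptions made for this example. Each input byte is inspected exactly once, and a byte with no valid transition rejects immediately, so there is nothing to backtrack into.

```go
package main

import "fmt"

// acceptState is reached after "item-" (5 bytes) plus three uppercase letters.
const acceptState = 8

// matchItem is a hypothetical, hand-written DFA for ^item-[A-Z]{3}\z.
// The state is simply the number of bytes accepted so far.
func matchItem(s string) bool {
	const prefix = "item-"
	state := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case state < len(prefix) && c == prefix[state]:
			state++ // still matching the literal prefix
		case state >= len(prefix) && state < acceptState && c >= 'A' && c <= 'Z':
			state++ // matching one of the three uppercase letters
		default:
			return false // no transition for this byte: reject at once
		}
	}
	// Accept only if the input ended exactly in the accepting state.
	return state == acceptState
}

func main() {
	for _, s := range []string{"item-ABC", "item-ABCD", "item-abc", "xitem-ABC"} {
		fmt.Printf("%q -> %v\n", s, matchItem(s))
	}
}
```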
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
-> Adding code generation for new languages is template-driven and straightforward.
+> Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).

---

## Workflow Overview

-libfsm provides two main tools: **`re`** takes patterns from command line, **`rx`** takes patterns from file.
+libfsm provides two main tools:
+- **`re`** takes patterns from the command line
+- **`rx`** takes patterns from a file
+
+A recommended workflow when using libfsm is:

### 1. Validate the Regex

-Test behavior using any PCRE-compatible tool (e.g., [https://regex101.com/](https://regex101.com/)).
+Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).

### 2. Verify libfsm Compatibility

```bash
re -r pcre -l ast 'x*?'
# Output: /x*?/:3: Unsupported operator
+# :3 indicates that the character at offset 3 in the pattern is rejected.

rx -r pcre -l ast -d declined.txt 'x*?'
# Unsupported character in declined.txt
@@ -67,7 +90,7 @@ If unsupported constructs exist, libfsm reports the failing location.
### 3. Generate Code

```bash
-re -p -r pcre -l rust -k str '^item-[A-Z]{3}$'> item_detector.rs
+re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
|`-r`|Regex dialect |`pcre`, `literal`, `glob`, `native`, `sql`|`pcre` supports the widest set of features|
+|`-l`|Output language for printing |`go`, `rust`, `vmc`, `llvm`, `wasm`, `dot`| Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
+|`-k`| Generated function I/O API |`str`, `getc`, `pair`|`str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
+|`-p`|Print mode |*(no value)*|Abbreviation of `-l fsm`. Print the constructed FSM, rather than executing it.|
+|`-d`|Declined | filename | Only applies to `rx` (batch mode)|

-For more detailed information on flags, see [include/fsm/options.h](../include/fsm/options.h) and the man pages (by running `build/man/re.1/re.1` after `bmake doc`).
+This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
+The man pages can be built by running `bmake doc` and then viewed with `build/man/re.1/re.1`.

---

## Writing Effective libfsm Patterns

-For additional regex best practices, see [Fastly's regex guide](https://www.fastly.com/documentation/reference/vcl/regex/#best-practices-and-common-mistakes).
-
### 1. Replace Broad Wildcards

-Avoid `.*` whenever possible. Use negated character classes:
+Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise and forces libfsm to build a large DFA.
+
+For example, a double-quoted string should not use `".*?"` because the content cannot contain an unescaped quote.
+Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
+
+Use negated character classes:

| Avoid | Better |
| ---------- | -------------- |
|`<.*>`|`<[^>]*>`|
|`\((.*)\)`|`\([^)]*\)`|
-|`price=.*`|`price=[0-9]+`|
+|`price=.+`|`price=[0-9]+`|
+|`var\s.+=`|`var\s[^=]+=`|
+
+> This is often the cause of an “explosion” in the size of the generated FSM.
+>
+> See [Compilation Takes Too Long](#compilation-takes-too-long) for more details.

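To make the precision difference in the first table row concrete, the sketch below compares `<.*>` with `<[^>]*>` using Go's standard `regexp` package. This is a stand-in chosen for illustration: it uses RE2 syntax rather than libfsm, it shows only what each pattern matches, and it says nothing about DFA size. The helper name `firstMatch` is an assumption for this example. On input containing several tags, the wildcard version runs to the last `>`, while the negated class stops at the first one.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the leftmost match of pattern in input ("" if none).
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	input := "<b>bold</b> and <i>italic</i>"

	// The broad wildcard swallows everything up to the final '>'.
	fmt.Printf("%q\n", firstMatch(`<.*>`, input))

	// The negated class cannot cross a '>' and so matches a single tag.
	fmt.Printf("%q\n", firstMatch(`<[^>]*>`, input))
}
```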
---

-### 2. Anchor When You Require Full Matches
+### 2. Anchor When Matching Full String

-FSMs only do what’s specified. Explicitly anchor when matching entire strings:
+When the intention is to match an entire string, use anchors.
+Use `^` at the beginning and `\z` for the true end of the string.

```regex
-^task-[a-z]+-[0-9]{2}\z
+# Correct: matches only this exact hostname
+^web\d+\.example\.com\z
+
+# Incorrect: would match inside a larger string
+web\d+\.example\.com     # also matches "foo-web12.example.com-bar"
```
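The effect of the anchors can be checked quickly with Go's standard `regexp` package, which also accepts `^` and `\z`. This is only a sketch of the matching semantics under RE2 syntax, not a run of libfsm itself; the variable names are chosen for this example. The unanchored pattern finds the hostname inside a larger string, and the anchored pattern rejects it.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Anchored at both ends: the whole input must be the hostname.
	anchored = regexp.MustCompile(`^web\d+\.example\.com\z`)
	// Unanchored: the hostname may appear anywhere in the input.
	unanchored = regexp.MustCompile(`web\d+\.example\.com`)
)

func main() {
	for _, host := range []string{"web12.example.com", "foo-web12.example.com-bar"} {
		fmt.Printf("%-28q anchored=%-5v unanchored=%v\n",
			host, anchored.MatchString(host), unanchored.MatchString(host))
	}
}
```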

-Use `\z` for end-of-string.
+---
+
+### 3. Prefer `\z` Over `$` for End-of-String
+
+`\z` matches only at the very end of the string.
+`$` also matches just before a trailing newline at the end of the string,
+so if you use this in combination with capturing groups, you may not be capturing what you expect.
+Also, `\z` is more efficient, so it is better to use it in places where `\n` cannot appear.
+
+```regex
+# Preferred
+/foo\z
+
+# Risky: $ may allow an extra newline
+/foo$
+```

---

-##Byte Search Optimization (Optional)
+### 4. Escape Special Characters When Used As Literal

-Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
-This quickly jumps to likely match positions instead of scanning every byte.
+Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
+If you mean to match them literally, escape them:

-### Common fast byte search APIs
+```regex
+# Correct: dot is treated literally
+example\.com
+
+# Incorrect: dot matches any character
+example.com      # also matches "exampleXcom"
+```
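When the literal text comes from a variable rather than being typed by hand, escaping can be done programmatically. The sketch below uses Go's `regexp.QuoteMeta`, which escapes RE2 metacharacters; whether its output is accepted unchanged by libfsm's `pcre` dialect for every possible input is an assumption that should be checked with `re` before relying on it. The helper names `literalPattern` and `matches` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPattern returns a pattern that matches s verbatim (RE2 escaping rules).
func literalPattern(s string) string {
	return regexp.QuoteMeta(s)
}

// matches reports whether pattern matches anywhere in input.
func matches(pattern, input string) bool {
	return regexp.MustCompile(pattern).MatchString(input)
}

func main() {
	raw := "example.com"
	escaped := literalPattern(raw)

	fmt.Println(escaped)                         // the dot is now backslash-escaped
	fmt.Println(matches(raw, "exampleXcom"))     // unescaped dot matches any character
	fmt.Println(matches(escaped, "exampleXcom")) // escaped dot matches only "."
	fmt.Println(matches(escaped, "example.com"))
}
```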
+
+---
+
+### 5. Use Non-Capturing Groups
+
+Capture groups are _currently_ not supported (coming soon!).
+If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
+
+```regex
+# Correct
+(?:private|no-store)

-| Language | Function                   |
-| -------- | -------------------------- |
-| Go       |`strings.IndexByte`|
-| Rust     |`memchr::memchr`|
-| C        |`memchr` from `<string.h>`|
+# Unsupported
+(private|no-store)
+```

+---

-### Good candidates
+## Byte Search Optimization (Optional)

-Patterns that always start with uncommon prefix characters, for example:
+Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
+This quickly jumps to likely match positions instead of scanning every byte.
+
+Good candidates are patterns that start with uncommon prefix characters, for example:

```
#tag-[a-z]+
@@ -155,7 +226,9 @@ Patterns that always start with uncommon prefix characters, for example:
"name='[^']+'"
```

-These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) rarely appear in normal text, making a byte search highly effective.
+These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte search can skip ahead before running the matcher.
+
+We found that using `strings.IndexByte` before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
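A rough sketch of that byte-scan prefilter in Go follows, using the example pattern `#tag-[a-z]+`. The function `matchTagAt` is a hypothetical hand-written stand-in for a libfsm-generated matcher (the real generated function's name and signature are not shown here and may differ), and `containsTag` is likewise an illustrative name. The idea is that `strings.IndexByte` jumps to each `#`, and the matcher runs only from those positions.

```go
package main

import (
	"fmt"
	"strings"
)

// matchTagAt is a hypothetical stand-in for a generated matcher.
// It reports whether s begins with a match for #tag-[a-z]+.
func matchTagAt(s string) bool {
	const prefix = "#tag-"
	if !strings.HasPrefix(s, prefix) {
		return false
	}
	rest := s[len(prefix):]
	return len(rest) > 0 && rest[0] >= 'a' && rest[0] <= 'z'
}

// containsTag skips to each '#' with strings.IndexByte and only then
// runs the matcher, instead of trying the matcher at every offset.
func containsTag(s string) bool {
	for off := 0; off < len(s); {
		i := strings.IndexByte(s[off:], '#')
		if i < 0 {
			return false // no candidate start byte remains
		}
		off += i
		if matchTagAt(s[off:]) {
			return true
		}
		off++ // this '#' did not start a match; keep scanning after it
	}
	return false
}

func main() {
	long := strings.Repeat("x", 5000) + " #tag-release"
	fmt.Println(containsTag(long))
	fmt.Println(containsTag("no tags here"))
	fmt.Println(containsTag("# #tag-"))
}
```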

---

@@ -171,16 +244,20 @@ Pattern:

Will compile to code that always returns true.

+This is only an issue if that is not what you intend.
+
**Fix options:**

* Require at least one match: `\s+`
-* Anchor context: `^\s+$`
-* Or alternatively, use `-Fb` flag
+* Anchor context: `^\s+$`, or alternatively use the `-Fb` flag
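The pattern under discussion is elided from this hunk, so the sketch below assumes it is something like `\s*`, which can match the empty string. It uses Go's standard `regexp` package (RE2 syntax, not libfsm) purely to show why such a pattern accepts every input and how the first two fixes change that; the `-Fb` flag is not modelled. The variable names are chosen for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Assumed problem pattern: zero or more whitespace matches the empty string.
	star = regexp.MustCompile(`\s*`)
	// Fix 1: require at least one whitespace character.
	plus = regexp.MustCompile(`\s+`)
	// Fix 2: require the whole input to be whitespace.
	anchored = regexp.MustCompile(`^\s+$`)
)

func main() {
	for _, in := range []string{"abc", "a b", "   "} {
		fmt.Printf("%-6q star=%-5v plus=%-5v anchored=%v\n",
			in, star.MatchString(in), plus.MatchString(in), anchored.MatchString(in))
	}
}
```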

### Compilation Takes Too Long

-Likely caused by unrestricted wildcards (`.*`, `.+`). Fix with:
+This is often caused by unrestricted wildcards (`.*`, `.+`).
+Although they look compact, libfsm must enumerate every possible byte and every possible continuation, causing the state machine to grow quickly.
+
+For example, to match `var anything =`, a pattern such as `var\s.+=` looks simple, but `.+` forces libfsm to encode every possible byte
+and every possible continuation -- including both the presence and absence of `=`. This drastically increases the number of states.

-* Negated classes (`[^)]*`)
-* Bounded repeats (`{0,50}`)
-* Pattern splitting
+When compilation is slow, look for broad wildcards and replace them with more specific character classes (as shown [above](#writing-effective-libfsm-patterns)),