doc/GUIDE.md
Lines changed: 46 additions & 75 deletions
@@ -2,13 +2,13 @@
2
2
3
3
libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.
4
4
5
-
> Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6
-
> This means the same pattern can have different execution costs depending on the input.
7
-
>
8
-
> libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9
-
> At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10
-
>
11
-
> As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
5
+
Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6
+
This means the same pattern can have different execution costs depending on the input.
7
+
8
+
libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9
+
At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10
+
11
+
As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
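
To make the "single linear pass" idea concrete, here is a minimal sketch of table-driven DFA matching. The transition table is written by hand for the pattern `^ab*c\z` purely for illustration. It is not libfsm output, and the names `TRANSITIONS`, `ACCEPTING`, and `dfa_match` are hypothetical.

```python
# Hand-written transition table for the pattern ^ab*c\z.
# Illustrative only: this is NOT what libfsm generates.
TRANSITIONS = {
    (0, "a"): 1,  # start state: expect "a"
    (1, "b"): 1,  # loop on any number of "b"
    (1, "c"): 2,  # "c" moves to the accepting state
}
ACCEPTING = {2}


def dfa_match(text: str) -> bool:
    """Return True if text is accepted by the hand-built DFA."""
    state = 0
    for ch in text:  # exactly one step per input character, no backtracking
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING
```

Each input character is read once and the work per character is a single table lookup, so the running time depends only on the input length, not on its content.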
12
12
13
13
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
14
14
@@ -21,6 +21,7 @@ libfsm compiles regular expressions to deterministic finite state machines (FSMs
| High-level languages |**C (via `-l vmc`), Go, Rust**|
58
-
| Toolchains |**LLVM IR**|
59
-
| Virtualization |**Native WebAssembly**|
60
-
61
-
> Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).
62
-
63
-
---
55
+
Adding code generation for a new language is straightforward; the code generators are defined in [src/libfsm/print/](../src/libfsm/print/).
64
56
65
57
## Workflow Overview
66
58
@@ -70,11 +62,11 @@ libfsm provides two main tools:
70
62
71
63
A recommended workflow when using libfsm is:
72
64
73
-
### 1. Validate the Regex
65
+
1. Validate the Regex
74
66
75
67
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
|`-r`| Regex dialect |`pcre`, `literal`, `glob`, `native`, `sql`|`pcre` supports the widest set of features |
118
108
|`-l`| Output language for printing |`go`, `rust`, `vmc`, `llvm`, `wasm`, `dot`| Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
119
109
|`-k`| Generated function I/O API |`str`, `getc`, `pair`|`str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
120
110
|`-p`| Print mode |*(no value)*| Shorthand for `-l fsm`. Prints the constructed FSM rather than executing it. |
121
-
|`-d`| Declined | filename | Only applies to `rx` (batch mode) |
111
+
|`-d`| Declined patterns| filename | Only applies to `rx` (batch mode) |
122
112
123
113
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
124
-
The man pages can be built by running `bmake doc`, then view with `build/man/re.1/re.1`.
125
-
126
-
---
114
+
The man pages can be built by running `bmake -r doc` and then viewed at `build/man/re.1/re.1`.
127
115
128
116
## Writing Effective libfsm Patterns
129
117
130
-
### 1. Replace Broad Wildcards
118
+
1. Replace Broad Wildcards
119
+
120
+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. Although they look compact, libfsm must enumerate every possible byte and continuation, which quickly leads to large DFAs.
131
121
132
-
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise and forces libfsm to build a large DFA.
122
+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
133
123
134
-
For example, a double-quoted string should not use `".*?"` because the content cannot contain an unescaped quote.
135
-
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowd and will keep the DFA more compact.
124
+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
136
125
137
-
Use negated character classes:
126
+
Use negated character classes to match only the allowed content:
138
127
139
128
| Avoid | Better |
140
129
| ---------- | -------------- |
@@ -143,45 +132,46 @@ Use negated character classes:
143
132
|`price=.+`|`price=[0-9]+`|
144
133
|`var\s.+=`|`var\s[^=]+=`|
145
134
146
-
> This is often the cause of an “explosion” in the size of the generated FSM.
147
-
>
148
-
> See [Compilation Takes Too Long](#compilation-takes-too-long) for more details.
135
+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
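
The precision argument can be checked with any PCRE-style engine. The sketch below uses Python's `re` module as a stand-in (an assumption for illustration; it does not exercise libfsm and says nothing about DFA size). It shows only that the wildcard version matches more than a single quoted string.

```python
import re

TEXT = 'say "a" and "b"'

# Broad wildcard: the greedy .* runs from the first quote to the last one.
BROAD = re.compile(r'".*"')

# Negated class: the body may not contain a quote or a line break.
NARROW = re.compile(r'"[^"\r\n]*"')


def quoted_strings(pattern: re.Pattern, text: str) -> list:
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)
```

With the sample text, the broad pattern returns one match spanning both quoted strings, while the narrow pattern returns each quoted string separately.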
149
136
150
-
---
151
-
152
-
### 2. Anchor When Matching Full String
137
+
2. Anchor When Matching Full String
153
138
154
139
When the intention is to match an entire string, use anchors.
155
140
Use `^` at the beginning and `\z` for the true end of the string.
156
141
157
142
```regex
158
143
# Correct: matches only this exact hostname
144
+
# Matches "web12.example.com"
145
+
# Does not match "foo-web12.example.com-bar"
159
146
^web\d+\.example\.com\z
160
147
161
148
# Incorrect: would match inside a larger string
162
-
web\d+\.example\.com # also matches "foo-web12.example.com-bar"
149
+
# Matches "web12.example.com"
150
+
# Also matches "foo-web12.example.com-bar"
151
+
web\d+\.example\.com
163
152
```
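
The two cases above can be reproduced in a quick check before compiling with libfsm. This sketch uses Python's `re` as a stand-in for a PCRE-compatible tool (an assumption). Python spells the absolute end-of-string anchor `\Z`, so `\Z` replaces the `\z` used in the patterns above.

```python
import re

# \Z is Python's spelling of the PCRE \z anchor (true end of string).
ANCHORED = re.compile(r"^web\d+\.example\.com\Z")
UNANCHORED = re.compile(r"web\d+\.example\.com")


def matches(pattern: re.Pattern, s: str) -> bool:
    """Return True if pattern matches anywhere in s."""
    return pattern.search(s) is not None
```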
164
153
165
-
---
166
-
167
-
### 3. Prefer `\z` Over `$` for End-of-String
154
+
3. Prefer `\z` Over `$` for End-of-String
168
155
169
156
`\z` always matches the end of the string.
170
157
`$` will also match a trailing newline at the end of the string,
171
158
so if you use this in combination with capturing groups, you may not be capturing what you expect.
172
-
Also, `\z`is more efficient, so it is better to use it in places where `\n` cannot appear.
159
+
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
173
160
174
161
```regex
175
-
# Preferred
176
-
/foo\z
177
-
178
-
# Risky: $ may allow an extra newline
179
-
/foo$
162
+
# Preferred: matches only if the string ends with "bar"
163
+
# Matches "/foo/bar"
164
+
# Does NOT match "/foo/bar\n"
165
+
/bar\z
166
+
167
+
# Incorrect: allows a trailing newline,
168
+
# which is usually unintended and adds unnecessary complexity
169
+
# Matches "/foo/bar"
170
+
# Also matches "/foo/bar\n"
171
+
/bar$
180
172
```
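
The trailing-newline difference is easy to verify. This sketch again uses Python's `re` as a stand-in for a PCRE-style engine (an assumption), with `\Z` standing in for `\z`. It checks matching behavior only, not FSM size.

```python
import re


def ends_with_bar_dollar(s: str) -> bool:
    """$ also matches just before a single trailing newline."""
    return re.search(r"/bar$", s) is not None


def ends_with_bar_z(s: str) -> bool:
    """\\Z (PCRE \\z) matches only at the true end of the string."""
    return re.search(r"/bar\Z", s) is not None
```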
181
173
182
-
---
183
-
184
-
### 4. Escape Special Characters When Used As Literal
174
+
4. Escape Special Characters When Used As Literal
185
175
186
176
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
187
177
If you mean to match them literally, escape them:
@@ -195,9 +185,7 @@ If you mean to match them literally, escape them:
195
185
|`(test)`|`\(test\)`|`(` and `)` begin/end a group |
196
186
| Markdown link `[t](u)`|`(\[[^]]*\]\([^)]*\))`| Matches `[text](url)` without crossing `]` or `)`|
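
When a literal comes from data rather than being typed by hand, escaping can be automated. The sketch below uses Python's `re.escape` as one way to do this (an assumption; it may escape more characters than PCRE strictly requires, so review the result before passing it to libfsm). The helper name `literal_pattern` is hypothetical.

```python
import re


def literal_pattern(text: str) -> str:
    """Escape every regex metacharacter in text so it matches literally."""
    return re.escape(text)


# Unescaped parentheses form a group, so this matches "test".
UNESCAPED = re.compile(r"(test)")

# Escaped parentheses are literal, so this matches "(test)".
ESCAPED = re.compile(literal_pattern("(test)"))
```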
197
187
198
-
---
199
-
200
-
### 5. Use Non-Capturing Groups
188
+
5. Use Non-Capturing Groups
201
189
202
190
Capture groups are _currently_ not supported (coming soon!).
203
191
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
@@ -210,8 +198,6 @@ If you need grouping for alternation or precedence, use non-capturing syntax `(?
210
198
(private|no-store)
211
199
```
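
A quick way to confirm that a pattern contains no capture groups is to count them. This sketch uses the `groups` attribute of a compiled pattern in Python's `re` (a stand-in check, not a libfsm feature).

```python
import re

CAPTURING = re.compile(r"(private|no-store)")
NON_CAPTURING = re.compile(r"(?:private|no-store)")


def capture_group_count(pattern: re.Pattern) -> int:
    """Return the number of capture groups in a compiled pattern."""
    return pattern.groups
```

Both patterns accept the same strings; only the non-capturing form reports zero groups.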
212
200
213
-
---
214
-
215
201
## Byte Search Optimization (Optional)
216
202
217
203
Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
@@ -231,11 +217,7 @@ These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte
231
217
232
218
We found using `strings.IndexByte` before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
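
The same idea, sketched in Python for illustration: `bytes.find` plays the role of Go's `strings.IndexByte`, and a compiled `re` pattern stands in for the generated matcher. Both the `HEX_COLOR` pattern and the `find_hex_color` helper are hypothetical.

```python
import re

# Stand-in for a generated matcher (hypothetical): a "#"-prefixed hex color.
HEX_COLOR = re.compile(rb"#[0-9a-fA-F]{6}")


def find_hex_color(data: bytes) -> int:
    """Return the offset of the first hex color in data, or -1 if none."""
    pos = data.find(b"#")  # fast scan for the rare first byte
    while pos != -1:
        if HEX_COLOR.match(data, pos):  # run the matcher only at candidates
            return pos
        pos = data.find(b"#", pos + 1)  # skip ahead to the next candidate
    return -1
```

The matcher runs only at offsets where the rare leading byte occurs, so a long prefix without that byte is skipped by the byte scan alone.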
233
219
234
-
---
235
-
236
-
## Troubleshooting
237
-
238
-
### Pattern Matches Empty String Unintentionally
220
+
## Pattern Matches Empty String Unintentionally
239
221
240
222
Pattern:
241
223
@@ -251,14 +233,3 @@ This is only an issue if that is not what you intend.
251
233
252
234
* Require at least one match: `\s+`
253
235
* Anchor context: `^\s+$` or, alternatively, use the `-Fb` flag
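
The effect can be reproduced with a small check. This sketch assumes a pattern such as `\s*`, which can match zero characters, and uses Python's `re` as a stand-in for a PCRE-style engine. The helper `first_match` is hypothetical.

```python
import re


def first_match(pattern: str, s: str):
    """Return the text of the first match of pattern in s, or None."""
    m = re.search(pattern, s)
    return None if m is None else m.group()
```

With `\s*`, the search succeeds on `"abc"` by matching the empty string at offset 0. With `\s+`, there is no match until real whitespace appears.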
254
-
255
-
### Compilation Takes Too Long
256
-
257
-
This is often caused by unrestricted wildcards (`.*`, `.+`).
258
-
Although they look compact, libfsm must enumerate every possible byte and every possible continuation, causing the state machine to grow quickly.
259
-
260
-
For example, to match `var anything =`, a pattern such as `var\s.+=` looks simple, but `.+` forces libfsm to encode every possible byte
261
-
and every possible continuation -- including both the presence and absence of `=`. This drastically increases the number of states.
262
-
263
-
When compilation is slow, look for broad wildcards and replace them with more specific character classes (as shown [above](#writing-effective-libfsm-patterns)),