Commit d811399

Update docs based on reviews
1 parent f2a661f commit d811399

1 file changed

+116
-39
lines changed

doc/GUIDE.md

Lines changed: 116 additions & 39 deletions
@@ -2,15 +2,33 @@
22

33
libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.
44

5+
> Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6+
> This means the same pattern can have different execution costs depending on the input.
7+
>
8+
> libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9+
> At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10+
>
11+
> As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
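
To make the "single linear pass" idea concrete, here is a minimal hand-written sketch of a DFA for the pattern `^ab*c\z`. It is an illustration only: the function name `matchABC` and the state numbering are assumptions for this example and do not reflect the code libfsm actually generates.

```go
package main

import "fmt"

// matchABC is a hand-written DFA for the pattern ^ab*c\z.
// Illustrative only; this is not libfsm output.
func matchABC(s string) bool {
	// States: 0 = start, 1 = seen 'a' plus zero or more 'b', 2 = seen final 'c'.
	state := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case state == 0 && c == 'a':
			state = 1
		case state == 1 && c == 'b':
			state = 1
		case state == 1 && c == 'c':
			state = 2
		default:
			// No transition exists: reject immediately, with nothing to backtrack to.
			return false
		}
	}
	// Accept only if the input ended in the accepting state.
	return state == 2
}

func main() {
	fmt.Println(matchABC("abbbc"), matchABC("ac"), matchABC("abcb"))
}
```

Each input byte is examined exactly once and the only memory is the current state, which is why the cost depends on input length and not on input content.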
12+
513
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
614

15+
### **Topics**
16+
17+
- [What libfsm Cannot Do](#what-libfsm-cannot-do)
18+
- [Quick Start](#quick-start)
19+
- [Supported Code Generation Targets](#supported-code-generation-targets)
20+
- [Workflow Overview](#workflow-overview)
21+
- [Writing Effective libfsm Patterns](#writing-effective-libfsm-patterns)
22+
- [Byte Search Optimization (Optional)](#byte-search-optimization-optional)
23+
- [Troubleshooting](#troubleshooting)
24+
725
## What libfsm Cannot Do
826

927
These PCRE features will not compile:
1028

1129
* Word boundaries (`\b`)
1230
* Non-greedy quantifiers (`*?`, `+?`, `??`)
13-
* Group capture and backreferences
31+
* Group capture (coming soon!) and backreferences
1432
* Lookahead/lookbehind assertions (`(?=`, `(?!`, `(?<=`, `(?<!`)
1533
* Conditional expressions (`(?(condition)then|else)`)
1634
* Recursion and subroutines (`(?R)`, `(?1)`)
@@ -40,23 +58,28 @@ libfsm provides stable, “first-class” code generation for:
4058
| Toolchains | **LLVM IR** |
4159
| Virtualization | **Native WebAssembly** |
4260

43-
> Adding code generation for new languages is template-driven and straightforward.
61+
> Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).
4462
4563
---
4664

4765
## Workflow Overview
4866

49-
libfsm provides two main tools: **`re`** takes patterns from command line, **`rx`** takes patterns from file.
67+
libfsm provides two main tools:
68+
- **`re`** takes patterns from the command line
69+
- **`rx`** takes patterns from a file
70+
71+
A recommended workflow when using libfsm is:
5072

5173
### 1. Validate the Regex
5274

53-
Test behavior using any PCRE-compatible tool (e.g., [https://regex101.com/](https://regex101.com/)).
75+
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
5476

5577
### 2. Verify libfsm Compatibility
5678

5779
```bash
5880
re -r pcre -l ast 'x*?'
5981
# Output: /x*?/:3: Unsupported operator
82+
# :3 indicates that the character at offset 3 in the pattern is rejected.
6083

6184
rx -r pcre -l ast -d declined.txt 'x*?'
6285
# Unsupported character in declined.txt
@@ -67,7 +90,7 @@ If unsupported constructs exist, libfsm reports the failing location.
6790
### 3. Generate Code
6891

6992
```bash
70-
re -p -r pcre -l rust -k str '^item-[A-Z]{3}$' > item_detector.rs
93+
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
7194
```
7295

7396
### 4. Multiple Patterns
@@ -89,63 +112,111 @@ Both tools:
89112
---
90113

91114
### Flag Reference
92-
| Flag | Purpose | Common Options | Notes |
93-
| ---- | --------------------------- | ------------------------------------------ | ------------------------------------------ |
94-
| `-r` | Select regex dialect | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features |
95-
| `-l` | Choose output language | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code, pipe `dot` into `idot` for visualization |
96-
| `-k` | Generated function I/O API | `str`, `getc`, `pair` | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
97-
| `-p` | Production mode | *(no value)* | Generates optimized code |
98-
| `-d` | Output unsupported patterns | filename | Only applies to `rx` (batch mode) |
115+
| Flag | Purpose | Common Options | Notes |
116+
| ---- | ---------------------------- | ------------------------------------------ | ---------------------------------------------------------------- |
117+
| `-r` | Regex dialect | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features |
118+
| `-l` | Output language for printing | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
119+
| `-k` | Generated function I/O API | `str`, `getc`, `pair` | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
120+
| `-p` | Print mode                   | *(no value)*                               | Abbreviation of `-l fsm`. Prints the constructed FSM rather than executing it. |
121+
| `-d` | Declined | filename | Only applies to `rx` (batch mode) |
99122

100-
For more detailed information on flags, see [include/fsm/options.h](../include/fsm/options.h) and the man pages (by running `build/man/re.1/re.1` after `bmake doc`).
123+
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
124+
The man pages can be built by running `bmake doc` and then viewed with `build/man/re.1/re.1`.
101125

102126
---
103127

104128
## Writing Effective libfsm Patterns
105129

106-
For additional regex best practices, see [Fastly's regex guide](https://www.fastly.com/documentation/reference/vcl/regex/#best-practices-and-common-mistakes).
107-
108130
### 1. Replace Broad Wildcards
109131

110-
Avoid `.*` whenever possible. Use negated character classes:
132+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise and forces libfsm to build a large DFA.
133+
134+
For example, a double-quoted string should not use `".*?"` because the content cannot contain an unescaped quote.
135+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
136+
137+
Use negated character classes:
111138

112139
| Avoid | Better |
113140
| ---------- | -------------- |
114141
| `<.*>` | `<[^>]*>` |
115142
| `\((.*)\)` | `\([^)]*\)` |
116-
| `price=.*` | `price=[0-9]+` |
143+
| `price=.+` | `price=[0-9]+` |
144+
| `var\s.+=` | `var\s[^=]+=` |
145+
146+
> This is often the cause of an “explosion” in the size of the generated FSM.
147+
>
148+
> See [Compilation Takes Too Long](#compilation-takes-too-long) for more details.
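
The imprecision of a broad wildcard can be seen with any regex engine. The sketch below uses Go's standard `regexp` package as a stand-in (it is not libfsm, and the helper name `firstMatch` is an assumption for this example) to compare what `<.*>` and `<[^>]*>` accept on the same input.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// broad uses a wildcard, so it can run past the first closing '>'.
	broad = regexp.MustCompile(`<.*>`)
	// narrow uses a negated class, so it stops at the first closing '>'.
	narrow = regexp.MustCompile(`<[^>]*>`)
)

// firstMatch returns the leftmost match of re in s, or "" if there is none.
func firstMatch(re *regexp.Regexp, s string) string {
	return re.FindString(s)
}

func main() {
	in := "<b>bold</b> text"
	fmt.Println(firstMatch(broad, in))  // <b>bold</b>
	fmt.Println(firstMatch(narrow, in)) // <b>
}
```

The negated class states the intent directly (a tag ends at the first `>`), which is the same property that keeps the resulting automaton small.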
117149
118150
---
119151

120-
### 2. Anchor When You Require Full Matches
152+
### 2. Anchor When Matching Full String
121153

122-
FSMs only do what’s specified. Explicitly anchor when matching entire strings:
154+
When the intention is to match an entire string, use anchors.
155+
Use `^` at the beginning and `\z` for the true end of the string.
123156

124157
```regex
125-
^task-[a-z]+-[0-9]{2}\z
158+
# Correct: matches only this exact hostname
159+
^web\d+\.example\.com\z 
160+
161+
# Incorrect: would match inside a larger string
162+
web\d+\.example\.com # also matches "foo-web12.example.com-bar"
126163
```
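
The difference between the two patterns above can be checked quickly before compiling with libfsm. This sketch uses Go's standard `regexp` package, which also accepts `^`, `\d`, and `\z`, purely as a stand-in; the variable names are assumptions for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// anchored must match the whole input.
	anchored = regexp.MustCompile(`^web\d+\.example\.com\z`)
	// unanchored may match anywhere inside the input.
	unanchored = regexp.MustCompile(`web\d+\.example\.com`)
)

func main() {
	exact := "web12.example.com"
	embedded := "foo-web12.example.com-bar"

	fmt.Println(anchored.MatchString(exact))      // true
	fmt.Println(anchored.MatchString(embedded))   // false
	fmt.Println(unanchored.MatchString(embedded)) // true
}
```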
127164

128-
Use `\z` for end-of-string.
165+
---
166+
167+
### 3. Prefer `\z` Over `$` for End-of-String
168+
169+
`\z` always matches the end of the string.
170+
`$` also matches just before a trailing newline at the end of the string,
171+
so if you use this in combination with capturing groups, you may not be capturing what you expect.
172+
Also, `\z` is more efficient, so it is better to use it in places where `\n` cannot appear.
173+
174+
```regex
175+
# Preferred
176+
/foo\z
177+
178+
# Risky: $ may allow an extra newline
179+
/foo$
180+
```
129181

130182
---
131183

132-
## Byte Search Optimization (Optional)
184+
### 4. Escape Special Characters When Used as Literals
133185

134-
Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
135-
This quickly jumps to likely match positions instead of scanning every byte.
186+
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
187+
If you mean to match them literally, escape them:
136188

137-
### Common fast byte search APIs
189+
```regex
190+
# Correct: dot is treated literally
191+
example\.com
192+
193+
# Incorrect: dot matches any character
194+
example.com # also matches "exampleXcom"
195+
```
196+
197+
---
198+
199+
### 5. Use Non-Capturing Groups
200+
201+
Capture groups are _currently_ not supported (coming soon!).
202+
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
203+
204+
```regex
205+
# Correct
206+
(?:private|no-store)
138207
139-
| Language | Function |
140-
| -------- | -------------------------- |
141-
| Go | `strings.IndexByte` |
142-
| Rust | `memchr::memchr` |
143-
| C | `memchr` from `<string.h>` |
208+
# Unsupported
209+
(private|no-store)
210+
```
144211

212+
---
145213

146-
### Good candidates
214+
## Byte Search Optimization (Optional)
147215

148-
Patterns that always start with uncommon prefix characters, for example:
216+
Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
217+
This quickly jumps to likely match positions instead of scanning every byte.
218+
219+
Good candidates are patterns that start with uncommon prefix characters, for example:
149220

150221
```
151222
#tag-[a-z]+
@@ -155,7 +226,9 @@ Patterns that always start with uncommon prefix characters, for example:
155226
"name='[^']+'"
156227
```
157228

158-
These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) rarely appear in normal text, making a byte search highly effective.
229+
These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte search can skip ahead before running the matcher.
230+
231+
We found that calling `strings.IndexByte` before the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
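
A minimal sketch of that approach follows. The function `matchAt` stands in for a libfsm-generated matcher and is implemented here with Go's standard `regexp` package so the example runs on its own; the names `matchAt` and `containsTag` and the pattern `^#tag-[a-z]+` are assumptions for illustration, and real generated code will have its own name and signature.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe backs the stand-in matcher below. Hypothetical: in real use this
// would be replaced by the function libfsm generates for the pattern.
var tagRe = regexp.MustCompile(`^#tag-[a-z]+`)

// matchAt reports whether s begins with a match.
func matchAt(s string) bool {
	return tagRe.MatchString(s)
}

// containsTag reports whether s contains a match anywhere. It uses
// strings.IndexByte to jump to each '#' instead of trying every offset.
func containsTag(s string) bool {
	for off := 0; off < len(s); {
		i := strings.IndexByte(s[off:], '#')
		if i < 0 {
			return false // no candidate start byte remains
		}
		off += i
		if matchAt(s[off:]) {
			return true
		}
		off++ // this '#' did not start a match; continue after it
	}
	return false
}

func main() {
	long := strings.Repeat("x", 5000) + "#tag-abc"
	fmt.Println(containsTag(long), containsTag("no tags # here"))
}
```

The byte scan only pays off when the first byte of every match is fixed and rare, so it is worth measuring on representative input before adopting it.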
159232

160233
---
161234

@@ -171,16 +244,20 @@ Pattern:
171244

172245
Will compile to code that always returns true.
173246

247+
This is only an issue if that is not what you intend.
248+
174249
**Fix options:**
175250

176251
* Require at least one match: `\s+`
177-
* Anchor context: `^\s+$`
178-
* Or alternatively, use `-Fb` flag
252+
* Anchor context: `^\s+$`, or alternatively use the `-Fb` flag
179253

180254
### Compilation Takes Too Long
181255

182-
Likely caused by unrestricted wildcards (`.*`, `.+`). Fix with:
256+
This is often caused by unrestricted wildcards (`.*`, `.+`).
257+
Although they look compact, libfsm must enumerate every possible byte and every possible continuation, causing the state machine to grow quickly.
258+
259+
For example, to match `var anything =`, a pattern such as `var\s.+=` looks simple, but `.+` forces libfsm to encode every possible byte
260+
and every possible continuation -- including both the presence and absence of `=`. This drastically increases the number of states.
183261

184-
* Negated classes (`[^)]*`)
185-
* Bounded repeats (`{0,50}`)
186-
* Pattern splitting
262+
When compilation is slow, look for broad wildcards and replace them with more specific character classes (as shown [above](#writing-effective-libfsm-patterns)),
263+
such as: `var\s[^=]+=`.
