Commit 27802dc

Merge pull request #515 from katef/gsusanto-docs
Advice on how to use libfsm for generating performant pattern matchers
2 parents 504b1d0 + d8ab92e commit 27802dc

2 files changed

+279
-0
lines changed

README.md

Lines changed: 3 additions & 0 deletions
@@ -4,12 +4,15 @@
44
; re -cb -pl dot '[Ll]ibf+(sm)*' '[Ll]ibre' | dot
55
![libfsm.svg](doc/tutorial/libfsm.svg)
66

7+
libfsm is not a drop-in replacement for other regex engines, and it only supports patterns that can be compiled to deterministic FSMs. In return, supported patterns run in linear time.
8+
79
Getting started:
810

911
* See the [tutorial introduction](doc/tutorial/re.md) for a quick overview
1012
of the re(1) command line interface.
1113
* [Compilation phases](doc/tutorial/phases.md) for typical applications
1214
which compile regular expressions to code.
15+
* [Advice on using libfsm](doc/advice.md) for suggestions around compilation time, unsupported features, common usage patterns, and examples.
1316

1417
You get:
1518

doc/advice.md

Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
1+
# Advice on using libfsm for high-performance pattern matching
2+
3+
libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.
4+
5+
Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6+
This means the same pattern can have different execution costs depending on the input.
7+
8+
libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9+
At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10+
11+
As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
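To make the "single linear pass" concrete, here is a minimal hand-written sketch of a DFA for the pattern `^ab*c\z`. It is an illustration only: the function name and state names are invented for this example, and it is not what libfsm emits. The point is that each input byte triggers exactly one transition and the input is never revisited.

```go
package main

import "fmt"

// matchABC is a hand-written DFA for the pattern ^ab*c\z.
// It is an illustration only, not output produced by libfsm.
func matchABC(s string) bool {
	const (
		start = iota // nothing consumed yet
		seenA        // consumed "a" followed by zero or more "b"
		seenC        // consumed the final "c"; the only accepting state
	)
	state := start
	for i := 0; i < len(s); i++ { // one pass, one transition per byte
		switch {
		case state == start && s[i] == 'a':
			state = seenA
		case state == seenA && s[i] == 'b':
			state = seenA
		case state == seenA && s[i] == 'c':
			state = seenC
		default:
			return false // no transition: reject without revisiting input
		}
	}
	return state == seenC
}

func main() {
	for _, s := range []string{"ac", "abbbc", "abca", "bc"} {
		fmt.Printf("%q -> %v\n", s, matchABC(s))
	}
}
```

Because the loop body does a constant amount of work per byte, the running time depends only on the input length, whatever the input contains.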
12+
13+
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
14+
15+
### **Topics**
16+
17+
- [What libfsm Cannot Do](#what-libfsm-cannot-do)
18+
- [Quick Start](#quick-start)
19+
- [Supported Code Generation Targets](#supported-code-generation-targets)
20+
- [Workflow Overview](#workflow-overview)
21+
- [Writing Effective libfsm Patterns](#writing-effective-libfsm-patterns)
22+
- [Byte Search Optimization](#byte-search-optimization)
23+
- [Troubleshooting](#troubleshooting)
24+
- [Pattern Matches Empty String Unintentionally](#pattern-matches-empty-string-unintentionally)
25+
26+
## What libfsm Cannot Do
27+
28+
These PCRE features will not compile:
29+
30+
* Word boundaries (`\b`)
31+
* Non-greedy quantifiers (`*?`, `+?`, `??`)
32+
* Group capture (coming soon!) and backreferences
33+
* Lookahead/lookbehind assertions (`(?=`, `(?!`, `(?<=`, `(?<!`)
34+
* Conditional expressions (`(?(condition)then|else)`)
35+
* Recursion and subroutines (`(?R)`, `(?1)`)
36+
37+
## Quick Start
38+
39+
Generate a matcher from a regex:
40+
41+
```sh
# Generate a Go matcher
re -p -r pcre -l go -k str 'user\d+' > user_detector.go
```
45+
46+
This produces a standalone matcher function.
47+
48+
## Supported Code Generation Targets
49+
50+
libfsm provides stable, “first-class” code generation for:
51+
- High-level languages: C (via `-l vmc`), Go, Rust
52+
- LLVM IR
53+
- Native WebAssembly
54+
55+
The code generators live in [src/libfsm/print/](../src/libfsm/print/), and adding one for a new language is straightforward.
56+
57+
## Workflow Overview
58+
59+
libfsm provides two main tools for pattern matching:
60+
- **`re`** takes patterns from the command line
61+
- **`rx`** takes patterns from a file
62+
63+
A recommended workflow when using libfsm is:
64+
65+
1. Validate the regex
66+
67+
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
68+
69+
2. Verify libfsm compatibility
70+
71+
If unsupported constructs exist, libfsm reports the failing location:
72+
```sh
re -r pcre -l ast 'x*?'
# Output: /x*?/:3: Unsupported operator
```
76+
In this example, `:3` is the byte offset within the pattern at which libfsm encountered the unsupported feature.
77+
78+
```sh
# patterns with unsupported operators are output to declined.txt
rx -r pcre -l ast -d declined.txt 'x*?'
```
82+
83+
84+
3. Generate code
85+
86+
```sh
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
```
89+
90+
4. Use multiple patterns
91+
92+
Execution time for the generated code is proportional to the length of the text being matched, not to the number of patterns.
93+
Assuming your generated code isn't too large to compile, this means you can have as many patterns as you want,
94+
for the same time it takes to execute a single pattern.
95+
96+
Take advantage of this.
97+
98+
```sh
# re - patterns from command line:
re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$'

# rx - patterns from file:
rx -p -r pcre -l vmc -k str -d skipped.txt patterns.txt > detectors.c
```
105+
106+
5. Call the generated code from your program
107+
108+
You're on your own for this. `-k` controls the API through which the generated code reads the data to match. Try the different options for the language you're using and see which suits you.
109+
110+
The generated API can also vary depending on how you want libfsm to handle ambiguities between different patterns. See the `AMBIG_*` flags in [include/fsm/options.h](../include/fsm/options.h) for different approaches there.
111+
112+
Both tools:
113+
* Combine all patterns into one function (like using `|` to join them)
114+
* Generate code that can return `(bool, int)` for the match status and pattern ID
115+
* Pattern ID is argument position for `re`, line number for `rx`
116+
* When encountering unsupported patterns: `rx` can write them to the `-d` file and generate code for the remaining patterns; `re` fails completely
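As a rough picture of what "one function, many patterns" means, the sketch below hand-builds a combined DFA for three hypothetical anchored patterns and returns a match status plus a pattern ID. Everything here is an assumption made for illustration: the patterns, the `matchAny` name, the `-1` sentinel and the ID numbering are invented, and the signature libfsm actually generates depends on `-l`, `-k` and the `AMBIG_*` options.

```go
package main

import "fmt"

// Hypothetical example patterns, identified by their position in this list:
//   0: ^cat\z    1: ^car\z    2: ^dog\z

type edge struct {
	state int
	b     byte
}

// transitions is a hand-built combined DFA; "cat" and "car" share the states for "ca".
var transitions = map[edge]int{
	{0, 'c'}: 1, {1, 'a'}: 2, {2, 't'}: 3, {2, 'r'}: 4,
	{0, 'd'}: 5, {5, 'o'}: 6, {6, 'g'}: 7,
}

// accepting maps each accepting state to the ID of the pattern it completes.
var accepting = map[int]int{3: 0, 4: 1, 7: 2}

// matchAny is an illustrative stand-in, not the signature libfsm generates.
func matchAny(s string) (bool, int) {
	state := 0
	for i := 0; i < len(s); i++ { // still one pass, however many patterns exist
		next, ok := transitions[edge{state, s[i]}]
		if !ok {
			return false, -1
		}
		state = next
	}
	id, ok := accepting[state]
	if !ok {
		return false, -1
	}
	return true, id
}

func main() {
	for _, s := range []string{"cat", "car", "dog", "ca", "cats"} {
		ok, id := matchAny(s)
		fmt.Printf("%q -> matched=%v id=%d\n", s, ok, id)
	}
}
```

Adding a fourth pattern would add states and edges, but the loop would still visit each input byte once.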
117+
118+
### Common Flags
119+
120+
| Flag | Purpose                      | Common Options                             | Notes                                                            |
|:----:|:---------------------------- |:------------------------------------------ |:---------------------------------------------------------------- |
| `-r` | Regex dialect                | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features                       |
| `-l` | Output language for printing | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
| `-k` | Generated function I/O API   | `str`, `getc`, `pair`                      | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
| `-p` | Print mode                   | *(no value)*                               | Abbrv. of `-l fsm`. Print the constructed fsm, rather than executing it. |
| `-d` | Declined patterns            | filename                                   | Only applies to `rx` (batch mode)                                |
127+
128+
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
129+
The man pages can be built by running `bmake -r doc`, then viewed with `man build/man/re.1/re.1`.
130+
131+
## Writing Effective libfsm Patterns
132+
133+
Generally, to keep generated code compact, stick to the least expressive subset of features.
134+
135+
libfsm has no way to know in advance what text you'll be passing to its generated code.
136+
For example, are you matching a string that you know will never contain a newline?
137+
libfsm doesn't know that.
138+
It has to generate code that's capable of handling any input.
139+
You can help it out by making your patterns precise.
140+
141+
Think about what you intend your pattern to match, and what it's actually capable of matching given arbitrary text.
142+
This helps restrict the scope of your pattern from arbitrary text to exactly what you mean.
143+
The following bits of advice illustrate various specific ways to bring down this scope.
144+
145+
1. Replace broad wildcards
146+
147+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
148+
149+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
150+
151+
Instead, restrict it to the characters that are actually valid, as in `"[^"\r\n]*"`, which matches only what is allowed and keeps the DFA more compact.
152+
153+
Use negated character classes to match only the allowed content:
154+
155+
| Avoid      | Better         |
| ---------- | -------------- |
| `<.*>`     | `<[^>]*>`      |
| `\((.*)\)` | `\([^)]*\)`    |
| `price=.+` | `price=[0-9]+` |
| `var\s.+=` | `var\s[^=]+=`  |
161+
162+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
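The difference in what the two quoted-string patterns accept can be checked with a short Go program. This sketch uses Go's standard `regexp` package purely to show matching behaviour; it is not libfsm, and the input string is made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	broad  = regexp.MustCompile(`".*"`)         // wildcard between the quotes
	narrow = regexp.MustCompile(`"[^"\r\n]*"`)  // only characters allowed inside the string
)

func main() {
	input := `say "a" and "b"`
	fmt.Println(broad.FindString(input))  // runs from the first quote to the last one
	fmt.Println(narrow.FindString(input)) // stops at the first closing quote
}
```

The broad pattern swallows everything between the first and last quote, which is rarely what a "quoted string" pattern is meant to do.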
163+
164+
2. Take care with bounded repetition
165+
166+
If you have the pattern `^x{3,5}$`, libfsm's resulting DFA is structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". It has to repeat the subexpression once per count, noting each time whether that copy is required or optional (beyond the lower count in `{min,max}`), because DFA execution has no counter, only the current state within the overall DFA.
167+
168+
When the subexpression (represented by `x`) unintentionally matches too many things, they all have to be spelled out every time.
169+
So pay especially close attention to tightening up subexpressions in bounded repetition clauses.
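The unrolling described above can be sketched as a string rewrite. The `unroll` and `accepts` helpers below are hypothetical names invented for this example, and Go's standard `regexp` package is used only to confirm that the unrolled form accepts the same strings; libfsm performs the equivalent expansion internally on its own representation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// unroll spells out sub{lo,hi} the way a counter-free DFA must represent it:
// lo required copies followed by hi-lo optional copies.
func unroll(sub string, lo, hi int) string {
	group := "(?:" + sub + ")"
	return strings.Repeat(group, lo) + strings.Repeat(group+"?", hi-lo)
}

// accepts reports whether pattern matches the whole of s.
func accepts(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)\z`).MatchString(s)
}

func main() {
	unrolled := unroll("x", 3, 5)
	fmt.Println(unrolled)
	for n := 2; n <= 6; n++ {
		s := strings.Repeat("x", n)
		fmt.Println(n, accepts("x{3,5}", s), accepts(unrolled, s))
	}
}
```

The unrolled pattern contains five copies of the subexpression, so anything loose inside `sub` is paid for five times.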
170+
171+
3. Anchor when matching the full string
172+
173+
When the intention is to match an entire string, use anchors.
174+
Use `^` at the beginning and `\z` for the true end of the string.
175+
176+
```regex
# Correct: matches only this exact hostname
# Matches "web12.example.com"
# Does not match "foo-web12.example.com-bar"
^web\d+\.example\.com\z

# Incorrect: would match inside a larger string
# Matches "web12.example.com"
# Also matches "foo-web12.example.com-bar"
web\d+\.example\.com
```
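The two hostname patterns above can be turned into a quick positive and negative test. This sketch uses Go's standard `regexp` package, which accepts both patterns as written; it only demonstrates the matching behaviour and is not libfsm-generated code.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	anchored   = regexp.MustCompile(`^web\d+\.example\.com\z`)
	unanchored = regexp.MustCompile(`web\d+\.example\.com`)
)

func main() {
	for _, host := range []string{"web12.example.com", "foo-web12.example.com-bar"} {
		fmt.Printf("%-28q anchored=%-5v unanchored=%v\n",
			host, anchored.MatchString(host), unanchored.MatchString(host))
	}
}
```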
187+
188+
4. Prefer `\z` over `$` for End-of-String
189+
190+
`\z` always matches the end of the string.
191+
`$` also matches just before a trailing newline at the end of the string,
192+
so if you use this in combination with capturing groups, you may not be capturing what you expect.
193+
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
194+
195+
```regex
# Preferred: matches only if the string ends with "bar"
# Matches "/foo/bar"
# Does NOT match "/foo/bar\n"
/bar\z

# Incorrect: allows a trailing newline,
# which is usually unintended and adds unnecessary complexity
# Matches "/foo/bar"
# Also matches "/foo/bar\n"
/bar$
```
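One way to see the extra work `$` implies is to spell it out. In PCRE's default mode, `$` matches at the very end of the string or just before a final newline, so `/bar$` accepts the same strings as `/bar\n?\z`. The sketch below uses Go's standard `regexp` package, where an unflagged `$` behaves like `\z`, so the PCRE behaviour is written out explicitly. The variable names are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	strictEnd  = regexp.MustCompile(`/bar\z`)
	pcreDollar = regexp.MustCompile(`/bar\n?\z`) // spelled-out equivalent of PCRE's /bar$
)

func main() {
	for _, s := range []string{"/foo/bar", "/foo/bar\n", "/foo/bar\n\n"} {
		fmt.Printf("%-16q \\z=%-5v $=%v\n", s, strictEnd.MatchString(s), pcreDollar.MatchString(s))
	}
}
```

The optional `\n` is the additional path that the `$` version has to carry in its FSM.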
207+
208+
5. Escape special characters when used as literals
209+
210+
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
211+
If you mean to match them literally, escape them:
212+
213+
| Literal You Want           | Correct Regex               | Explanation                                |
|----------------------------|-----------------------------|--------------------------------------------|
| `example.com`              | `example\.com`              | `.` matches any character unless escaped   |
| `a+b`                      | `a\+b`                      | `+` means “one or more”                    |
| `price?`                   | `price\?`                   | `?` means “optional”                       |
| `[value]`                  | `\[value\]`                 | `[` and `]` start/end a character class    |
| `(test)`                   | `\(test\)`                  | `(` and `)` begin/end a group              |
| Markdown link `[t](u)`     | `(\[[^]]*\]\([^)]*\))`      | Matches `[text](url)` without crossing `]` or `)` |
221+
222+
The `.` wildcard in particular is often mistakenly left unescaped in practice.
223+
In testing, it will match a literal `.` as intended. But it will also match any other character.
224+
This means that not only is your pattern incorrect (write negative test cases!),
225+
but also this part of your FSM is 256 times larger than it should be.
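If patterns are assembled in Go before being handed to `re` or `rx`, literals can be escaped programmatically and checked with a negative test. The sketch below relies on `regexp.QuoteMeta` from Go's standard library, which targets Go's own regex syntax; for the literals in the table above its output is also the PCRE escaping. The helper names `escapeLiteral` and `matchesWhole` are invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// escapeLiteral returns lit with regex metacharacters backslash-escaped.
func escapeLiteral(lit string) string {
	return regexp.QuoteMeta(lit)
}

// matchesWhole reports whether pattern matches the whole of s.
func matchesWhole(pattern, s string) bool {
	return regexp.MustCompile(`^(?:` + pattern + `)\z`).MatchString(s)
}

func main() {
	for _, lit := range []string{"example.com", "a+b", "price?", "[value]", "(test)"} {
		fmt.Printf("%-12s -> %s\n", lit, escapeLiteral(lit))
	}
	// Negative test: the unescaped dot accepts a string it should not.
	fmt.Println(matchesWhole("example.com", "exampleXcom"))
	fmt.Println(matchesWhole(escapeLiteral("example.com"), "exampleXcom"))
}
```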
226+
227+
6. Use non-capturing groups
228+
229+
Capture groups are _currently_ not supported (coming soon!).
230+
231+
If you don't need to capture things, don't use capture.
232+
If you need grouping for alternation or precedence, use PCRE's non-capturing syntax `(?:...)`:
233+
234+
```regex
# Correct
(?:private|no-store)

# Not what's intended
(private|no-store)
```
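A quick way to audit a pattern for accidental capture groups is to count them. This sketch uses `NumSubexp` from Go's standard `regexp` package as a stand-in checker; the `captureCount` helper is a name invented for this example, and the check says nothing about how libfsm itself treats the group.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureCount returns the number of capturing groups in pattern.
func captureCount(pattern string) int {
	return regexp.MustCompile(pattern).NumSubexp()
}

func main() {
	for _, p := range []string{`(?:private|no-store)`, `(private|no-store)`} {
		fmt.Printf("%-24s capture groups: %d\n", p, captureCount(p))
	}
}
```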
241+
242+
## Byte Search Optimization
243+
244+
Patterns that start with an uncommon character can be accelerated using an initial byte scan before running the FSM.
245+
This quickly jumps to likely match positions instead of scanning every byte.
246+
247+
Good candidates are patterns that start with uncommon prefix characters, for example:
248+
249+
```regex
#tag-[a-z]+
@user-[0-9]+
\[section\]
{"key":
"name='[^']+'"
```
256+
257+
These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte search can skip ahead before running the matcher.
258+
259+
We found that using `strings.IndexByte` before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
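A minimal sketch of that pre-scan for the `#tag-[a-z]+` example follows. The `tagMatcher` variable is a stand-in built with Go's standard `regexp` package so that the example runs on its own; in real use it would be replaced by a call to the libfsm-generated function, whose name and signature are not shown here. `containsTag` is likewise a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagMatcher stands in for a libfsm-generated matcher for the unanchored
// pattern #tag-[a-z]+; the real generated function and its signature will differ.
var tagMatcher = regexp.MustCompile(`#tag-[a-z]+`)

// containsTag skips to the first '#' before running the matcher.
func containsTag(s string) bool {
	i := strings.IndexByte(s, '#')
	if i < 0 {
		return false // no candidate start byte, so no match is possible
	}
	// Every match must begin with '#', so nothing before index i can matter.
	return tagMatcher.MatchString(s[i:])
}

func main() {
	long := strings.Repeat("lorem ipsum ", 500) + "#tag-go"
	fmt.Println(containsTag(long))
	fmt.Println(containsTag("no marker here"))
}
```

The shortcut is only valid because every match of this pattern must start with `#`; a pattern with several possible first bytes would need a different pre-scan.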
260+
261+
## Pattern Matches Empty String Unintentionally
262+
263+
Pattern:
264+
265+
```regex
\s*
```
268+
269+
Will compile to code that always returns true.
270+
271+
This is only an issue if that is not what you intend.
272+
273+
**Fix options:**
274+
275+
* Require at least one match: `\s+`
276+
* Anchor context: `^\s+$` or, alternatively, use the `-Fb` flag
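The behaviour is easy to reproduce with any regex engine. The sketch below uses Go's standard `regexp` package rather than libfsm, with variable names invented for this example, to show that `\s*` accepts every input while the two fixes do not.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	optionalSpace = regexp.MustCompile(`\s*`)   // can match the empty string anywhere
	requiredSpace = regexp.MustCompile(`\s+`)   // needs at least one whitespace byte
	onlySpace     = regexp.MustCompile(`^\s+$`) // whole input must be whitespace
)

func main() {
	for _, s := range []string{"", "abc", "a b", "  "} {
		fmt.Printf("%-6q \\s*=%-5v \\s+=%-5v ^\\s+$=%v\n", s,
			optionalSpace.MatchString(s), requiredSpace.MatchString(s), onlySpace.MatchString(s))
	}
}
```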
