Commit 20f262c

Revision #2
1 parent 7306570 commit 20f262c

1 file changed

+46
-75
lines changed

doc/GUIDE.md

Lines changed: 46 additions & 75 deletions
Original file line number | Diff line number | Diff line change
@@ -2,13 +2,13 @@
22

33
libfsm compiles regular expressions to deterministic finite state machines (FSMs) and generates executable code. FSM-based matching runs in **linear time O(n)** with **no backtracking**.
44

5-
> Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6-
> This means the same pattern can have different execution costs depending on the input.
7-
>
8-
> libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9-
> At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10-
>
11-
> As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
5+
Regex engines like PCRE use backtracking to explore multiple possible match paths at **runtime**.
6+
This means the same pattern can have different execution costs depending on the input.
7+
8+
libfsm instead resolves all match decisions at **compile time** by constructing a Deterministic Finite Automaton (DFA).
9+
At runtime, matching is a single linear pass over the input with no alternative paths to explore.
10+
11+
As a result, libfsm avoids input-dependent slowdowns and is not susceptible to regular expression–based denial-of-service (ReDoS) attacks.
1212

1313
**libfsm is not a drop-in replacement for traditional regex engines.** It only supports patterns that can be compiled to FSMs.
1414

@@ -21,6 +21,7 @@ libfsm compiles regular expressions to deterministic finite state machines (FSMs
2121
- [Writing Effective libfsm Patterns](#writing-effective-libfsm-patterns)
2222
- [Byte Search Optimization (Optional)](#byte-search-optimization-optional)
2323
- [Troubleshooting](#troubleshooting)
24+
- [Pattern Matches Empty String Unintentionally](#pattern-matches-empty-string-unintentionally)
2425

2526
## What libfsm Cannot Do
2627

@@ -33,8 +34,6 @@ These PCRE features will not compile:
3334
* Conditional expressions (`(?(condition)then|else)`)
3435
* Recursion and subroutines (`(?R)`, `(?1)`)
3536

36-
---
37-
3837
## Quick Start
3938

4039
Generate a matcher from a regex:
@@ -46,21 +45,14 @@ re -p -r pcre -l go -k str 'user\d+' > user_detector.go
4645

4746
This produces a standalone matcher function.
4847

49-
---
50-
5148
## Supported Code Generation Targets
5249

5350
libfsm provides stable, “first-class” code generation for:
51+
- High-level languages: C (via `-l vmc`), Go, Rust
52+
- LLVM IR
53+
- Native WebAssembly
5454

55-
| Category | Output |
56-
| -------------------- | ------------------------------ |
57-
| High-level languages | **C (via `-l vmc`), Go, Rust** |
58-
| Toolchains | **LLVM IR** |
59-
| Virtualization | **Native WebAssembly** |
60-
61-
> Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).
62-
63-
---
55+
Adding code generation for new languages is straightforward and is defined in [src/libfsm/print/](../src/libfsm/print/).
6456

6557
## Workflow Overview
6658

@@ -70,11 +62,11 @@ libfsm provides two main tools:
7062

7163
A recommended workflow when using libfsm is:
7264

73-
### 1. Validate the Regex
65+
1. Validate the Regex
7466

7567
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
7668

77-
### 2. Verify libfsm Compatibility
69+
2. Verify libfsm Compatibility
7870

7971
```bash
8072
re -r pcre -l ast 'x*?'
@@ -87,13 +79,13 @@ rx -r pcre -l ast -d declined.txt 'x*?'
8779

8880
If unsupported constructs exist, libfsm reports the failing location.
8981

90-
### 3. Generate Code
82+
3. Generate Code
9183

9284
```bash
9385
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
9486
```
9587

96-
### 4. Multiple Patterns
88+
4. Multiple Patterns
9789

9890
```bash
9991
# re - patterns from command line:
@@ -109,32 +101,29 @@ Both tools:
109101
* Pattern ID is argument position for `re`, line number for `rx`
110102
* When encountering unsupported patterns: `rx` writes them to the `-d` file and generates code for the remaining patterns; `re` fails completely
111103

112-
---
113-
114104
### Flag Reference
115105
| Flag | Purpose | Common Options | Notes |
116106
| ---- | ---------------------------- | ------------------------------------------ | ---------------------------------------------------------------- |
117107
| `-r` | Regex dialect | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features |
118108
| `-l` | Output language for printing | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
119109
| `-k` | Generated function I/O API | `str`, `getc`, `pair` | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
120110
| `-p` | Print mode | *(no value)* | Shorthand for `-l fsm`. Prints the constructed FSM rather than executing it. |
121-
| `-d` | Declined | filename | Only applies to `rx` (batch mode) |
111+
| `-d` | Declined patterns | filename | Only applies to `rx` (batch mode) |
122112

123113
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
124-
The man pages can be built by running `bmake doc`, then view with `build/man/re.1/re.1`.
125-
126-
---
114+
The man pages can be built by running `bmake -r doc` and then viewed at `build/man/re.1/re.1`.
127115

128116
## Writing Effective libfsm Patterns
129117

130-
### 1. Replace Broad Wildcards
118+
1. Replace Broad Wildcards
119+
120+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
131121

132-
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise and forces libfsm to build a large DFA.
122+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
133123

134-
For example, a double-quoted string should not use `".*?"` because the content cannot contain an unescaped quote.
135-
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowd and will keep the DFA more compact.
124+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
136125

137-
Use negated character classes:
126+
Use negated character classes to match only the allowed content:
138127

139128
| Avoid | Better |
140129
| ---------- | -------------- |
@@ -143,45 +132,46 @@ Use negated character classes:
143132
| `price=.+` | `price=[0-9]+` |
144133
| `var\s.+=` | `var\s[^=]+=` |
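
The imprecision of a broad wildcard can be seen with any regex engine. The sketch below uses Go's standard `regexp` package as a stand-in (it is not libfsm, and the helper name `firstQuoted` is only illustrative) to compare what `".*"` and `"[^"\r\n]*"` pick out of the same input. It shows the difference in what is matched, not the difference in DFA size.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstQuoted returns the leftmost match of pattern in input,
// or the empty string when nothing matches.
// Hypothetical helper for illustration; it uses Go's regexp, not libfsm.
func firstQuoted(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	input := `say "a" and "b"`

	// The wildcard runs past the first closing quote.
	fmt.Println(firstQuoted(`".*"`, input))

	// The negated class stops at the first closing quote.
	fmt.Println(firstQuoted(`"[^"\r\n]*"`, input))
}
```

With the wildcard, the match spans from the first quote to the last one on the line; with the negated class, it covers only the first quoted string.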
145134

146-
> This is often the cause of an “explosion” in the size of the generated FSM.
147-
>
148-
> See [Compilation Takes Too Long](#compilation-takes-too-long) for more details.
135+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
149136

150-
---
151-
152-
### 2. Anchor When Matching Full String
137+
2. Anchor When Matching Full String
153138

154139
When the intention is to match an entire string, use anchors.
155140
Use `^` at the beginning and `\z` for the true end of the string.
156141

157142
```regex
158143
# Correct: matches only this exact hostname
144+
# Matches "web12.example.com"
145+
# Does not match "foo-web12.example.com-bar"
159146
^web\d+\.example\.com\z 
160147
161148
# Incorrect: would match inside a larger string
162-
web\d+\.example\.com # also matches "foo-web12.example.com-bar"
149+
# Matches "web12.example.com"
150+
# Also matches "foo-web12.example.com-bar"
151+
web\d+\.example\.com
163152
```
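
The effect of the anchors can be checked quickly outside libfsm. This is a minimal sketch using Go's standard `regexp` package, which also accepts `^`, `\d` and `\z`; the variable names are illustrative only, and a libfsm-generated matcher is assumed to agree with it for these two patterns.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Anchored at both ends: the whole string must be the hostname.
	anchoredHost = regexp.MustCompile(`^web\d+\.example\.com\z`)
	// Unanchored: the hostname may appear anywhere inside the string.
	unanchoredHost = regexp.MustCompile(`web\d+\.example\.com`)
)

func main() {
	inputs := []string{"web12.example.com", "foo-web12.example.com-bar"}
	for _, s := range inputs {
		fmt.Printf("%q anchored=%v unanchored=%v\n",
			s, anchoredHost.MatchString(s), unanchoredHost.MatchString(s))
	}
}
```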
164153

165-
---
166-
167-
### 3. Prefer `\z` Over `$` for End-of-String
154+
3. Prefer `\z` Over `$` for End-of-String
168155

169156
`\z` always matches the end of the string.
170157
`$` will also match a trailing newline at the end of the string,
171158
so if you use this in combination with capturing groups, you may not be capturing what you expect.
172-
Also, `\z` is more efficient, so it is better to use it in places where `\n` cannot appear.
159+
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
173160

174161
```regex
175-
# Preferred
176-
/foo\z
177-
178-
# Risky: $ may allow an extra newline
179-
/foo$
162+
# Preferred: matches only if the string ends with "bar"
163+
# Matches "/foo/bar"
164+
# Does NOT match "/foo/bar\n"
165+
/bar\z
166+
167+
# Incorrect: allows a trailing newline,
168+
# which is usually unintended and adds unnecessary complexity
169+
# Matches "/foo/bar"
170+
# Also matches "/foo/bar\n"
171+
/bar$
180172
```
181173

182-
---
183-
184-
### 4. Escape Special Characters When Used As Literal
174+
4. Escape Special Characters When Used As Literal
185175

186176
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
187177
If you mean to match them literally, escape them:
@@ -195,9 +185,7 @@ If you mean to match them literally, escape them:
195185
| `(test)` | `\(test\)` | `(` and `)` begin/end a group |
196186
| Markdown link `[t](u)` | `(\[[^]]*\]\([^)]*\))` | Matches `[text](url)` without crossing `]` or `)` |
197187

198-
---
199-
200-
### 5. Use Non-Capturing Groups
188+
5. Use Non-Capturing Groups
201189

202190
Capture groups are _currently_ not supported (coming soon!).
203191
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
@@ -210,8 +198,6 @@ If you need grouping for alternation or precedence, use non-capturing syntax `(?
210198
(private|no-store)
211199
```
212200

213-
---
214-
215201
## Byte Search Optimization (Optional)
216202

217203
Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
@@ -231,11 +217,7 @@ These prefixes (`#`, `@`, `[`, `{`, `'`, `"`) are rare in normal text, so a byte
231217

232218
We found using `strings.IndexByte` before calling the generated matcher in Go code significantly improved performance when matching strings with a large (>5k) leading prefix.
233219

234-
---
235-
236-
## Troubleshooting
237-
238-
### Pattern Matches Empty String Unintentionally
220+
## Pattern Matches Empty String Unintentionally
239221

240222
Pattern:
241223

@@ -251,14 +233,3 @@ This is only an issue if that is not what you intend.
251233

252234
* Require at least one match: `\s+`
253235
* Anchor context: `^\s+$` or, alternatively, use the `-Fb` flag
254-
255-
### Compilation Takes Too Long
256-
257-
This is often caused by unrestricted wildcards (`.*`, `.+`).
258-
Although they look compact, libfsm must enumerate every possible byte and every possible continuation, causing the state machine to grow quickly.
259-
260-
For example, to match `var anything =`, a pattern such as `var\s.+=` looks simple, but `.+` forces libfsm to encode every possible byte
261-
and every possible continuation -- including both the presence and absence of `=`. This drastically increases the number of states.
262-
263-
When compilation is slow, look for broad wildcards and replace them with more specific character classes (as shown [above](#writing-effective-libfsm-patterns)),
264-
such as: `var\s[^=]+=`.
