Commit c4d4ffe

Markup.
1 parent 227f4ac commit c4d4ffe

1 file changed: +95 -92 lines changed

doc/advice.md

Lines changed: 95 additions & 92 deletions
@@ -19,7 +19,7 @@ As a result, libfsm avoids input-dependent slowdowns and is not susceptible to r
1919
- [Supported Code Generation Targets](#supported-code-generation-targets)
2020
- [Workflow Overview](#workflow-overview)
2121
- [Writing Effective libfsm Patterns](#writing-effective-libfsm-patterns)
22-
- [Byte Search Optimization (Optional)](#byte-search-optimization-optional)
22+
- [Byte Search Optimization](#byte-search-optimization)
2323
- [Troubleshooting](#troubleshooting)
2424
- [Pattern Matches Empty String Unintentionally](#pattern-matches-empty-string-unintentionally)
2525

@@ -38,7 +38,7 @@ These PCRE features will not compile:
3838

3939
Generate a matcher from a regex:
4040

41-
```bash
41+
```sh
4242
# Generate a Go matcher
4343
re -p -r pcre -l go -k str 'user\d+' > user_detector.go
4444
```
@@ -56,151 +56,154 @@ Adding code generation for new languages is straightforward and is defined in [s
5656

5757
## Workflow Overview
5858

59-
libfsm provides two main tools:
60-
- **`re`** takes patterns from command line
61-
- **`rx`** takes patterns from file
59+
libfsm provides two main tools for pattern matching:
60+
- **`re`** takes patterns from the command line
61+
- **`rx`** takes patterns from a file
6262

6363
A recommended workflow when using libfsm is:
6464

6565
1. Validate the Regex
6666

67-
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
67+
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
6868

6969
2. Verify libfsm Compatibility
7070

71-
```bash
72-
re -r pcre -l ast 'x*?'
73-
# Output: /x*?/:3: Unsupported operator
74-
# :3 indicates that the character at offset 3 in the pattern is rejected.
71+
If unsupported constructs exist, libfsm reports the failing location:
72+
```sh
73+
re -r pcre -l ast 'x*?'
74+
# Output: /x*?/:3: Unsupported operator
75+
```
76+
   In this example, `:3` indicates that the construct at byte offset 3 in the pattern is unsupported.
7577

76-
rx -r pcre -l ast -d declined.txt 'x*?'
77-
# Unsupported character in declined.txt
78-
```
78+
```sh
79+
# patterns with unsupported operators are output to declined.txt
80+
rx -r pcre -l ast -d declined.txt 'x*?'
81+
```
7982

80-
If unsupported constructs exist, libfsm reports the failing location.
8183

8284
3. Generate Code
8385

84-
```bash
85-
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
86-
```
86+
```sh
87+
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
88+
```
8789

8890
4. Multiple Patterns
8991

90-
```bash
91-
# re - patterns from command line:
92-
re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$'
93-
94-
# rx - patterns from file:
95-
rx -p -r pcre -l vmc -k str -d skipped.txt patterns.txt > detectors.c
96-
```
92+
```sh
93+
# re - patterns from command line:
94+
re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$'
95+
96+
# rx - patterns from file:
97+
rx -p -r pcre -l vmc -k str -d skipped.txt patterns.txt > detectors.c
98+
```
9799

98100
Both tools:
99101
* Combine all patterns into one function (like using `|` to join them)
100-
* Return `(bool, int)` - match status and pattern ID
102+
* Generate code that can return `(bool, int)` for the match status and pattern ID
101103
* The pattern ID is the argument position for `re` and the line number for `rx`
102-
* When encountering unsupported patterns: `rx` skips them to `-d` file and generates code with working patterns; `re` fails completely
104+
* When encountering unsupported patterns: `rx` can write them to the `-d` file and generate code from the supported patterns only; `re` fails completely
105+
106+
### Common Flags
103107

104-
### Flag Reference
105108
| Flag | Purpose | Common Options | Notes |
106-
| ---- | ---------------------------- | ------------------------------------------ | ---------------------------------------------------------------- |
109+
|:----:|:---------------------------- |:------------------------------------------ |:---------------------------------------------------------------- |
107110
| `-r` | Regex dialect | `pcre`, `literal`, `glob`, `native`, `sql` | `pcre` supports the widest set of features |
108111
| `-l` | Output language for printing | `go`, `rust`, `vmc`, `llvm`, `wasm`, `dot` | Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
109112
| `-k` | Generated function I/O API | `str`, `getc`, `pair` | `str` takes string, `pair` takes byte array, `getc` uses callback for streaming |
110113
| `-p` | Print mode | *(no value)* | Abbreviation of `-l fsm`. Prints the constructed FSM rather than executing it. |
111114
| `-d` | Declined patterns | filename | Only applies to `rx` (batch mode) |
112115

113-
This is not exhausted list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
116+
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
114117
The man pages can be built by running `bmake -r doc` and then viewed at `build/man/re.1/re.1`.
115118

116119
## Writing Effective libfsm Patterns
117120

118121
1. Replace Broad Wildcards
119122

120-
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
123+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
121124

122-
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
125+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
123126
124-
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
127+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
125128
126-
Use negated character classes to match only the allowed content:
129+
Use negated character classes to match only the allowed content:
127130
128-
| Avoid | Better |
129-
| ---------- | -------------- |
130-
| `<.*>` | `<[^>]*>` |
131-
| `\((.*)\)` | `\([^)]*\)` |
132-
| `price=.+` | `price=[0-9]+` |
133-
| `var\s.+=` | `var\s[^=]+=` |
131+
| Avoid | Better |
132+
| ---------- | -------------- |
133+
| `<.*>` | `<[^>]*>` |
134+
   | `\((.*)\)` | `\([^)]*\)` |
135+
| `price=.+` | `price=[0-9]+` |
136+
| `var\s.+=` | `var\s[^=]+=` |
134137
135-
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
138+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
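
The precision half of this advice can be checked quickly outside libfsm. The sketch below uses Python's `re` module as a stand-in (an assumption for illustration only; it is a backtracking engine, so it says nothing about DFA size) to show how far `.*` reaches compared with a negated character class. The sample text is made up.

```python
import re

# Hypothetical sample input containing two tags.
text = "<b>bold</b> and <i>italic</i>"

# Broad wildcard: `.*` runs past the first `>` and stops at the last one.
broad = re.search(r"<.*>", text).group(0)

# Negated class: the match cannot cross a `>`, so it ends at the first tag.
narrow = re.search(r"<[^>]*>", text).group(0)

print(broad)   # <b>bold</b> and <i>italic</i>
print(narrow)  # <b>
```

The narrower class states what the content may contain, which is the same property that lets libfsm avoid tracking "anything, possibly followed by `>`" at every step.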
136139
137140
2. Anchor When Matching Full String
138141
139-
When the intention is to match an entire string, use anchors.
140-
Use `^` at the beginning and `\z` for the true end of the string.
141-
142-
```regex
143-
# Correct: matches only this exact hostname
144-
# Matches "web12.example.com"
145-
# Does not match "foo-web12.example.com-bar"
146-
^web\d+\.example\.com\z 
147-
148-
# Incorrect: would match inside a larger string
149-
# Matches "web12.example.com"
150-
# Also matches "foo-web12.example.com-bar"
151-
web\d+\.example\.com
152-
```
142+
When the intention is to match an entire string, use anchors.
143+
Use `^` at the beginning and `\z` for the true end of the string.
144+
145+
```regex
146+
# Correct: matches only this exact hostname
147+
# Matches "web12.example.com"
148+
# Does not match "foo-web12.example.com-bar"
149+
^web\d+\.example\.com\z
150+
151+
# Incorrect: would match inside a larger string
152+
# Matches "web12.example.com"
153+
# Also matches "foo-web12.example.com-bar"
154+
web\d+\.example\.com
155+
```
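
The hostname example can be reproduced with any regex engine before handing the pattern to libfsm. This is a minimal sketch using Python's `re` module (an assumption, not part of libfsm); note that Python spells the true end-of-string anchor `\Z`, which behaves like PCRE's `\z`.

```python
import re

# Python's \Z plays the role of PCRE's \z (true end of string).
anchored = re.compile(r"^web\d+\.example\.com\Z")
unanchored = re.compile(r"web\d+\.example\.com")

samples = ["web12.example.com", "foo-web12.example.com-bar"]

# Map each sample to (anchored result, unanchored result).
results = {s: (bool(anchored.search(s)), bool(unanchored.search(s))) for s in samples}

for s, (a, u) in results.items():
    print(f"{s!r}: anchored={a} unanchored={u}")
# 'web12.example.com': anchored=True unanchored=True
# 'foo-web12.example.com-bar': anchored=False unanchored=True
```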
153156
154157
3. Prefer `\z` Over `$` for End-of-String
155158
156-
`\z` always matches the end of the string.
157-
`$` will also match a trailing newline at the end of the string,
158-
so if you use this in combination with capturing groups, you may not be capturing what you expect.
159-
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
160-
161-
```regex
162-
# Preferred: matches only if the string ends with "bar"
163-
# Matches "/foo/bar"
164-
# Does NOT match "/foo/bar\n"
165-
/bar\z
166-
167-
# Incorrect: allows a trailing newline,
168-
# which is usually unintended and adds unnecessary complexity
169-
# Matches "/foo/bar"
170-
# Also matches "/foo/bar\n"
171-
/bar$
172-
```
159+
`\z` always matches the end of the string.
160+
`$` will also match a trailing newline at the end of the string,
161+
so if you use this in combination with capturing groups, you may not be capturing what you expect.
162+
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
163+
164+
```regex
165+
# Preferred: matches only if the string ends with "bar"
166+
# Matches "/foo/bar"
167+
# Does NOT match "/foo/bar\n"
168+
/bar\z
169+
170+
# Incorrect: allows a trailing newline,
171+
# which is usually unintended and adds unnecessary complexity
172+
# Matches "/foo/bar"
173+
# Also matches "/foo/bar\n"
174+
/bar$
175+
```
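
The trailing-newline behavior of `$` is easy to overlook, so here is a small check. It is a sketch in Python's `re` module rather than libfsm itself; Python's `\Z` is used as the equivalent of PCRE's `\z`, and its default `$` likewise matches just before a final newline.

```python
import re

dollar = re.compile(r"/bar$")    # end of string, or just before a final "\n"
strict = re.compile(r"/bar\Z")   # end of string only (PCRE spelling: \z)

with_newline = "/foo/bar\n"
without_newline = "/foo/bar"

dollar_hits = [bool(dollar.search(s)) for s in (without_newline, with_newline)]
strict_hits = [bool(strict.search(s)) for s in (without_newline, with_newline)]

print(dollar_hits)  # [True, True]
print(strict_hits)  # [True, False]
```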
173176
174177
4. Escape Special Characters When Used As Literal
175178
176-
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
177-
If you mean to match them literally, escape them:
179+
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
180+
If you mean to match them literally, escape them:
178181
179-
| Literal You Want | Correct Regex | Explanation |
180-
|----------------------------|-----------------------------|--------------------------------------------|
181-
| `example.com` | `example\.com` | `.` matches any character unless escaped |
182-
| `a+b` | `a\+b` | `+` means “one or more” |
183-
| `price?` | `price\?` | `?` means “optional” |
184-
| `[value]` | `\[value\]` | `[` and `]` start/end a character class |
185-
| `(test)` | `\(test\)` | `(` and `)` begin/end a group |
186-
| Markdown link `[t](u)` | `(\[[^]]*\]\([^)]*\))` | Matches `[text](url)` without crossing `]` or `)` |
182+
| Literal You Want | Correct Regex | Explanation |
183+
|----------------------------|-----------------------------|--------------------------------------------|
184+
| `example.com` | `example\.com` | `.` matches any character unless escaped |
185+
| `a+b` | `a\+b` | `+` means “one or more” |
186+
| `price?` | `price\?` | `?` means “optional” |
187+
| `[value]` | `\[value\]` | `[` and `]` start/end a character class |
188+
| `(test)` | `\(test\)` | `(` and `)` begin/end a group |
189+
| Markdown link `[t](u)` | `(\[[^]]*\]\([^)]*\))` | Matches `[text](url)` without crossing `]` or `)` |
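
When the literal text comes from data rather than being typed by hand, escaping can be automated. The sketch below uses Python's `re.escape` to reproduce the simple rows of the table; this is an illustration only, since Python's escaping rules are not guaranteed to match what libfsm's PCRE dialect accepts in every case, so the output should be checked with `re -r pcre -l ast` before use.

```python
import re

literals = ["example.com", "a+b", "price?", "[value]", "(test)"]

# Escape each literal so that regex metacharacters are matched verbatim.
escaped = {literal: re.escape(literal) for literal in literals}

for literal, pattern in escaped.items():
    # Each escaped pattern matches its own literal text exactly.
    assert re.fullmatch(pattern, literal)
    print(f"{literal} -> {pattern}")

# Unescaped, `.` matches any character, so "exampleXcom" is accepted too.
loose = bool(re.fullmatch(r"example.com", "exampleXcom"))
exact = bool(re.fullmatch(escaped["example.com"], "exampleXcom"))
print(loose, exact)  # True False
```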
187190
188191
5. Use Non-Capturing Groups
189192
190-
Capture groups are _currently_ not supported (coming soon!).
191-
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
193+
Capture groups are _currently_ not supported (coming soon!).
194+
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
192195
193-
```regex
194-
# Correct
195-
(?:private|no-store)
196-
197-
# Unsupported
198-
(private|no-store)
199-
```
196+
```regex
197+
# Correct
198+
(?:private|no-store)
199+
200+
# Unsupported
201+
(private|no-store)
202+
```
200203
201-
## Byte Search Optimization (Optional)
204+
## Byte Search Optimization
202205
203-
Patterns that start with an **uncommon character** can be accelerated using an initial byte scan before running the FSM.
206+
Patterns that start with an uncommon character can be accelerated using an initial byte scan before running the FSM.
204207
This quickly jumps to likely match positions instead of scanning every byte.
205208
206209
Good candidates are patterns that start with uncommon prefix characters, for example:
