Commit 4351273

Blurb on bounded repetition.
(Contributed by Scott)
1 parent bf867f2 commit 4351273

1 file changed (+41, -9 lines changed)

doc/advice.md

Lines changed: 41 additions & 9 deletions
@@ -87,7 +87,13 @@ A recommended workflow when using libfsm is:
8787
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
8888
```
8989

90-
4. Multiple patterns
90+
4. Use multiple patterns
91+
92+
Execution complexity for the generated code is proportional to the length of the text being matched, not to the number of patterns.
93+
Assuming your generated code isn't too large to compile, this means you can have as many patterns as you want,
94+
for the same time it takes to execute a single pattern.
95+
96+
Take advantage of this.
9197
9298
```sh
9399
# re - patterns from command line:
@@ -124,7 +130,19 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
124130
125131
## Writing Effective libfsm Patterns
126132
127-
1. Replace Broad Wildcards
133+
Generally, to keep generated code compact, stick to the least expressive subset of features that still says what you mean.
134+
135+
libfsm has no way to know in advance what text you'll be passing to its generated code.
136+
For example, are you matching a string that you know will never contain a newline?
137+
libfsm doesn't know that.
138+
It has to generate code that's capable of handling any input.
139+
You can help it out by making your patterns precise.
140+
141+
Think about what you intend your pattern to match, and what it's actually capable of matching given arbitrary text.
142+
This helps restrict the scope of your pattern from arbitrary text to exactly what you mean.
143+
The advice below shows specific ways to narrow that scope.
144+
145+
1. Replace broad wildcards
128146
129147
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
130148
@@ -143,7 +161,14 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
143161
144162
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
145163
146-
2. Anchor When Matching Full String
164+
2. Take care with bounded repetition
165+
166+
If you have the pattern `^x{3,5}$`, the DFA libfsm produces is structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". DFA execution has no counter, only the current state within the overall DFA, so libfsm has to repeat the subexpression once per count, noting each time whether that copy is required or optional (beyond the lower count in `{min,max}`).
167+
168+
When the subexpression (represented by `x`) unintentionally matches too many things, every one of them has to be spelled out again for each repetition.
169+
So pay especially close attention to tightening up subexpressions in bounded repetition clauses.
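
As a rough illustration of why the subexpression inside a bounded repetition matters, here is a toy Python model of the DFA shape described above. It is a simplified sketch, not libfsm's actual construction: `chain_dfa` and `matches` are hypothetical helper names, and the "any byte" case assumes a subexpression that accepts all 256 byte values.

```python
def chain_dfa(symbols, lo, hi):
    """Toy DFA for ^X{lo,hi}$, where X matches any single byte in `symbols`.

    State i means "i repetitions consumed so far". There is no counter:
    each repetition gets its own state, and states lo..hi are accepting.
    """
    transitions = {
        (state, sym): state + 1
        for state in range(hi)      # states 0..hi-1 have outgoing edges
        for sym in symbols          # one edge per byte that X accepts
    }
    accepting = set(range(lo, hi + 1))
    return transitions, accepting


def matches(dfa, data):
    """Run the toy DFA over a bytes object."""
    transitions, accepting = dfa
    state = 0
    for byte in data:
        state = transitions.get((state, byte))
        if state is None:           # no edge for this byte: reject
            return False
    return state in accepting


# ^x{3,5}$ with a literal x, versus the same shape where X accepts any byte.
narrow = chain_dfa(set(b"x"), 3, 5)
broad = chain_dfa(set(range(256)), 3, 5)

print(len(narrow[0]), len(broad[0]))   # edges in each toy DFA
print(matches(narrow, b"xxxx"), matches(narrow, b"xx"))
```

In this model both DFAs have the same number of states, but the broad one carries 256 edges per repetition instead of one, which is the cost the advice above is warning about.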
170+
171+
3. Anchor when matching full string
147172

148173
When the intention is to match an entire string, use anchors.
149174
Use `^` at the beginning and `\z` for the true end of the string.
@@ -160,7 +185,7 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
160185
web\d+\.example\.com
161186
```
162187

163-
3. Prefer `\z` Over `$` for End-of-String
188+
4. Prefer `\z` over `$` for end-of-string
164189

165190
`\z` always matches the end of the string.
166191
`$` will also match a trailing newline at the end of the string,
@@ -180,7 +205,7 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
180205
/bar$
181206
```
182207

183-
4. Escape Special Characters When Used As Literal
208+
5. Escape special characters when used as literals
184209

185210
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
186211
If you mean to match them literally, escape them:
@@ -194,16 +219,23 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
194219
| `(test)` | `\(test\)` | `(` and `)` begin/end a group |
195220
| Markdown link `[t](u)` | `(\[[^]]*\]\([^)]*\))` | Matches `[text](url)` without crossing `]` or `)` |
196221
197-
5. Use Non-Capturing Groups
222+
The `.` wildcard in particular is often mistakenly left unescaped in practice.
223+
On testing, it will match a literal `.` as intended. But it will also match any other character.
224+
This means that not only is your pattern incorrect (write negative test cases!),
225+
but also this part of your FSM is up to 256 times larger than it should be.
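
A negative test case is what catches this. The sketch below uses Python's `re` module purely for illustration (it is not libfsm, and its dialect differs from PCRE in places); `accepts` is a hypothetical helper, and the hostnames are made-up test inputs.

```python
import re

UNESCAPED = r"web\d+.example.com"     # `.` left as a wildcard by mistake
ESCAPED = r"web\d+\.example\.com"     # `.` matched literally


def accepts(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None


positive = "web1.example.com"         # should match
negative = "web1-example-com"         # should NOT match

# The positive case passes for both patterns, so it cannot tell them apart.
print(accepts(UNESCAPED, positive), accepts(ESCAPED, positive))

# Only the negative case exposes the unescaped wildcard.
print(accepts(UNESCAPED, negative), accepts(ESCAPED, negative))
```

The positive input is accepted by both patterns; only the negative input shows that the unescaped version accepts text it was never meant to.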
226+
227+
6. Use non-capturing groups
198228
199229
Capture groups are _currently_ not supported (coming soon!).
200-
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
230+
231+
If you don't need to capture anything, don't use capture groups.
232+
If you need grouping for alternation or precedence, use PCRE's non-capturing syntax `(?:...)`:
201233
202234
```regex
203235
# Correct
204236
(?:private|no-store)
205237
206-
# Unsupported
238+
# Not what's intended
207239
(private|no-store)
208240
```
209241
@@ -214,7 +246,7 @@ This quickly jumps to likely match positions instead of scanning every byte.
214246
215247
Good candidates are patterns that start with uncommon prefix characters, for example:
216248
217-
```
249+
```regex
218250
#tag-[a-z]+
219251
@user-[0-9]+
220252
\[section\]
