doc/advice.md
Lines changed: 41 additions & 9 deletions
Original file line number
Diff line number
Diff line change
@@ -87,7 +87,13 @@ A recommended workflow when using libfsm is:
87
87
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
88
88
```
89
89
90
-
4. Multiple patterns
90
+
4. Use multiple patterns
91
+
92
+
Execution complexity for the generated code is proportional to the length of the text being matched, not to the number of patterns.
93
+
Assuming your generated code isn't too large to compile, this means you can have as many patterns as you want,
94
+
for the same time it takes to execute a single pattern.
95
+
96
+
Take advantage of this.
91
97
92
98
```sh
93
99
# re - patterns from command line:
@@ -124,7 +130,19 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
124
130
125
131
## Writing Effective libfsm Patterns
126
132
127
-
1. Replace Broad Wildcards
133
+
Generally, to keep generated code compact, stick to the least expressive subset of features.
134
+
135
+
libfsm has no way to know in advance what text you'll be passing to its generated code.
136
+
For example, are you matching a string that you know will never contain a newline?
137
+
libfsm doesn't know that.
138
+
It has to generate code that's capable of handling any input.
139
+
You can help it out by making your patterns precise.
140
+
141
+
Think about what you intend your pattern to match, and what it's actually capable of matching given arbitrary text.
142
+
This helps restrict the scope of your pattern from arbitrary text to exactly what you mean.
143
+
The following bits of advice illustrate various specific ways to bring down this scope.
144
+
145
+
1. Replace broad wildcards
128
146
129
147
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
130
148
@@ -143,7 +161,14 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
143
161
144
162
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
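As a small illustration of the overlap problem, the sketch below compares a broad wildcard with a narrower character class. It uses Python's `re` module as a stand-in for PCRE (an assumption; it is not libfsm, and it does not show DFA size), and the `id=...;` patterns and sample text are hypothetical.

```python
import re

# Hypothetical patterns: a value terminated by ';'.
broad = re.compile(r"id=.*;")       # '.*' can also consume ';' itself
narrow = re.compile(r"id=[^;]*;")   # the value stops at the first ';'

text = "id=42;role=admin;"

# The broad pattern accepts the whole string, because '.*' overlaps
# with the ';' that was meant to terminate the value.
print(bool(broad.fullmatch(text)))   # True

# The narrow pattern rejects it: after 'id=42;' there is trailing text.
print(bool(narrow.fullmatch(text)))  # False
print(bool(narrow.fullmatch("id=42;")))  # True
```

The narrow class removes the ambiguity about where the value ends, which is the same ambiguity that forces the FSM to track many possible continuations.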
145
163
146
-
2. Anchor When Matching Full String
164
+
2. Take care with bounded repetition
165
+
166
+
If you have the pattern `^x{3,5}$`, libfsm's resulting DFA will be structured like "match an x, then match an x, then match an x, then match an x or skip it, then match an x or skip it, then report an overall match if at the end of input". It has to repeat the pattern, noting each time whether it's required or optional (beyond the lower count in `{min,max}`), because DFA execution doesn't have a counter, just the current state within the overall DFA.
167
+
168
+
When the subexpression (represented by `x` here) unintentionally matches too many things, all of them have to be spelled out again in every repeated copy.
169
+
So pay especially close attention to tightening up subexpressions in bounded repetition clauses.
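The unrolling described above can be sketched in a few lines of Python. This is a hand-written illustration, not libfsm's actual construction or output: the function names are hypothetical, and the subexpression is assumed to be a single-character class given as a string of allowed characters.

```python
import string


def build_bounded_repeat_dfa(char_class, lo, hi):
    """Build a chain DFA for ^[char_class]{lo,hi}$ as (transitions, accepting)."""
    # State i means "i repetitions consumed so far". There is no counter,
    # so every count from 0 to hi needs its own state, and every state
    # needs one edge per character the subexpression can match.
    transitions = {(i, ch): i + 1 for i in range(hi) for ch in char_class}
    accepting = set(range(lo, hi + 1))
    return transitions, accepting


def run_dfa(transitions, accepting, text):
    """Run the DFA over text; the only memory is the current state."""
    state = 0
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:  # no edge for this character: reject
            return False
    return state in accepting


transitions, accepting = build_bounded_repeat_dfa("x", 3, 5)
print(len(transitions))  # 5
print([n for n in range(8) if run_dfa(transitions, accepting, "x" * n)])  # [3, 4, 5]

# A wider subexpression multiplies the edges in every unrolled copy.
wide, _ = build_bounded_repeat_dfa(string.ascii_lowercase, 3, 5)
print(len(wide))  # 130
```

The edge count grows as the repetition bound times the size of the character class, which is why a loose subexpression inside `{min,max}` is costly.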
170
+
171
+
3. Anchor when matching full string
147
172
148
173
When the intention is to match an entire string, use anchors.
149
174
Use `^` at the beginning and `\z` for the true end of the string.
@@ -160,7 +185,7 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
160
185
web\d+\.example\.com
161
186
```
162
187
163
-
3. Prefer `\z` Over `$` for End-of-String
188
+
4. Prefer `\z` over `$` for End-of-String
164
189
165
190
`\z` always matches the end of the string.
166
191
`$` will also match a trailing newline at the end of the string,
@@ -180,7 +205,7 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
180
205
/bar$
181
206
```
182
207
183
-
4. Escape Special Characters When Used As Literal
208
+
5. Escape special characters when used as literals
184
209
185
210
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
186
211
If you mean to match them literally, escape them:
@@ -194,16 +219,23 @@ The man pages can be built by running `bmake -r doc`, then view with `man build/
194
219
|`(test)`|`\(test\)`|`(` and `)` begin/end a group |
195
220
| Markdown link `[t](u)`|`(\[[^]]*\]\([^)]*\))`| Matches `[text](url)` without crossing `]` or `)`|
196
221
197
-
5. Use Non-Capturing Groups
222
+
The `.` wildcard in particular is often mistakenly left unescaped in practice.
223
+
On testing, it will match a literal `.` as intended. But it will also match any other character.
224
+
This means that not only is your pattern incorrect (write negative test cases!),
225
+
but also this part of your FSM is 256 times larger than it should be.
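A negative test case makes the unescaped `.` mistake visible. The sketch below uses Python's `re` module as a stand-in for PCRE (an assumption; it is not libfsm), with the hostname pattern from earlier in this document and a hypothetical negative input.

```python
import re

unescaped = re.compile(r"web\d+.example.com")    # '.' matches any character
escaped = re.compile(r"web\d+\.example\.com")    # '\.' matches only a literal dot

positive = "web1.example.com"
negative = "web1xexample-com"  # hypothetical input that should be rejected

# Positive test: both patterns pass, so the mistake goes unnoticed.
print(bool(unescaped.fullmatch(positive)))  # True
print(bool(escaped.fullmatch(positive)))    # True

# Negative test: only the escaped pattern rejects the bad input.
print(bool(unescaped.fullmatch(negative)))  # True
print(bool(escaped.fullmatch(negative)))    # False
```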
226
+
227
+
6. Use non-capturing groups
198
228
199
229
Capture groups are _currently_ not supported (coming soon!).
200
-
If you need grouping for alternation or precedence, use non-capturing syntax `(?:...)`:
230
+
231
+
If you don't need to capture things, don't use capture.
232
+
If you need grouping for alternation or precedence, use PCRE's non-capturing syntax `(?:...)`:
201
233
202
234
```regex
203
235
# Correct
204
236
(?:private|no-store)
205
237
206
-
#Unsupported
238
+
# Not what's intended
207
239
(private|no-store)
208
240
```
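To see that the two forms differ only in capturing, here is a minimal check using Python's `re` module, which accepts the same `(?:...)` syntax. This is an assumption-laden stand-in for PCRE, not libfsm itself, and the sample strings are hypothetical.

```python
import re

non_capturing = re.compile(r"(?:private|no-store)")
capturing = re.compile(r"(private|no-store)")

# Number of capture groups each pattern defines.
print(non_capturing.groups)  # 0
print(capturing.groups)      # 1

# Both accept and reject exactly the same strings.
samples = ["private", "no-store", "public", ""]
for s in samples:
    same = bool(non_capturing.fullmatch(s)) == bool(capturing.fullmatch(s))
    print(s, same)
```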
209
241
@@ -214,7 +246,7 @@ This quickly jumps to likely match positions instead of scanning every byte.
214
246
215
247
Good candidates are patterns that start with uncommon prefix characters, for example: