@@ -38,7 +38,7 @@ These PCRE features will not compile:
38
38
39
39
Generate a matcher from a regex:
40
40
41
-
```bash
41
+
```sh
42
42
# Generate a Go matcher
43
43
re -p -r pcre -l go -k str 'user\d+' > user_detector.go
44
44
```
@@ -56,151 +56,154 @@ Adding code generation for new languages is straightforward and is defined in [s
56
56
57
57
## Workflow Overview
58
58
59
-
libfsm provides two main tools:
60
-
-**`re`** takes patterns from command line
61
-
-**`rx`** takes patterns from file
59
+
libfsm provides two main tools for pattern matching:
60
+
-**`re`** takes patterns from the command line
61
+
-**`rx`** takes patterns from a file
62
62
63
63
A recommended workflow when using libfsm is:
64
64
65
65
1. Validate the Regex
66
66
67
-
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
67
+
Test behavior using any PCRE-compatible tool (e.g., [pcregrep(1)](https://man7.org/linux/man-pages/man1/pcregrep.1.html) on the CLI or [https://regex101.com/](https://regex101.com/) in the browser).
68
68
69
69
2. Verify libfsm Compatibility
70
70
71
-
```bash
72
-
re -r pcre -l ast 'x*?'
73
-
# Output: /x*?/:3: Unsupported operator
74
-
# :3 indicates that the character at offset 3 in the pattern is rejected.
71
+
If unsupported constructs exist, libfsm reports the failing location:
72
+
```sh
73
+
re -r pcre -l ast 'x*?'
74
+
# Output: /x*?/:3: Unsupported operator
75
+
```
76
+
In this example, `:3` is the byte offset in the pattern at which libfsm found the unsupported construct.
75
77
76
-
rx -r pcre -l ast -d declined.txt 'x*?'
77
-
# Unsupported character in declined.txt
78
-
```
78
+
```sh
79
+
# Patterns with unsupported operators are written to declined.txt
80
+
rx -r pcre -l ast -d declined.txt 'x*?'
81
+
```
79
82
80
-
If unsupported constructs exist, libfsm reports the failing location.
81
83
82
84
3. Generate Code
83
85
84
-
```bash
85
-
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
86
-
```
86
+
```sh
87
+
re -p -r pcre -l rust -k str '^item-[A-Z]{3}\z' > item_detector.rs
88
+
```
87
89
88
90
4. Multiple Patterns
89
91
90
-
```bash
91
-
# re - patterns from command line:
92
-
re -p -r pcre -l go -k str '^x?a b+c$' '^x*def?$' '^x$'
|`-r`| Regex dialect |`pcre`, `literal`, `glob`, `native`, `sql`|`pcre` supports the widest set of features |
108
111
|`-l`| Output language for printing |`go`, `rust`, `vmc`, `llvm`, `wasm`, `dot`| Use `vmc` for `C` code. Pipe `dot` into `idot` for visualization |
109
112
|`-k`| Generated function I/O API |`str`, `getc`, `pair`|`str` takes a string, `pair` takes a byte array, `getc` uses a callback for streaming |
110
113
|`-p`| Print mode |*(no value)*| Shorthand for `-l fsm`. Prints the constructed FSM rather than executing it. |
111
114
|`-d`| Declined patterns | filename | Only applies to `rx` (batch mode) |
112
115
113
-
This is not exhausted list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
116
+
This is not an exhaustive list. For full flag details, see [include/fsm/options.h](../include/fsm/options.h) and the [man pages](../man).
114
117
The man pages can be built by running `bmake -r doc` and then viewed from the build output, e.g. `build/man/re.1/re.1`.
115
118
116
119
## Writing Effective libfsm Patterns
117
120
118
121
1. Replace Broad Wildcards
119
122
120
-
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
123
+
Avoid `.*` and `.+` when possible. Wildcards match “anything,” which is often imprecise. And although they look compact, libfsm must enumerate every possible byte and continuation. This quickly leads to large DFAs.
121
124
122
-
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
125
+
For example, a double-quoted string should not use `".*"` because the content cannot contain an unescaped quote. Using `.*` forces libfsm to consider all characters -- including both the presence and absence of the closing `"` at every step. This greatly increases the number of states.
123
126
124
-
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
127
+
Instead, restrict it to the actual valid characters `"[^"\r\n]*"`, which matches only what is allowed and will keep the DFA more compact.
125
128
126
-
Use negated character classes to match only the allowed content:
129
+
Use negated character classes to match only the allowed content:
127
130
128
-
| Avoid | Better |
129
-
| ---------- | -------------- |
130
-
|`<.*>`|`<[^>]*>`|
131
-
|`\((.*)\)`|`\([^)]*\)`|
132
-
|`price=.+`|`price=[0-9]+`|
133
-
|`var\s.+=`|`var\s[^=]+=`|
131
+
| Avoid | Better |
132
+
| ---------- | -------------- |
133
+
|`<.*>`|`<[^>]*>`|
134
+
|`\((.*)\)`|`\([^)]*\)`|
135
+
|`price=.+`|`price=[0-9]+`|
136
+
|`var\s.+=`|`var\s[^=]+=`|
134
137
135
-
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
138
+
The overlap between `.*` or `.+` and strings that follow is often the cause of an “explosion” in the size of the generated FSM. So when compilation is slow or generated output is large, look for `.*` and `.+` first and replace them with a narrower character class.
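The imprecision of a broad wildcard can be checked before handing a pattern to libfsm. The sketch below uses Python's `re` module purely as a stand-in PCRE-style engine (an assumption for illustration; it is not libfsm and says nothing about DFA size). The sample text is made up. It shows only that `".*"` and `"[^"\r\n]*"` accept different strings.

```python
import re

# Hypothetical sample input containing two quoted strings on one line.
text = 'say "a" and "b"'

# Broad wildcard: greedy `.*` runs on to the last quote on the line.
broad = re.search(r'".*"', text).group(0)

# Negated class: the match cannot cross a quote, so it stops at the first one.
narrow = re.search(r'"[^"\r\n]*"', text).group(0)

print(broad)   # "a" and "b"
print(narrow)  # "a"
```

The narrower class matches only the first quoted string, which is usually the intended behavior.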
136
139
137
140
2. Anchor When Matching Full String
138
141
139
-
When the intention is to match an entire string, use anchors.
140
-
Use `^` at the beginning and `\z` for the true end of the string.
141
-
142
-
```regex
143
-
# Correct: matches only this exact hostname
144
-
# Matches "web12.example.com"
145
-
# Does not match "foo-web12.example.com-bar"
146
-
^web\d+\.example\.com\z
147
-
148
-
# Incorrect: would match inside a larger string
149
-
# Matches "web12.example.com"
150
-
# Also matches "foo-web12.example.com-bar"
151
-
web\d+\.example\.com
152
-
```
142
+
When the intention is to match an entire string, use anchors.
143
+
Use `^` at the beginning and `\z` for the true end of the string.
144
+
145
+
```regex
146
+
# Correct: matches only this exact hostname
147
+
# Matches "web12.example.com"
148
+
# Does not match "foo-web12.example.com-bar"
149
+
^web\d+\.example\.com\z
150
+
151
+
# Incorrect: would match inside a larger string
152
+
# Matches "web12.example.com"
153
+
# Also matches "foo-web12.example.com-bar"
154
+
web\d+\.example\.com
155
+
```
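The difference between the two patterns above can be reproduced with any PCRE-style engine. This is a minimal sketch using Python's `re` module as a stand-in (an assumption; it is not libfsm's generated matcher). Python spells PCRE's `\z` as `\Z`, so the anchored pattern is adjusted accordingly.

```python
import re

# Python writes PCRE's \z (absolute end of string) as \Z.
anchored = re.compile(r'^web\d+\.example\.com\Z')
unanchored = re.compile(r'web\d+\.example\.com')

inputs = ["web12.example.com", "foo-web12.example.com-bar"]

# Map each input to (anchored result, unanchored result).
results = {
    s: (bool(anchored.search(s)), bool(unanchored.search(s)))
    for s in inputs
}

for s, (a, u) in results.items():
    print(f"{s!r}: anchored={a} unanchored={u}")
```

Only the anchored pattern rejects the hostname embedded in a longer string.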
153
156
154
157
3. Prefer `\z` Over `$` for End-of-String
155
158
156
-
`\z` always matches the end of the string.
157
-
`$` will also match a trailing newline at the end of the string,
158
-
so if you use this in combination with capturing groups, you may not be capturing what you expect.
159
-
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
160
-
161
-
```regex
162
-
# Preferred: matches only if the string ends with "bar"
163
-
# Matches "/foo/bar"
164
-
# Does NOT match "/foo/bar\n"
165
-
/bar\z
166
-
167
-
# Incorrect: allows a trailing newline,
168
-
# which is usually unintended and adds unnecessary complexity
169
-
# Matches "/foo/bar"
170
-
# Also matches "/foo/bar\n"
171
-
/bar$
172
-
```
159
+
`\z` always matches the end of the string.
160
+
`$` will also match a trailing newline at the end of the string,
161
+
so if you use this in combination with capturing groups, you may not be capturing what you expect.
162
+
Also, `\z` produces a smaller FSM, so it is better to use it in places where `\n` cannot appear.
163
+
164
+
```regex
165
+
# Preferred: matches only if the string ends with "bar"
166
+
# Matches "/foo/bar"
167
+
# Does NOT match "/foo/bar\n"
168
+
/bar\z
169
+
170
+
# Incorrect: allows a trailing newline,
171
+
# which is usually unintended and adds unnecessary complexity
172
+
# Matches "/foo/bar"
173
+
# Also matches "/foo/bar\n"
174
+
/bar$
175
+
```
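The trailing-newline behavior described above can be sketched with Python's `re` module as a stand-in engine (an assumption; libfsm itself is not involved). Python's `$` behaves like PCRE's default `$`, and Python writes `\z` as `\Z`.

```python
import re

dollar = re.compile(r'/bar$')    # also matches just before a final "\n"
strict = re.compile(r'/bar\Z')   # Python's spelling of PCRE's \z

inputs = ["/foo/bar", "/foo/bar\n"]

# Map each input to ($ result, \Z result).
end_results = {
    s: (bool(dollar.search(s)), bool(strict.search(s)))
    for s in inputs
}

for s, (d, z) in end_results.items():
    print(f"{s!r}: dollar={d} strict={z}")
```

Both patterns accept `/foo/bar`, but only `$` accepts the version with a trailing newline.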
173
176
174
177
4. Escape Special Characters When Used As Literal
175
178
176
-
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
177
-
If you mean to match them literally, escape them:
179
+
Many characters have special meaning in regex (for example `.`, `+`, `*`, `?`, `[`, `(`).
180
+
If you mean to match them literally, escape them:
178
181
179
-
| Literal You Want | Correct Regex | Explanation |