Documentation/Evolution/RegexSyntax.md
Lines changed: 95 additions & 80 deletions
@@ -1,5 +1,5 @@
1
1
<!--
2
-
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter delibration continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal.
2
+
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal.
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engine, but at least matches some form of trivia, e.g comments, quoted sequences e.g `\Q...\E`, and a potentially quantified expression.
65
+
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression nodes. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. A concatenation may consist of potentially quantified expressions, trivia such as inline comments, and quoted sequences `\Q...\E`.
QuantOperand -> AbsentFunction | Atom | Conditional | CustomCharClass | Group
77
77
```
78
78
79
-
A quantification consists of an operand optionally followed by a quantifier that specifier how many times it may be matched. An operand without a quantifier is matched once.
79
+
A quantification consists of an operand optionally followed by a quantifier that specifies how many times it may be matched. An operand without a quantifier is matched once.
80
80
81
81
The quantifiers supported are:
82
82
83
-
-`?`: 0 or 1 matches
84
-
-`*`: 0 or more matches
85
-
-`+`: 1 or more matches
86
-
-`{n,m}`: Between `n` and `m` (inclusive) matches
87
-
-`{n,}`: `n` or more matches
88
-
-`{,m}`: Up to `m` matches
89
-
-`{n}`: Exactly `n` matches
83
+
-`?`: 0 or 1 matches.
84
+
-`*`: 0 or more matches.
85
+
-`+`: 1 or more matches.
86
+
-`{n,m}`: Between `n` and `m` (inclusive) matches.
87
+
-`{n,}`: `n` or more matches.
88
+
-`{,m}`: Up to `m` matches.
89
+
-`{n}`: Exactly `n` matches.
90
90
91
-
A quantifier may optionally be followed by `?` or `+`, which adjust its semantics. If neither are specified, by default the quantification happens *eagerly*, meaning that it will try to maximize the number of matches made. However, if `?` is specified, quantification happens *reluctantly*, meaning that the number of matches will instead be minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional semantic that it may not be backtracked into to try a different number of matches.
91
+
A quantifier may optionally be followed by `?` or `+`, which adjusts its semantics. If neither is specified, the quantification happens *eagerly* by default, meaning that it will try to maximize the number of matches made. However, if `?` is specified, quantification happens *reluctantly*, meaning that the number of matches will instead be minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional constraint that it may not be backtracked into to try a different number of matches.
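
The difference between eager and reluctant quantification can be sketched with Python's `re` module, which is used here only as a convenient stand-in engine (an assumption; it is not the engine this document targets). Possessive quantifiers are left out because older Python versions do not accept them. The sample strings are made up for illustration.

```python
import re

text = "<a><b>"

# Eager (default): `.+` takes as much as it can, then backtracks only as needed.
eager = re.match(r"<.+>", text).group()

# Reluctant: `.+?` takes as little as it can while still allowing a match.
reluctant = re.match(r"<.+?>", text).group()

# A bounded quantifier `{n,m}` is eager too, so it prefers the upper bound.
bounded = re.match(r"a{2,3}", "aaaa").group()

# With a trailing `?`, the same quantifier prefers the lower bound.
bounded_reluctant = re.match(r"a{2,3}?", "aaaa").group()

print(eager, reluctant, bounded, bounded_reluctant)
```

The eager forms consume the longest run they can, while the reluctant forms stop at the first point where the rest of the pattern can still succeed.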
92
92
93
93
### Atom
94
94
@@ -107,7 +107,7 @@ Atom -> Anchor
107
107
| '\'? <Character>
108
108
```
109
109
110
-
Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
110
+
Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as metacharacters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.
147
-
148
-
Groups may be named, the characters of which may be any letter or number characters or the character `_`. However the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
149
-
150
-
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
146
+
Groups define a new scope that contains a recursive regular expression pattern. Groups have different semantics depending on how they are introduced, the details of which are laid out in the following sections.
151
147
152
148
Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
153
149
@@ -157,18 +153,24 @@ Note there are additional constructs that may syntactically appear similar to gr
157
153
-`(?:)`: A non-capturing group.
158
154
-`(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
159
155
156
+
Capturing groups produce captures, which remember the range of input matched for the scope of that group.
157
+
158
+
A capturing group may be named using any of the `NamedGroup` syntaxes. The characters of the group name may be any letter or number characters or the character `_`. However, the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
159
+
160
160
#### Atomic groups
161
161
162
-
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
162
+
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
163
163
164
164
#### Lookahead and lookbehind
165
165
166
-
-`(?=`: A lookahead that attempts to match against the group body, but does not advance.
167
-
-`(?!`: A negative lookahead that ensures the group body does not match, and does not advance.
168
-
-`(?<=`: A lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
169
-
-`(?!<`: A negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
166
+
These groups evaluate the input ahead or behind the current matching position, without advancing the input.
170
167
171
-
These groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are however also non-atomic variants:
168
+
-`(?=`: A lookahead, which matches against the input following the current matching position.
169
+
-`(?!`: A negative lookahead, which ensures a negative match against the input following the current matching position.
170
+
-`(?<=`: A lookbehind, which matches against the input prior to the current matching position.
171
+
-`(?<!`: A negative lookbehind, which ensures a negative match against the input prior to the current matching position.
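
The four lookaround forms can be sketched as follows, again using Python's `re` module as a stand-in engine (an assumption, not the engine described here). Python spells the negative lookbehind `(?<!...)`. The input strings are invented for the example.

```python
import re

text = "10 USD, 20 EUR"

# Lookahead: digits that are followed by " USD"; the suffix is not consumed.
followed_by_usd = re.findall(r"\d+(?= USD)", text)

# Negative lookahead: whole numbers that are not followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", text)

prices = "pay $5 or 7"

# Lookbehind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: digits not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\d+", prices)

print(followed_by_usd, not_followed_by_usd, after_dollar, not_after_dollar)
```

In each case only the digits appear in the result, since the lookaround inspects the neighbouring input without making it part of the match.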
172
+
173
+
The above groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are however also non-atomic variants:
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
189
191
192
+
#### Group numbering
193
+
194
+
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
195
+
196
+
```
197
+
(a((?:b)(?<c>c)d)(e)f)
198
+
^ ^ ^ ^
199
+
1 2 3 4
200
+
```
201
+
202
+
Non-capturing groups are skipped over when counting.
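
As a rough check of this numbering, the same pattern can be run through Python's `re` module (a stand-in engine chosen for illustration, not the one this document specifies). Python spells a named group `(?P<c>c)`, so the pattern is adjusted accordingly. The input `abcdef` is an assumption.

```python
import re

# Same shape as the example above, with Python's named group spelling.
pattern = re.compile(r"(a((?:b)(?P<c>c)d)(e)f)")
match = pattern.fullmatch("abcdef")

# Captures 1 through 4, ordered by the position of their opening `(`.
numbered = match.groups()

# Mapping from group name to group number.
named = dict(pattern.groupindex)

print(pattern.groups, numbered, named)
```

The non-capturing `(?:b)` does not receive a number, and the named group `c` still takes part in the numbering as group 3.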
203
+
204
+
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
A matching option sequence may be used as a group specifier, and denotes a change in matching options for the scope of that group. For example `(?x:a b c)` enables extended syntax for `a b c`. A matching option sequence may be part of an "isolated group" which has an implicit scope that wraps the remaining elements of the current group. For example, `(?x)a b c` also enables extended syntax for `a b c`.
201
223
224
+
If used in the branch of an alternation, an isolated group affects all the following branches of that alternation. For example, `a(?i)b|c|d` is treated as `a(?i:b)|(?i:c)|(?i:d)`.
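
A minimal sketch of these scoping rules, using Python's `re` module as a stand-in engine (an assumption). Recent Python versions reject an isolated option group that is not at the start of the pattern, so the alternation case is written in its expanded form by hand rather than as `a(?i)b|c|d`.

```python
import re

# Scoped form: extended syntax applies only inside the group, so the
# spaces between `a`, `b`, and `c` are ignored.
scoped = re.fullmatch(r"(?x:a b c)", "abc") is not None

# Isolated form at the start of the pattern: applies to everything after it.
isolated = re.fullmatch(r"(?x)a b c", "abc") is not None

# Without the option, the spaces are literal and must appear in the input.
literal_spaces = re.fullmatch(r"a b c", "abc") is None

# Expanded form of `a(?i)b|c|d`: the leading `a` stays case sensitive,
# while `b`, `c`, and `d` are matched case insensitively.
expanded = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")
results = [expanded.fullmatch(s) is not None for s in ("aB", "C", "D", "Ab")]

print(scoped, isolated, literal_spaces, results)
```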
225
+
202
226
We support all the matching options accepted by PCRE, ICU, and Oniguruma. In addition, we accept some matching options unique to our matching engine.
203
227
204
228
#### PCRE options
205
229
206
-
-`i`: Case insensitive matching
207
-
-`J`: Allows multiple groups to share the same name, which is otherwise forbidden
208
-
-`m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string
209
-
-`n`: Disables capturing of `(...)` groups. Named capture groups must be used instead.
230
+
-`i`: Case insensitive matching.
231
+
-`J`: Allows multiple groups to share the same name, which is otherwise forbidden.
232
+
-`m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string.
233
+
-`n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead.
210
234
-`s`: Changes `.` to match any character, including newlines.
211
235
-`U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy.
212
236
-`x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
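
The effect of a few of these options can be sketched with Python's `re` module, which happens to use the same letters for `i`, `m`, and `s` (an assumption of this sketch; `J`, `n`, and `U` have no direct Python equivalent and are not shown). The input strings are made up.

```python
import re

text = "ab\ncd"

# `m`: `^` and `$` match at line boundaries, not only at the string ends.
without_m = re.findall(r"^\w+$", text)
with_m = re.findall(r"(?m)^\w+$", text)

# `s`: `.` also matches a newline.
without_s = re.fullmatch(r"ab.cd", text) is not None
with_s = re.fullmatch(r"(?s)ab.cd", text) is not None

# `i`: case insensitive matching.
with_i = re.fullmatch(r"(?i)AB", "ab") is not None

print(without_m, with_m, without_s, with_s, with_i)
```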
@@ -217,10 +241,10 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add
217
241
218
242
#### Oniguruma options
219
243
220
-
-`D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`
221
-
-`S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`
222
-
-`W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`
223
-
-`P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`)
244
+
-`D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`.
245
+
-`S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`.
246
+
-`W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`.
247
+
-`P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`).
224
248
-`y{g}`, `y{w}`: Changes the meaning of `\X`, `\y`, `\Y`. These are mutually exclusive options, with `y{g}` specifying extended grapheme cluster mode, and `y{w}` specifying word mode.
225
249
226
250
#### Swift options
@@ -266,19 +290,19 @@ HexDigit -> [0-9a-zA-Z]
266
290
OctalDigit -> [0-7]
267
291
```
268
292
269
-
These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.
293
+
These sequences define a Unicode scalar value to be matched against. There is syntax for specifying the scalar value in both hex and octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
324
348
325
-
- Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
326
-
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
327
-
- Unicode scalars
328
-
- Named scalars
329
-
- Character properties
330
-
- Plain literal characters
349
+
- Builtin character classes, except for `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
350
+
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary).
351
+
- Unicode scalars.
352
+
- Named scalars.
353
+
- Character properties.
354
+
- Plain literal characters.
331
355
332
356
Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and digit characters.
333
357
334
358
Custom character classes may not be empty, e.g `[]` is forbidden. A custom character class may begin with the `]` character, in which case it is treated as literal, e.g `[]a]` is the custom character class of `]` and `a`.
335
359
336
-
Quoted sequences may be used to escape the contained characters, e.g `[\Q]\E]` is the character class of the literal character `[`.
360
+
Quoted sequences may be used to escape the contained characters, e.g `[a\Q]\E]` is also the character class of `]` and `a`.
337
361
338
362
Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` cannot be used to form a range, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`. `[a-c-d]` is the character class of `a`...`c`, `-`, and `d`.
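
Several of the literal and range rules above can be checked with Python's `re` module, used here as a stand-in engine (an assumption). Nested classes and quoted sequences are not shown because Python does not support them. The test strings are illustrative.

```python
import re

# A leading `]` is literal: `[]a]` is the class of `]` and `a`.
leading_bracket = re.findall(r"[]a]", "a]b")

# Inside a class, `\b` is the backspace character, not a word boundary.
backspace = re.fullmatch(r"[\b]", "\x08") is not None

# A `-` that cannot form a range is literal.
edge_hyphens = re.findall(r"[-a-]", "a-b")

# `a`...`c` is a range; the following `-` and `d` are literal members.
range_then_hyphen = re.findall(r"[a-c-d]", "abcd-e")

print(leading_bracket, backspace, edge_hyphens, range_then_hyphen)
```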
339
363
340
364
Operators may be used to apply set operations to character class members. The operators supported are:
341
365
342
-
-`&&`: Intersection of the LHS and RHS
343
-
-`--`: Subtraction of the RHS from the LHS
344
-
-`~~`: Symmetric difference of the RHS and LHS
366
+
-`&&`: Intersection of the LHS and RHS.
367
+
-`--`: Subtraction of the RHS from the LHS.
368
+
-`~~`: Symmetric difference of the RHS and LHS.
345
369
-`-`: .NET's spelling of subtracting the RHS from the LHS.
346
370
347
371
These operators have a lower precedence than the implicit union of members, e.g `[ac-d&&a[d]]` is an intersection of the character classes `[ac-d]` and `[ad]`.
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
438
462
439
-
#### Group numbering
440
-
441
-
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
442
-
443
-
```
444
-
(a((?:b)(?<c>c)d)(e)f)
445
-
^ ^ ^ ^
446
-
1 2 3 4
447
-
```
448
-
449
-
Non-capturing groups are skipped over when counting.
450
-
451
-
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
452
-
453
-
```
454
-
(a()(?|(b)(c)|(?:d)|(e)))(f)
455
-
^ ^ ^ ^ ^ ^
456
-
1 2 3 4 3 5
457
-
```
458
-
459
463
#### Backreferences
460
464
461
465
```
@@ -824,7 +828,18 @@ We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**
824
828
825
829
### Conditional references
826
830
827
-
**TODO: Decide**
831
+
```
832
+
KnownCondition -> 'R'
833
+
| 'R' NumberRef
834
+
| 'R&' <String> !')'
835
+
| '<' NameRef '>'
836
+
| "'" NameRef "'"
837
+
| 'DEFINE'
838
+
| 'VERSION' VersionCheck
839
+
| NumberRef
840
+
```
841
+
842
+
For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. We intend on canonicalizing to `(?(<name>))` to match the group name canonicalization.
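
For a feel of how a group condition behaves, here is a small sketch of the numeric `NumberRef` form using Python's `re` module as a stand-in engine (an assumption). Python spells a named condition `(?(name)...)` without angle brackets or quotes, so only the numbered form is shown. The pattern and inputs are invented.

```python
import re

# `(?(1)>)` checks whether group 1 participated in the match:
# a closing `>` is required only if an opening `<` was captured.
pattern = re.compile(r"(<)?\w+(?(1)>)")

inputs = ("<a>", "a", "<a", "a>")
results = {s: pattern.fullmatch(s) is not None for s in inputs}

print(results)
```

Inputs with balanced or absent brackets match, while an unpaired `<` or `>` does not.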