Commit ff717a3

Update RegexSyntax.md
1 parent 5a2517f commit ff717a3

1 file changed

+95
-80
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 95 additions & 80 deletions
@@ -1,5 +1,5 @@
11
<!--
2-
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter delibration continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal.
2+
Hello, we want to issue an update to [Regular Expression Literals](https://forums.swift.org/t/pitch-regular-expression-literals/52820) and prepare for a formal proposal. The great delimiter deliberation continues to unfold, so in the meantime, we have a significant amount of surface area to present for review/feedback: the syntax _inside_ a regex literal.
33
-->
44

55
# Regex Literal Interior Syntax
@@ -62,7 +62,7 @@ Concatenation -> (!'|' !')' ConcatComponent)*
6262
ConcatComponent -> Trivia | Quote | Quantification
6363
```
6464

65-
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression patterns. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. The `ConcatComponent` token varies across engine, but at least matches some form of trivia, e.g comments, quoted sequences e.g `\Q...\E`, and a potentially quantified expression.
65+
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression nodes. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. A concatenation may consist of potentially quantified expressions, trivia such as inline comments, and quoted sequences `\Q...\E`.
6666

6767
### Quantification
6868

@@ -76,19 +76,19 @@ Range -> ',' <Int> | <Int> ',' <Int>? | <Int>
7676
QuantOperand -> AbsentFunction | Atom | Conditional | CustomCharClass | Group
7777
```
7878

79-
A quantification consists of an operand optionally followed by a quantifier that specifier how many times it may be matched. An operand without a quantifier is matched once.
79+
A quantification consists of an operand optionally followed by a quantifier that specifies how many times it may be matched. An operand without a quantifier is matched once.
8080

8181
The quantifiers supported are:
8282

83-
- `?`: 0 or 1 matches
84-
- `*`: 0 or more matches
85-
- `+`: 1 or more matches
86-
- `{n,m}`: Between `n` and `m` (inclusive) matches
87-
- `{n,}`: `n` or more matches
88-
- `{,m}`: Up to `m` matches
89-
- `{n}`: Exactly `n` matches
83+
- `?`: 0 or 1 matches.
84+
- `*`: 0 or more matches.
85+
- `+`: 1 or more matches.
86+
- `{n,m}`: Between `n` and `m` (inclusive) matches.
87+
- `{n,}`: `n` or more matches.
88+
- `{,m}`: Up to `m` matches.
89+
- `{n}`: Exactly `n` matches.
9090

91-
A quantifier may optionally be followed by `?` or `+`, which adjust its semantics. If neither are specified, by default the quantification happens *eagerly*, meaning that it will try to maximize the number of matches made. However, if `?` is specified, quantification happens *reluctantly*, meaning that the number of matches will instead be minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional semantic that it may not be backtracked into to try a different number of matches.
91+
A quantifier may optionally be followed by `?` or `+`, which adjusts its semantics. If neither are specified, by default the quantification happens *eagerly*, meaning that it will try to maximize the number of matches made. However, if `?` is specified, quantification happens *reluctantly*, meaning that the number of matches will instead be minimized. If `+` is specified, *possessive* matching occurs, which is eager matching with the additional semantic that it may not be backtracked into to try a different number of matches.
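The difference between eager and reluctant quantification can be sketched with Python's `re` module. Python is used here only as a stand-in engine, an assumption made for illustration; it is not the matching engine this document specifies. Possessive quantifiers are left out because Python only accepts them from version 3.11 onwards.

```python
import re

text = "<a><b>"

# Eager (the default): `.+` consumes as much input as it can while still
# allowing the rest of the pattern to match.
eager = re.match(r"<.+>", text).group()

# Reluctant: `.+?` consumes as little input as it can.
reluctant = re.match(r"<.+?>", text).group()

print(eager)      # <a><b>
print(reluctant)  # <a>
```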
9292

9393
### Atom
9494

@@ -107,7 +107,7 @@ Atom -> Anchor
107107
| '\'? <Character>
108108
```
109109

110-
Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
110+
Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as metacharacters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A metacharacter may be treated as literal by preceding it with a backslash. Other literal characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
111111

112112
#### `\K`
113113

@@ -143,11 +143,7 @@ GroupNameBody -> Identifier | BalancingGroupBody
143143
Identifier -> [\w--\d] \w*
144144
```
145145

146-
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.
147-
148-
Groups may be named, the characters of which may be any letter or number characters or the character `_`. However the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
149-
150-
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
146+
Groups define a new scope that contains a recursive regular expression pattern. Groups have different semantics depending on how they are introduced, the details of which are laid out in the following sections.
151147

152148
Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
153149

@@ -157,18 +153,24 @@ Note there are additional constructs that may syntactically appear similar to gr
157153
- `(?:)`: A non-capturing group.
158154
- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
159155

156+
Capturing groups produce captures, which remember the range of input matched for the scope of that group.
157+
158+
A capturing group may be named using any of the `NamedGroup` syntax. The characters of the group name may be any letter or number characters or the character `_`. However the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
159+
160160
#### Atomic groups
161161

162-
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
162+
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
163163

164164
#### Lookahead and lookbehind
165165

166-
- `(?=`: A lookahead that attempts to match against the group body, but does not advance.
167-
- `(?!`: A negative lookahead that ensures the group body does not match, and does not advance.
168-
- `(?<=`: A lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
169-
- `(?!<`: A negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
166+
These groups evaluate the input ahead or behind the current matching position, without advancing the input.
170167

171-
These groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are however also non-atomic variants:
168+
- `(?=`: A lookahead, which matches against the input following the current matching position.
169+
- `(?!`: A negative lookahead, which ensures a negative match against the input following the current matching position.
170+
- `(?<=`: A lookbehind, which matches against the input prior to the current matching position.
171+
- `(?<!`: A negative lookbehind, which ensures a negative match against the input prior to the current matching position.
172+
173+
The above groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are however also non-atomic variants:
172174

173175
- `(?*`: A non-atomic lookahead.
174176
- `(?<*`: A non-atomic lookbehind.
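The four atomic lookaround forms can be sketched with Python's `re` module, used here as a stand-in engine (an assumption for illustration, not the engine this document describes). Python, like PCRE, spells the negative lookbehind `(?<!`. It does not support the non-atomic `(?*` and `(?<*` variants, so those are not shown. The sample string is made up.

```python
import re

prices = "cost: $40, weight: 40kg, discount: 15%"

# Lookbehind: digits immediately preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Lookahead: digits immediately followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", prices)

# Negative lookbehind: numbers that do not directly follow "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# Negative lookahead: numbers ending at a word boundary, not followed by "%".
not_before_percent = re.findall(r"\d+\b(?!%)", prices)

print(after_dollar)        # ['40']
print(before_kg)           # ['40']
print(not_after_dollar)    # ['40', '15']
print(not_before_percent)  # ['40']
```

None of the lookarounds consume input, so the surrounding `$`, `kg`, and `%` never appear in the results.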
@@ -187,6 +189,26 @@ BalancingGroupBody -> Identifier? '-' Identifier
187189

188190
Introduced by .NET, balancing groups extend the `GroupNameBody` syntax to support the ability to refer to a prior group. Upon matching, the prior group is deleted, and any intermediate matched input becomes the capture of the current group.
189191

192+
#### Group numbering
193+
194+
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
195+
196+
```
197+
(a((?:b)(?<c>c)d)(e)f)
198+
^ ^ ^ ^
199+
1 2 3 4
200+
```
201+
202+
Non-capturing groups are skipped over when counting.
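
The implicit numbering can be checked with a minimal sketch in Python's `re` module, used as a stand-in engine (an assumption for illustration). Python spells the named group as `(?P<c>c)` rather than `(?<c>c)`. It has no branch reset groups, so only the plain numbering is shown.

```python
import re

# Same shape as the example above, with the named group in Python's spelling.
pattern = re.compile(r"(a((?:b)(?P<c>c)d)(e)f)")
match = pattern.fullmatch("abcdef")

# Groups are numbered by the position of their opening parenthesis;
# the non-capturing (?:b) does not receive a number.
group_count = pattern.groups
captures = match.groups()
number_of_c = pattern.groupindex["c"]

print(group_count)  # 4
print(captures)     # ('abcdef', 'bcd', 'c', 'e')
print(number_of_c)  # 3
```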
203+
204+
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
205+
206+
```
207+
(a()(?|(b)(c)|(?:d)|(e)))(f)
208+
^ ^ ^ ^ ^ ^
209+
1 2 3 4 3 5
210+
```
211+
190212
### Matching options
191213

192214
```
@@ -199,14 +221,16 @@ MatchingOption -> 'i' | 'J' | 'm' | 'n' | 's' | 'U' | 'x' | 'xx' | 'w' | 'D' | '
199221

200222
A matching option sequence may be used as a group specifier, and denotes a change in matching options for the scope of that group. For example `(?x:a b c)` enables extended syntax for `a b c`. A matching option sequence may be part of an "isolated group" which has an implicit scope that wraps the remaining elements of the current group. For example, `(?x)a b c` also enables extended syntax for `a b c`.
201223

224+
If used in the branch of an alternation, an isolated group affects all the following branches of that alternation. For example, `a(?i)b|c|d` is treated as `a(?i:b)|(?i:c)|(?i:d)`.
225+
202226
We support all the matching options accepted by PCRE, ICU, and Oniguruma. In addition, we accept some matching options unique to our matching engine.
203227

204228
#### PCRE options
205229

206-
- `i`: Case insensitive matching
207-
- `J`: Allows multiple groups to share the same name, which is otherwise forbidden
208-
- `m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string
209-
- `n`: Disables capturing of `(...)` groups. Named capture groups must be used instead.
230+
- `i`: Case insensitive matching.
231+
- `J`: Allows multiple groups to share the same name, which is otherwise forbidden.
232+
- `m`: Enables `^` and `$` to match against the start and end of a line rather than only the start and end of the entire string.
233+
- `n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead.
210234
- `s`: Changes `.` to match any character, including newlines.
211235
- `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy.
212236
- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
@@ -217,10 +241,10 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add
217241

218242
#### Oniguruma options
219243
220-
- `D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`
221-
- `S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`
222-
- `W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`
223-
- `P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`)
244+
- `D`: Enables ASCII-only digit matching for `\d`, `\p{Digit}`, `[:digit:]`.
245+
- `S`: Enables ASCII-only space matching for `\s`, `\p{Space}`, `[:space:]`.
246+
- `W`: Enables ASCII-only word matching for `\w`, `\p{Word}`, `[:word:]`, and `\b`.
247+
- `P`: Enables ASCII-only for all POSIX properties (including `digit`, `space`, and `word`).
224248
- `y{g}`, `y{w}`: Changes the meaning of `\X`, `\y`, `\Y`. These are mutually exclusive options, with `y{g}` specifying extended grapheme cluster mode, and `y{w}` specifying word mode.
225249

226250
#### Swift options
@@ -266,19 +290,19 @@ HexDigit -> [0-9a-zA-Z]
266290
OctalDigit -> [0-7]
267291
```
268292

269-
These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.
293+
These sequences define a unicode scalar value to be matched against. There is syntax for specifying the scalar value in both hex notation and octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.
270294

271295
### Escape sequences
272296

273297
```
274298
EscapeSequence -> '\a' | '\b' | '\c' <Char> | '\e' | '\f' | '\n' | '\r' | '\t'
275299
```
276300

277-
These escape sequences denote a specific character.
301+
These escape sequences each denote a specific scalar value.
278302

279303
- `\a`: The alert (bell) character `U+7`.
280304
- `\b`: The backspace character `U+8`. Note this may only be used in a custom character class, otherwise it represents a word boundary.
281-
- `\c <Char>`: A control character sequence (`U+00` - `U+7F`).
305+
- `\c <Char>`: A control character sequence, which denotes a scalar from `U+00` - `U+7F` depending on the ASCII character provided.
282306
- `\e`: The escape character `U+1B`.
283307
- `\f`: The form-feed character `U+C`.
284308
- `\n`: The newline character `U+A`.
@@ -291,22 +315,22 @@ These escape sequences denote a specific character.
291315
BuiltinCharClass -> '.' | '\C' | '\d' | '\D' | '\h' | '\H' | '\N' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
292316
```
293317

294-
- `.`: Any character excluding newlines
295-
- `\C`: A single UTF code unit
296-
- `\d`: Digit character
297-
- `\D`: Non-digit character
298-
- `\h`: Horizontal space character
299-
- `\H`: Non-horizontal-space character
300-
- `\N`: Non-newline character
318+
- `.`: Any character excluding newlines.
319+
- `\C`: A single UTF code unit.
320+
- `\d`: Digit character.
321+
- `\D`: Non-digit character.
322+
- `\h`: Horizontal space character.
323+
- `\H`: Non-horizontal-space character.
324+
- `\N`: Non-newline character.
301325
- `\O`: Any character (including newlines). This is syntax from Oniguruma.
302-
- `\R`: Newline sequence
303-
- `\s`: Whitespace character
304-
- `\S`: Non-whitespace character
305-
- `\v`: Vertical space character
306-
- `\V`: Non-vertical-space character
307-
- `\w`: Word character
308-
- `\W`: Non-word character
309-
- `\X`: Any extended grapheme cluster
326+
- `\R`: Newline sequence.
327+
- `\s`: Whitespace character.
328+
- `\S`: Non-whitespace character.
329+
- `\v`: Vertical space character.
330+
- `\V`: Non-vertical-space character.
331+
- `\w`: Word character.
332+
- `\W`: Non-word character.
333+
- `\X`: Any extended grapheme cluster.
310334

311335
### Custom character classes
312336

@@ -322,26 +346,26 @@ SetOp -> '&&' | '--' | '~~' | '-'
322346

323347
Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
324348

325-
- Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
326-
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
327-
- Unicode scalars
328-
- Named scalars
329-
- Character properties
330-
- Plain literal characters
349+
- Builtin character classes, except for `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
350+
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary).
351+
- Unicode scalars.
352+
- Named scalars.
353+
- Character properties.
354+
- Plain literal characters.
331355

332356
Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and digit characters.
333357

334358
Custom character classes may not be empty, e.g `[]` is forbidden. A custom character class may begin with the `]` character, in which case it is treated as literal, e.g `[]a]` is the custom character class of `]` and `a`.
335359

336-
Quoted sequences may be used to escape the contained characters, e.g `[\Q]\E]` is the character class of the literal character `[`.
360+
Quoted sequences may be used to escape the contained characters, e.g `[a\Q]\E]` is also the character class of `]` and `a`.
337361

338362
Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` cannot be used to form a range, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`. `[a-c-d]` is the character class of `a`...`c`, `-`, and `d`.
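
The rules for a leading `]` and for a literal `-` can be sketched with Python's `re` module, which happens to treat these particular cases the same way. Python is a stand-in engine here (an assumption for illustration), and it supports neither `\Q...\E` nor the set operators. The `members` helper is hypothetical and exists only for this example.

```python
import re

def members(char_class: str, candidates: str = "abcd-]") -> str:
    """Return the candidate characters accepted by the given character class."""
    return "".join(ch for ch in candidates if re.fullmatch(char_class, ch))

leading_bracket = members(r"[]a]")     # `]` first is literal
literal_hyphens = members(r"[-a-]")    # `-` cannot form a range here
range_then_dash = members(r"[a-c-d]")  # range a...c, then `-`, then `d`

print(leading_bracket)  # a]
print(literal_hyphens)  # a-
print(range_then_dash)  # abcd-
```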
339363

340364
Operators may be used to apply set operations to character class members. The operators supported are:
341365

342-
- `&&`: Intersection of the LHS and RHS
343-
- `--`: Subtraction of the RHS from the LHS
344-
- `~~`: Symmetric difference of the RHS and LHS
366+
- `&&`: Intersection of the LHS and RHS.
367+
- `--`: Subtraction of the RHS from the LHS.
368+
- `~~`: Symmetric difference of the RHS and LHS.
345369
- `-`: .NET's spelling of subtracting the RHS from the LHS.
346370

347371
These operators have a lower precedence than the implicit union of members, e.g `[ac-d&&a[d]]` is an intersection of the character classes `[ac-d]` and `[ad]`.
@@ -436,26 +460,6 @@ RecursionLevel -> '+' <Int> | '-' <Int>
436460

437461
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
438462

439-
#### Group numbering
440-
441-
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
442-
443-
```
444-
(a((?:b)(?<c>c)d)(e)f)
445-
^ ^ ^ ^
446-
1 2 3 4
447-
```
448-
449-
Non-capturing groups are skipped over when counting.
450-
451-
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
452-
453-
```
454-
(a()(?|(b)(c)|(?:d)|(e)))(f)
455-
^ ^ ^ ^ ^ ^
456-
1 2 3 4 3 5
457-
```
458-
459463
#### Backreferences
460464

461465
```
@@ -824,7 +828,18 @@ We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**
824828

825829
### Conditional references
826830

827-
**TODO: Decide**
831+
```
832+
KnownCondition -> 'R'
833+
| 'R' NumberRef
834+
| 'R&' <String> !')'
835+
| '<' NameRef '>'
836+
| "'" NameRef "'"
837+
| 'DEFINE'
838+
| 'VERSION' VersionCheck
839+
| NumberRef
840+
```
841+
842+
For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. We intend on canonicalizing to `(?(<name>))` to match the group name canonicalization.
828843

829844
### PCRE Callouts
830845
