Commit 1b12385

Update RegexSyntax.md
1 parent 31c5cf5 commit 1b12385

1 file changed: +54 -39 lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 54 additions & 39 deletions
Original file line number | Diff line number | Diff line change
@@ -36,35 +36,47 @@ Note that there are minor syntactic incompatibilities and ambiguities involved i
3636

3737
## Detailed Design
3838

39-
We're proposing the following regular expression syntactic superset for Swift.
39+
We propose the following syntax for use inside Swift regex literals.
40+
41+
*TODO:* Disclosure triangle explaining the grammar conventions?
4042

4143
### Top-level regular expression
4244

4345
```
4446
Regex -> GlobalMatchingOptionSequence? RegexNode
4547
RegexNode -> '' | Alternation
48+
Alternation -> Concatenation ('|' Concatenation)*
49+
Concatenation -> (!'|' !')' ConcatComponent)*
4650
```
4751

48-
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group. A regex node may be empty, which is the null pattern that always matches, but does not advance the input.
49-
50-
### Alternation
52+
A regex literal may be prefixed with a sequence of global matching options (*TODO*: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by `|`.
5153

52-
```
53-
Alternation -> Concatenation ('|' Concatenation)*
54-
```
54+
Alternatives are a series of expressions concatenated together. The concatenation ends with either a `|` denoting the end of the alternative or a `)` denoting the end of a recursively parsed group.
5555

56-
The `|` operator denotes what is formally called an alternation, or a choice between alternatives. Any number of alternatives may appear, including empty alternatives. This operator has the lowest precedence of all operators in a regex literal.
56+
Alternation has a lower precedence than concatenation or other operations, so e.g. `abc|def` matches either `abc` or `def`.
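The precedence rule can be checked with a short sketch. It uses Python's `re` module purely as a stand-in engine (an assumption made for illustration; this document does not define an executable API), since Python also gives `|` the lowest precedence.

```python
import re

# Alternation binds more loosely than concatenation, so this pattern
# reads as (abc)|(def), not as ab(c|d)ef.
pattern = re.compile(r"abc|def")


def matches_whole(text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return pattern.fullmatch(text) is not None


# "abdef" and "abcef" would match only under the ab(c|d)ef reading.
results = {text: matches_whole(text) for text in ["abc", "def", "abdef", "abcef"]}
```

Only `abc` and `def` match in full; the two strings that the `ab(c|d)ef` reading would accept are rejected.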
5757

58-
### Concatenation
58+
### Concatenated subexpressions
5959

6060
```
61-
Concatenation -> (!'|' !')' ConcatComponent)*
6261
ConcatComponent -> Trivia | Quote | Quantification
62+
63+
Trivia -> Comment | NonSemanticWhitespace
64+
Comment -> '(?#' (!')')* ')' | EndOfLineComment
65+
66+
(extended syntax only) EndOfLineComment -> '#' .*$
67+
(extended syntax only) NonSemanticWhitespace -> \s+
68+
69+
Quote -> '\Q' (!'\E' .)* '\E'
70+
6371
```
6472

65-
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression nodes. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. A concatenation may consist of potentially quantified expressions, trivia such as inline comments, and quoted sequences `\Q...\E`.
73+
Each component of a concatenation may be "trivia" (comments and non-semantic whitespace, if applicable), a quoted run of literal content, or a potentially-quantified subexpression.
74+
75+
Inline comments, like comments in C, are lexical: they are not recursively nested the way normal groups are, and a closing `)` cannot be escaped. Quotes are similarly lexical and non-nested, and the `\` before a `\E` cannot be escaped.
6676

67-
### Quantification
77+
For example, `\Q^[xy]+$\E` is treated as the literal characters `^[xy]+$` rather than an anchored quantified character class. `\Q\\E` is a literal `\`.
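The lexical behavior of comments and quotes can be sketched as a pair of scanners that simply search for the first terminator. The function names `read_comment` and `read_quote` are hypothetical, and treating a missing terminator as an error is an assumption drawn from the grammar rules shown in this hunk, not a statement about the actual parser.

```python
def read_comment(pattern: str, start: int = 0):
    """Scan '(?#' (!')')* ')' at `start`; return (body, next_index) or None."""
    if not pattern.startswith("(?#", start):
        return None
    body_start = start + 3
    # The first ')' always closes the comment: there is no escaping or nesting.
    end = pattern.find(")", body_start)
    if end == -1:
        raise ValueError("unterminated comment")  # assumption: treated as an error
    return pattern[body_start:end], end + 1


def read_quote(pattern: str, start: int = 0):
    """Scan '\\Q' (!'\\E' .)* '\\E' at `start`; return (literal, next_index) or None."""
    if not pattern.startswith("\\Q", start):
        return None
    body_start = start + 2
    # The first '\E' always closes the quote, even if preceded by a backslash.
    end = pattern.find("\\E", body_start)
    if end == -1:
        raise ValueError("unterminated quote")  # assumption: treated as an error
    return pattern[body_start:end], end + 2
```

With these definitions, `read_quote(r"\Q^[xy]+$\E")` yields the literal `^[xy]+$`, and `read_quote(r"\Q\\E")` yields a single backslash, because the second backslash begins the closing `\E`.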
78+
79+
### Quantified subexpressions
6880

6981
```
7082
Quantification -> QuantOperand Quantifier?
@@ -453,36 +465,39 @@ The backslash character is also treated as literal within a quoted sequence, and
453465
### References
454466

455467
```
456-
NamedRef -> Identifier
457-
NumberRef -> ('+' | '-')? <Decimal Number> RecursionLevel?
458-
RecursionLevel -> '+' <Int> | '-' <Int>
468+
NamedOrNumberRef -> NamedRef | NumberRef
469+
NamedRef -> Identifier RecursionLevel?
470+
NumberRef -> ('+' | '-')? <Decimal Number> RecursionLevel?
471+
RecursionLevel -> '+' <Int> | '-' <Int>
459472
```
460473

461474
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
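As a rough illustration of relative numbering, the sketch below resolves a numbered reference to an absolute group number. The helper name `resolve_group_number` is hypothetical. The `-` case follows the `N - 2` rule stated above; the `+` case assumes the PCRE convention that `+1` names the next group to be opened, which this text does not spell out.

```python
def resolve_group_number(ref: str, next_group: int) -> int:
    """Resolve a NumberRef such as '4', '-2' or '+3' to an absolute group number.

    `next_group` is N, the number the next capture group would receive.
    """
    if ref.startswith("-"):
        # Stated rule: '-2' refers to group N - 2.
        target = next_group - int(ref[1:])
    elif ref.startswith("+"):
        # Assumption (PCRE convention): '+1' names the next group, i.e. N itself.
        target = next_group + int(ref[1:]) - 1
    else:
        target = int(ref)
    if target < 1:
        raise ValueError(f"reference {ref!r} points before the first group")
    return target
```

For example, with four groups already opened (`next_group == 5`), `-2` resolves to group 3 and, under the stated assumption, `+3` resolves to group 7.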
462475

476+
A backreference may optionally include a recursion level in certain cases, which is a syntactic element inherited from Oniguruma that allows the reference to specify a capture relative to a given recursion level.
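A minimal sketch of how the `NamedOrNumberRef` rule with an optional recursion level might be parsed follows. The `Reference` type and `parse_reference` function are hypothetical, and the identifier shape used here (a letter or underscore followed by word characters) is an assumption, since `Identifier` is not defined in this hunk.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    name: Optional[str]    # set for a NamedRef
    number: Optional[str]  # set for a NumberRef; keeps any sign, e.g. "-2"
    level: Optional[int]   # recursion level, if one was written


# NamedRef  -> Identifier RecursionLevel?
# NumberRef -> ('+' | '-')? <Decimal Number> RecursionLevel?
_REF = re.compile(
    r"(?:(?P<name>[A-Za-z_]\w*)|(?P<number>[+-]?\d+))(?P<level>[+-]\d+)?"
)


def parse_reference(text: str) -> Reference:
    """Parse the text of a reference, e.g. 'name', '-2', 'name+1' or '1-2'."""
    match = _REF.fullmatch(text)
    if match is None:
        raise ValueError(f"not a reference: {text!r}")
    level = match.group("level")
    return Reference(
        name=match.group("name"),
        number=match.group("number"),
        level=int(level) if level is not None else None,
    )
```

Under these assumptions, `name+1` parses as the named reference `name` at recursion level 1, while `-2` parses as a relative numbered reference with no recursion level.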
477+
463478
#### Backreferences
464479

465480
```
466-
Backreference -> '\g{' NameOrNumberRef '}'
481+
Backreference -> '\g{' NamedOrNumberRef '}'
467482
| '\g' NumberRef
468-
| '\k<' NameOrNumberRef '>'
469-
| "\k'" NameOrNumberRef "'"
470-
| '\k{' Identifier '}'
483+
| '\k<' NamedOrNumberRef '>'
484+
| "\k'" NamedOrNumberRef "'"
485+
| '\k{' NamedRef '}'
471486
| '\' [1-9] [0-9]+
472-
| '(?P=' Identifier ')'
487+
| '(?P=' NamedRef ')'
473488
```
474489

475490
A backreference evaluates to the value last captured by the referenced capturing group. If the referenced capture has not been evaluated yet, the match fails.
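Both behaviors can be observed in an existing engine. The sketch below uses Python's `re` module as a stand-in (an assumption for illustration only); it supports the `(?P=name)` spelling from the grammar above and also fails the match when the referenced group has not captured anything.

```python
import re

# (?P<quote>...) names the group; (?P=quote) is the '(?P=' NamedRef ')' spelling.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# Group 1 only participates in the first alternative, so \1 has nothing
# to refer to when the second alternative is taken.
unset = re.compile(r"(?:(a)|b)\1")

same_quote = quoted.fullmatch("'hi'") is not None   # closing quote equals capture
mixed_quote = quoted.fullmatch("'hi\"") is not None  # closing quote differs
set_group = unset.fullmatch("aa") is not None        # group 1 captured "a"
unset_group = unset.fullmatch("b") is not None       # group 1 never captured
```

The backreference matches only the text that the group actually captured, and the match fails outright when the group never captured.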
476491

477492
#### Subpatterns
478493

479494
```
480-
Subpattern -> '\g<' NameOrNumberRef '>'
481-
| "\g'" NameOrNumberRef "'"
495+
Subpattern -> '\g<' NamedOrNumberRef '>'
496+
| "\g'" NamedOrNumberRef "'"
482497
| '(?' GroupLikeSubpatternBody ')'
483498
484-
GroupLikeSubpatternBody -> 'P>' <String>
485-
| '&' <String>
499+
GroupLikeSubpatternBody -> 'P>' NamedRef
500+
| '&' NamedRef
486501
| 'R'
487502
| NumberRef
488503
```
@@ -500,9 +515,9 @@ GroupConditionalStart -> '(?' GroupStart
500515
501516
KnownCondition -> 'R'
502517
| 'R' NumberRef
503-
| 'R&' <String> !')'
504-
| '<' NameRef '>'
505-
| "'" NameRef "'"
518+
| 'R&' NamedRef
519+
| '<' NamedOrNumberRef '>'
520+
| "'" NamedOrNumberRef "'"
506521
| 'DEFINE'
507522
| 'VERSION' VersionCheck
508523
| NumberRef
@@ -806,26 +821,26 @@ We intend on canonicalizing to the short-form versions of these group kinds, e.g
806821
### Backreferences
807822

808823
```
809-
Backreference -> '\g{' NameOrNumberRef '}'
824+
Backreference -> '\g{' NamedOrNumberRef '}'
810825
| '\g' NumberRef
811-
| '\k<' NameOrNumberRef '>'
812-
| "\k'" NameOrNumberRef "'"
813-
| '\k{' Identifier '}'
826+
| '\k<' NamedOrNumberRef '>'
827+
| "\k'" NamedOrNumberRef "'"
828+
| '\k{' NamedRef '}'
814829
| '\' [1-9] [0-9]+
815-
| '(?P=' Identifier ')'
830+
| '(?P=' NamedRef ')'
816831
```
817832

818833
For absolute numeric references, we plan on choosing the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend on canonicalizing to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
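The canonicalization plan described above could be sketched as follows. The helper `canonical_backreference` is hypothetical and takes the reference text with its delimiters already removed; it reflects the stated plan only, and the open TODO may change the outcome.

```python
def canonical_backreference(ref: str) -> str:
    """Return the planned canonical spelling for a backreference.

    `ref` is the bare reference text, e.g. '3', '-2' or 'word'.
    """
    if ref.isascii() and ref.isdigit():
        # Absolute numeric reference: spelled \DDD.
        return "\\" + ref
    # Relative numbered references and named references: spelled \k<...>.
    return "\\k<" + ref + ">"
```

So `3` would be spelled `\3`, while `-2` and `word` would be spelled `\k<-2>` and `\k<word>`.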
819834

820835
### Subpatterns
821836

822837
```
823-
Subpattern -> '\g<' NameOrNumberRef '>'
824-
| "\g'" NameOrNumberRef "'"
838+
Subpattern -> '\g<' NamedOrNumberRef '>'
839+
| "\g'" NamedOrNumberRef "'"
825840
| '(?' GroupLikeSubpatternBody ')'
826841
827-
GroupLikeSubpatternBody -> 'P>' <String>
828-
| '&' <String>
842+
GroupLikeSubpatternBody -> 'P>' NamedRef
843+
| '&' NamedRef
829844
| 'R'
830845
| NumberRef
831846
```
@@ -837,9 +852,9 @@ We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**
837852
```
838853
KnownCondition -> 'R'
839854
| 'R' NumberRef
840-
| 'R&' <String> !')'
841-
| '<' NameRef '>'
842-
| "'" NameRef "'"
855+
| 'R&' NamedRef
856+
| '<' NamedOrNumberRef '>'
857+
| "'" NamedOrNumberRef "'"
843858
| 'DEFINE'
844859
| 'VERSION' VersionCheck
845860
| NumberRef
