Documentation/Evolution/RegexSyntax.md
Lines changed: 54 additions & 39 deletions
@@ -36,35 +36,47 @@ Note that there are minor syntactic incompatibilities and ambiguities involved i
36
36
37
37
## Detailed Design
38
38
39
-
We're proposing the following regular expression syntactic superset for Swift.
39
+
We propose the following syntax for use inside Swift regex literals.
40
+
41
+
*TODO:* Disclosure triangle explaining the grammar conventions?
40
42
41
43
### Top-level regular expression
42
44
43
45
```
44
46
Regex -> GlobalMatchingOptionSequence? RegexNode
45
47
RegexNode -> '' | Alternation
48
+
Alternation -> Concatenation ('|' Concatenation)*
49
+
Concatenation -> (!'|' !')' ConcatComponent)*
46
50
```
47
51
48
-
A top-level regular expression may consist of a sequence of global matching options followed by a `RegexNode`, which is the recursive part of the grammar that may be nested within e.g a group. A regex node may be empty, which is the null pattern that always matches, but does not advance the input.
49
-
50
-
### Alternation
52
+
A regex literal may be prefixed with a sequence of global matching options (*TODO*: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by `|`.
51
53
52
-
```
53
-
Alternation -> Concatenation ('|' Concatenation)*
54
-
```
54
+
Alternatives are a series of expressions concatenated together. The concatenation ends with either a `|` denoting the end of the alternative or a `)` denoting the end of a recursively parsed group.
55
55
56
-
The `|` operator denotes what is formally called an alternation, or a choice between alternatives. Any number of alternatives may appear, including empty alternatives. This operator has the lowest precedence of all operators in a regex literal.
56
+
Alternation has a lower precedence than concatenation or other operations, so e.g. `abc|def` matches against `abc` or `def`.
Implicitly denoted by adjacent expressions, a concatenation matches against a sequence of regular expression nodes. This has a higher precedence than an alternation, so e.g `abc|def` matches against `abc` or `def`. A concatenation may consist of potentially quantified expressions, trivia such as inline comments, and quoted sequences `\Q...\E`.
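The precedence rule above can be checked with a small sketch. It uses Python's `re` module as a stand-in, on the assumption that it agrees with the proposed Swift syntax for this particular construct. It is not the Swift implementation, and the helper name `matches_whole` is hypothetical.

```python
import re

# Assumption: Python's `re` is used only to illustrate the shared syntax.
# Alternation binds more loosely than concatenation, so this pattern
# parses as (abc)|(def) rather than ab(c|d)ef.
pattern = re.compile(r"abc|def")


def matches_whole(text):
    """Return True when the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(matches_whole("abc"))    # one alternative
print(matches_whole("def"))    # the other alternative
print(matches_whole("abdef"))  # would match only under the ab(c|d)ef reading
```

If alternation bound more tightly than concatenation, the last input would match. Under the grammar above it does not.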
73
+
Each component of a concatenation may be "trivia" (comments and non-semantic whitespace, if applicable), a quoted run of literal content, or a potentially-quantified subexpression.
74
+
75
+
Inline comments, like comments in C, are lexical and do not nest recursively the way normal groups do. A closing `)` cannot be escaped. Quotes are likewise lexical and non-nested, and the `\` before a `\E` cannot be escaped.
66
76
67
-
### Quantification
77
+
For example, `\Q^[xy]+$\E` is treated as the literal characters `^[xy]+$` rather than an anchored quantified character class. `\Q\\E` is a literal `\`.
78
+
79
+
### Quantified subexpressions
68
80
69
81
```
70
82
Quantification -> QuantOperand Quantifier?
@@ -453,36 +465,39 @@ The backslash character is also treated as literal within a quoted sequence, and
A reference is an abstract identifier for a particular capturing group in a regular expression. It can be either named or numbered, and in the latter case may be specified relative to the current group. For example, `-2` refers to the capture group `N - 2`, where `N` is the number of the next capture group. References may refer to groups ahead of the current position, e.g. `+3`, or to the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
462
475
476
+
A backreference may optionally include a recursion level in certain cases, which is a syntactic element inherited from Oniguruma that allows the reference to specify a capture relative to a given recursion level.
477
+
463
478
#### Backreferences
464
479
465
480
```
466
-
Backreference -> '\g{' NameOrNumberRef '}'
481
+
Backreference -> '\g{' NamedOrNumberRef '}'
467
482
| '\g' NumberRef
468
-
| '\k<' NameOrNumberRef '>'
469
-
| "\k'" NameOrNumberRef "'"
470
-
| '\k{' Identifier '}'
483
+
| '\k<' NamedOrNumberRef '>'
484
+
| "\k'" NamedOrNumberRef "'"
485
+
| '\k{' NamedRef '}'
471
486
| '\' [1-9] [0-9]+
472
-
| '(?P=' Identifier ')'
487
+
| '(?P=' NamedRef ')'
473
488
```
474
489
475
490
A backreference evaluates to the value last captured by the referenced capturing group. If the referenced capture has not been evaluated yet, the match fails.
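The behaviour described above can be sketched as follows. Python's `re` module is again used as a stand-in, because it accepts two of the spellings listed in the grammar (`\1` and `(?P=name)`). The other spellings (`\g{...}`, `\k<...>`, `\k'...'`, `\k{...}`) are not shown, since `re` does not accept them. The patterns and the helper `is_match` are hypothetical illustrations, not part of the proposal.

```python
import re

# Assumption: Python's `re` stands in for the proposed Swift syntax here.
numbered = re.compile(r"(\w+) \1")               # '\' [1-9] ... form
named = re.compile(r"(?P<word>\w+) (?P=word)")   # '(?P=' NamedRef ')' form

# Group 1 is optional; when it does not participate in the match,
# the backreference has no captured value to compare against.
unset_reference = re.compile(r"(a)?b\1")


def is_match(pattern, text):
    """Return True when the entire string matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


print(is_match(numbered, "hello hello"))
print(is_match(named, "hello world"))
print(is_match(unset_reference, "aba"))
print(is_match(unset_reference, "b"))
```

The last case corresponds to a referenced capture that was never evaluated: in Python's engine the match fails, which is the behaviour the text describes.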
@@ -806,26 +821,26 @@ We intend on canonicalizing to the short-form versions of these group kinds, e.g
806
821
### Backreferences
807
822
808
823
```
809
-
Backreference -> '\g{' NameOrNumberRef '}'
824
+
Backreference -> '\g{' NamedOrNumberRef '}'
810
825
| '\g' NumberRef
811
-
| '\k<' NameOrNumberRef '>'
812
-
| "\k'" NameOrNumberRef "'"
813
-
| '\k{' Identifier '}'
826
+
| '\k<' NamedOrNumberRef '>'
827
+
| "\k'" NamedOrNumberRef "'"
828
+
| '\k{' NamedRef '}'
814
829
| '\' [1-9] [0-9]+
815
-
| '(?P=' Identifier ')'
830
+
| '(?P=' NamedRef ')'
816
831
```
817
832
818
833
For absolute numeric references, we plan to choose the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend to canonicalize to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
819
834
820
835
### Subpatterns
821
836
822
837
```
823
-
Subpattern -> '\g<' NameOrNumberRef '>'
824
-
| "\g'" NameOrNumberRef "'"
838
+
Subpattern -> '\g<' NamedOrNumberRef '>'
839
+
| "\g'" NamedOrNumberRef "'"
825
840
| '(?' GroupLikeSubpatternBody ')'
826
841
827
-
GroupLikeSubpatternBody -> 'P>' <String>
828
-
| '&' <String>
842
+
GroupLikeSubpatternBody -> 'P>' NamedRef
843
+
| '&' NamedRef
829
844
| 'R'
830
845
| NumberRef
831
846
```
@@ -837,9 +852,9 @@ We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**