Documentation/Evolution/RegexSyntax.md
28 additions & 15 deletions
@@ -140,7 +140,7 @@ Identifier -> [\w--\d] \w*
140
140
141
141
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.
142
142
143
-
Groups may be named, the characters of which may be any letter or number characters (or `_`). However the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
143
+
Groups may be named. A name may contain any letter or number characters, or the character `_`, but it must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
144
144
145
145
Groups may be used to change the matching options present within their scope; see the *Matching options* section.
146
146
@@ -152,20 +152,23 @@ Note there are additional constructs that may syntactically appear similar to gr
152
152
-`(?:)`: A non-capturing group.
153
153
-`(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
154
154
155
-
#### Lookahead and lookbehind
155
+
#### Atomic groups
156
156
157
-
-`(?=` specifies a lookahead that attempts to match against the group body, but does not advance.
158
-
-`(?!` specifies a negative lookahead that ensures the group body does not match, and does not advance.
159
-
-`(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
160
-
-`(?!<` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
157
+
An atomic group, e.g. `(?>...)`, specifies that its contents should not be re-evaluated for backtracking. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
161
158
162
-
**TODO: Non-atomic variants**
159
+
#### Lookahead and lookbehind
163
160
164
-
PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_lookbehind:)`.
161
+
-`(?=`: A lookahead that attempts to match against the group body, but does not advance.
162
+
-`(?!`: A negative lookahead that ensures the group body does not match, and does not advance.
163
+
-`(?<=`: A lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
164
+
-`(?<!`: A negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
165
165
166
-
#### Atomic groups
166
+
These groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are, however, also non-atomic variants:
167
167
168
-
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
168
+
-`(?*`: A non-atomic lookahead.
169
+
-`(?<*`: A non-atomic lookbehind.
170
+
171
+
PCRE2 also defines explicitly spelled out versions of the above syntax, e.g. `(*non_atomic_positive_lookahead:)` and `(*negative_lookbehind:)`.
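
The zero-width behavior of lookarounds can be tried out with a minimal sketch in Python's `re` module. Python is used here only because it is widely available; it is not the engine this document specifies, and it does not support the non-atomic variants. The sample strings are made up for illustration.

```python
import re

# Lookahead: match digits only when followed by " USD"; the suffix is not consumed.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR")

# Negative lookbehind: match a number only when it is not preceded by "$".
plain_numbers = re.findall(r"(?<!\$)\b\d+", "$5 and 7")

# A lookaround does not advance: the match below is empty and ends at index 0.
zero_width = re.match(r"(?=ab)", "abc")

print(usd_amounts, plain_numbers, zero_width.span())
```

In each case the lookaround constrains where a match may occur without contributing any characters to the matched text.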
169
172
170
173
#### Script runs
171
174
@@ -430,17 +433,17 @@ A reference is an abstract identifier for a particular capturing group in a regu
430
433
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
431
434
432
435
```
433
-
(((?:a)(?<b>b)c)(d)e)
436
+
(a((?:b)(?<c>c)d)(e)f)
434
437
^ ^     ^        ^
435
438
1 2     3        4
436
439
```
437
440
438
441
Non-capturing groups are skipped over when counting.
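
Positional numbering can be checked against an engine that follows the same rule. The sketch below uses Python's `re` module purely as an illustration; Python spells named groups as `(?P<name>...)`, so `(?<c>c)` is rewritten accordingly, and the input string `abcdef` is made up for the example.

```python
import re

# Same shape as the example above, with the named group in Python's spelling.
pattern = re.compile(r"(a((?:b)(?P<c>c)d)(e)f)")
match = pattern.fullmatch("abcdef")

print(pattern.groups)            # number of capturing groups
print(match.groups())            # text captured by groups 1 through 4
print(dict(pattern.groupindex))  # name -> number mapping
```

The non-capturing `(?:b)` takes part in matching but receives no number, so the named group `c` is group 3.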
439
442
440
-
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child, for example:
443
+
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
441
444
442
445
```
443
-
(()(?|(a)(b)|(?:c)|(d)))()
446
+
(a()(?|(b)(c)|(?:d)|(e)))(f)
444
447
^ ^    ^  ^         ^    ^
445
448
1 2    3  4         3    5
446
449
```
@@ -503,7 +506,7 @@ A condition may be:
503
506
504
507
- A reference to a capture group, which checks whether the group matched successfully.
505
508
- A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
506
-
- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. (**TODO: Clarify whether it introduces captures**)
509
+
- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
507
510
- A PCRE version check.
508
511
509
512
The `DEFINE` keyword is not used as a condition, but rather a way in which to define a group which is not evaluated, but may be referenced by a subpattern.
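
The first kind of condition, a reference to a capture group, can be illustrated with a small sketch in Python's `re` module, which supports only this form of conditional (no recursion checks, `DEFINE`, or version checks). The pattern and sample strings are made up for illustration.

```python
import re

# (<)? optionally captures an opening bracket as group 1.
# (?(1)>) requires a closing ">" only if group 1 matched.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

for text in ["<name>", "name", "<name", "name>"]:
    print(text, bool(bracketed.fullmatch(text)))
```

The condition selects between a required `>` and an empty pattern based on whether group 1 took part in the match.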
@@ -723,7 +726,17 @@ This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additio
723
726
724
727
### Group numbering
725
728
726
-
**TODO: Discuss how .NET numbers its groups differently**
729
+
In PCRE, groups are numbered according to the position of their opening parenthesis. .NET also follows this rule, with the exception that named groups are numbered after unnamed groups. For example:
730
+
731
+
```
732
+
(a(?<x>x)b)(?<y>y)(z)
733
+
^ ^        ^      ^
733
+
1 3 4 2
735
+
```
736
+
737
+
The unnamed `(z)` group is numbered before any of the named groups.
738
+
739
+
We intend to match the PCRE behavior, where groups are numbered purely based on order.
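
The difference between the two schemes can be sketched in a few lines of Python. This is a rough illustration of the rule as stated above, not a description of how either engine is implemented: `number_groups` is a hypothetical helper, it recognizes only plain `(`, `(?<name>`, and `(?P<name>` as capturing, and it ignores character classes, branch reset groups, and any further .NET rules not covered here.

```python
import re

# Finds escapes (to skip them) and opening parentheses (to classify them).
# Simplification: a "(" inside a character class such as [(] is miscounted.
_OPEN = re.compile(r"\\.|\((\?<[A-Za-z_]\w*>|\?P<[A-Za-z_]\w*>|\?)?")

def number_groups(pattern):
    """Return (position_order, dotnet_order): one number per capturing
    group, listed in the order of the groups' opening parentheses."""
    kinds = []
    for m in _OPEN.finditer(pattern):
        if m.group(0).startswith("\\"):
            continue                      # escaped character, not a group
        suffix = m.group(1)
        if suffix is None:
            kinds.append("unnamed")       # plain "("
        elif suffix != "?":
            kinds.append("named")         # "(?<name>" or "(?P<name>"
        # a bare "(?" is a non-capturing group or other construct: skipped

    # PCRE-style: number purely by position of the opening parenthesis.
    position_order = list(range(1, len(kinds) + 1))

    # .NET-style, per the rule above: unnamed groups first, then named.
    next_unnamed, next_named = 1, kinds.count("unnamed") + 1
    dotnet_order = []
    for kind in kinds:
        if kind == "unnamed":
            dotnet_order.append(next_unnamed)
            next_unnamed += 1
        else:
            dotnet_order.append(next_named)
            next_named += 1
    return position_order, dotnet_order

print(number_groups("(a(?<x>x)b)(?<y>y)(z)"))
```

For the example pattern, the second list reproduces the .NET numbering shown above, while the first is the purely positional numbering this document intends to follow.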