Commit 8dc2b4b

Update RegexSyntax.md
1 parent b7dfa1e commit 8dc2b4b

1 file changed

+28
-15
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 28 additions & 15 deletions
@@ -140,7 +140,7 @@ Identifier -> [\w--\d] \w*
140140

141141
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.
142142

143-
Groups may be named, the characters of which may be any letter or number characters (or `_`). However the name must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities when it comes to named and numeric group references.
143+
Groups may be named. A name may contain any letter or number characters, or the character `_`, but it must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguity between named and numeric group references.
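As a rough illustration of the naming rule, the check below approximates the `Identifier` production in Python. The function name `is_valid_group_name` is hypothetical, and `[^\W\d]` is an assumed stand-in for `[\w--\d]`, since Python's `re` module has no set subtraction.

```python
import re

# Assumed equivalent of `Identifier -> [\w--\d] \w*`:
# [^\W\d] matches any word character that is not a digit.
_GROUP_NAME = re.compile(r"[^\W\d]\w*")

def is_valid_group_name(name):
    """Return True if `name` is made of word characters and does not start with a digit."""
    return _GROUP_NAME.fullmatch(name) is not None
```

Under this sketch, `is_valid_group_name("_x1")` is accepted while `is_valid_group_name("1abc")` is rejected.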
144144

145145
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
146146

@@ -152,20 +152,23 @@ Note there are additional constructs that may syntactically appear similar to gr
152152
- `(?:)`: A non-capturing group.
153153
- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
154154

155-
#### Lookahead and lookbehind
155+
#### Atomic groups
156156

157-
- `(?=` specifies a lookahead that attempts to match against the group body, but does not advance.
158-
- `(?!` specifies a negative lookahead that ensures the group body does not match, and does not advance.
159-
- `(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
160-
- `(?!<` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
157+
An atomic group, e.g. `(?>...)`, specifies that its contents should not be re-evaluated for backtracking. This has the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
161158

162-
**TODO: Non-atomic variants**
159+
#### Lookahead and lookbehind
163160

164-
PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_lookbehind:)`.
161+
- `(?=`: A lookahead that attempts to match against the group body, but does not advance.
162+
- `(?!`: A negative lookahead that ensures the group body does not match, and does not advance.
163+
- `(?<=`: A lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
164+
- `(?<!`: A negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
165165

166-
#### Atomic groups
166+
These groups are all atomic, meaning that they will not be re-evaluated for backtracking. There are, however, also non-atomic variants:
167167

168-
An atomic group e.g `(?>...)` specifies that its contents should not be re-evaluated for backtracking. This the same semantics as a possessive quantifier, but applies more generally to any regex pattern.
168+
- `(?*`: A non-atomic lookahead.
169+
- `(?<*`: A non-atomic lookbehind.
170+
171+
PCRE2 also defines explicitly spelled out versions of the above syntax, e.g. `(*non_atomic_positive_lookahead:)` and `(*negative_lookbehind:)`.
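The "does not advance" behavior of the four lookaround forms can be sketched with Python's `re` module, which uses the same short spellings. This is only an illustration in a different engine, not the proposed implementation; the sample strings are made up, and Python has no non-atomic variants, so those are not shown.

```python
import re

text = "cost $42, qty 7"

# Positive lookahead: asserts "bar" follows, but the match ends after "foo".
ahead = re.match(r"foo(?=bar)", "foobar")

# Negative lookahead: a "q" that is not followed by "u".
not_followed = re.findall(r"q(?!u)", "qatar quit")

# Positive lookbehind: digits directly preceded by "$".
priced = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers not directly preceded by "$".
unpriced = re.findall(r"(?<!\$)\b\d+", text)
```

Here `ahead.end()` is 3 rather than 6, because the lookahead body is checked without being consumed.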
169172

170173
#### Script runs
171174

@@ -430,17 +433,17 @@ A reference is an abstract identifier for a particular capturing group in a regu
430433
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
431434

432435
```
433-
( ((?:a)(?<b>b)c)(d)e)
436+
(a((?:b)(?<c>c)d)(e)f)
434437
^ ^     ^        ^
435438
1 2     3        4
436439
```
437440

438441
Non-capturing groups are skipped over when counting.
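The same left-to-right numbering can be observed in Python's `re` module, used here only as a convenient stand-in engine. Note the assumption that the named group is spelled `(?P<c>c)`, which is Python's syntax for `(?<c>c)`.

```python
import re

# Same shape as the example above, written with Python's (?P<name>...) spelling.
pattern = re.compile(r"(a((?:b)(?P<c>c)d)(e)f)")
m = pattern.fullmatch("abcdef")

# Map each group number to the text it captured.
numbered = {n: m.group(n) for n in range(1, pattern.groups + 1)}

# The named group takes its number from its position, like any other group.
c_number = pattern.groupindex["c"]
```

The non-capturing `(?:b)` contributes no number, so the named group `c` ends up as group 3.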
439442

440-
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child, for example:
443+
Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. Outside the alternation, numbering resumes at the next available number not used in one of the branches. For example:
441444

442445
```
443-
( ()(?|(a)(b)|(?:c)|(d)))()
446+
(a()(?|(b)(c)|(?:d)|(e)))(f)
444447
^ ^    ^  ^         ^    ^
445448
1 2    3  4         3    5
446449
```
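The branch reset rule can be sketched as a small scanner that assigns numbers to opening parentheses. This is a simplified illustration, not the actual parser: the function name `number_groups` is hypothetical, and it assumes the pattern contains only plain groups, `(?:`, `(?<name>`, and `(?|`, with no character classes or other `(?` constructs.

```python
def number_groups(pattern):
    """Return (offset, number) pairs for each capturing group, left to right."""
    result = []
    next_num = 1
    # One stack entry per open group: None for an ordinary group, or
    # [first_number, highest_next_number] for a branch reset group.
    stack = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip an escaped character
            continue
        if ch == "(":
            if pattern.startswith("(?|", i):
                stack.append([next_num, next_num])
                i += 3
                continue
            if pattern.startswith("(?:", i):
                stack.append(None)
                i += 3
                continue
            # Plain "(" or named "(?<name>": a capturing group.
            result.append((i, next_num))
            next_num += 1
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            # Direct child alternation of a branch reset group:
            # remember the high-water mark, then restart numbering.
            frame = stack[-1]
            frame[1] = max(frame[1], next_num)
            next_num = frame[0]
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame is not None:
                # Resume after the highest number used in any branch.
                next_num = max(frame[1], next_num)
        i += 1
    return result
```

For `(a()(?|(b)(c)|(?:d)|(e)))(f)` this yields the numbers 1, 2, 3, 4, 3, 5, matching the diagram above.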
@@ -503,7 +506,7 @@ A condition may be:
503506

504507
- A reference to a capture group, which checks whether the group matched successfully.
505508
- A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
506-
- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. (**TODO: Clarify whether it introduces captures**)
509+
- An arbitrary recursive regular expression, which is matched against the input and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
507510
- A PCRE version check.
508511

509512
The `DEFINE` keyword is not used as a condition, but rather as a way to define a group that is not evaluated, but may be referenced by a subpattern.
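Of the condition kinds listed above, the capture group reference is the simplest to try out. The sketch below uses Python's `re` module, which supports only that kind of condition, via `(?(1)yes|no)`; the pattern and the name `tag` are made up for illustration.

```python
import re

# (?(1)>) requires a closing ">" only if group 1 (the opening "<") matched.
tag = re.compile(r"(<)?\w+(?(1)>)")

bracketed = tag.fullmatch("<a>") is not None
bare = tag.fullmatch("a") is not None
unclosed = tag.fullmatch("<a") is not None
unopened = tag.fullmatch("a>") is not None
```

Only the first two inputs match in full: the condition demands `>` exactly when `<` was captured.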
@@ -723,7 +726,17 @@ This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additio
723726

724727
### Group numbering
725728

726-
**TODO: Discuss how .NET numbers its groups differently**
729+
In PCRE, groups are numbered according to the position of their opening parenthesis. .NET also follows this rule, with the exception that named groups are numbered after unnamed groups. For example:
730+
731+
```
732+
(a(?<x>x)b)(?<y>y)(z)
733+
^ ^        ^      ^
734+
1 3        4      2
735+
```
736+
737+
The `(z)` group gets numbered before the named groups get numbered.
738+
739+
We intend to match the PCRE behavior, where groups are numbered purely based on order.
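To make the difference concrete, the following sketch computes the .NET-style numbering described above: unnamed groups are numbered first, then named groups. The function name `dotnet_style_numbers` is hypothetical, and the scanner assumes every `(?<` opens a named group (so lookbehinds and character classes are not handled).

```python
def dotnet_style_numbers(pattern):
    """Return (offset, number) pairs: unnamed groups first, then named groups."""
    unnamed, named = [], []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2  # skip an escaped character
            continue
        if pattern[i] == "(":
            if pattern.startswith("(?<", i):
                named.append(i)
            elif not pattern.startswith("(?", i):
                unnamed.append(i)
        i += 1
    # Number all unnamed groups in order, then continue with the named ones.
    numbers = {offset: n for n, offset in enumerate(unnamed + named, start=1)}
    return sorted(numbers.items())
```

For `(a(?<x>x)b)(?<y>y)(z)` the groups, read left to right, receive the numbers 1, 3, 4, 2, whereas purely positional numbering would give 1, 2, 3, 4.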
727740

728741
## Canonical representations
729742
