Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.

Groups may be named. A name may consist of letter or number characters (or `_`), but must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguities between named and numeric group references.

Groups may be used to change the matching options present within their scope; see the *Matching options* section.

Note that there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
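
The naming rule for groups described above can be sketched as a small check. This is only an illustration in Python, not the parser's implementation: the helper name `is_valid_group_name` is hypothetical, and it assumes that an empty name is rejected and that a leading `_` is allowed (the text only rules out a leading number).

```python
def is_valid_group_name(name: str) -> bool:
    """Return True if `name` follows the naming rule sketched above.

    Assumptions (not stated in the text): the empty name is rejected,
    and `_` is accepted as a first character.
    """
    if not name:
        return False
    # The first character must be a letter or `_`, never a number.
    first_ok = name[0].isalpha() or name[0] == "_"
    # Every character must be a letter, a number, or `_`.
    rest_ok = all(c.isalnum() or c == "_" for c in name)
    return first_ok and rest_ok
```

Under these assumptions `year` and `_tmp1` are accepted, while `1st` and `a-b` are rejected.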

#### Basic group kinds

- `()`: A capturing group.
- `(?:)`: A non-capturing group.
- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
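
The difference between the first two kinds can be observed in an existing engine. The sketch below uses Python's `re` module purely as a stand-in; it is not the engine this document describes, and to our knowledge `re` does not accept the branch reset form `(?|)`, so only `()` and `(?:)` are shown.

```python
import re

# `(a)` and `(c)` capture; `(?:b)` groups without capturing.
m = re.fullmatch(r"(a)(?:b)(c)", "abc")

# Only the two capturing groups are recorded in the result.
captures = m.groups()
group_count = m.re.groups
```

Here `captures` holds the text matched by `(a)` and `(c)` only; the non-capturing group takes part in matching but leaves no capture behind.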

#### Lookahead and lookbehind
@@ -154,6 +159,8 @@ Note there are additional constructs that may syntactically appear similar to gr
- `(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
- `(?<!` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
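
As an illustration of "does not advance the input", the following sketch uses Python's `re` module, which happens to share the `(?<=` and `(?<!` spellings. It is an example from a different engine, not a statement about this parser's behavior.

```python
import re

# Positive lookbehind: the digits must be preceded by `$`, but the `$`
# itself is not part of the match.
m = re.search(r"(?<=\$)\d+", "cost: $42")
positive = (m.group(), m.span())

# Negative lookbehind: only numbers NOT preceded by `$` are matched.
negative = re.findall(r"(?<!\$)\b\d+", "$42 and 17")
```

The positive match starts at the first digit rather than at the `$`, which shows that the lookbehind inspected the preceding input without consuming it.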

**TODO: Non-atomic variants**

PCRE2 defines explicitly spelled out versions of the syntax, e.g. `(*negative_lookbehind:)`.
A character property specifies a particular Unicode or POSIX property to match against. We intend to parse:

- The full range of Unicode character properties.
- The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
- The UTS#18 special properties `any`, `assigned`, `ascii`.

We intend to follow [UTS#18][uts18]'s guidance for character properties. This includes fuzzy matching of property names, done according to the rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:

- `whitespace`
- `isWhitespace`
- `is-White_Space`
- `iSwHiTeSpaCe`
- `i s w h i t e s p a c e`
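
A minimal sketch of this kind of loose matching, based on our reading of UAX44-LM3 (ignore case, whitespace, underscores, hyphens, and an initial `is` prefix). The function name `loose_property_key` is hypothetical, and the sketch deliberately ignores the special cases where stripping `is` would collide with a real property name or alias.

```python
import re

def loose_property_key(name: str) -> str:
    """Fold a property name into a comparison key (simplified UAX44-LM3)."""
    # Drop whitespace, underscores, and hyphens, then ignore case.
    folded = re.sub(r"[\s_-]", "", name).lower()
    # Ignore an initial "is" prefix.
    if folded.startswith("is"):
        folded = folded[2:]
    return folded

names = [
    "whitespace",
    "isWhitespace",
    "is-White_Space",
    "iSwHiTeSpaCe",
    "i s w h i t e s p a c e",
]
keys = {loose_property_key(n) for n in names}
```

All five spellings listed above fold to the same key, so `keys` contains a single element.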

Unicode properties consist of both a key and a value, e.g. `General_Category=Whitespace`. Each component follows the fuzzy matching rule, and may additionally have an alternative alias spelling, as defined by Unicode in [PropertyAliases.txt][unicode-prop-key-aliases] and [PropertyValueAliases.txt][unicode-prop-value-aliases].

There are some Unicode properties where the key or value may be inferred. These include:

- General category properties, e.g. `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
- Script properties, e.g. `\p{Greek}` is inferred as `\p{Script=Greek}`.
@@ -361,8 +376,6 @@ For non-Unicode properties, only a value is required. These include:
- The special properties `any`, `assigned`, `ascii`.
- The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.

Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]` as well as `\p{alnum}`.
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example, `-2` refers to the capture group `N - 2`, where `N` is the number of the next capture group. References may refer to groups ahead of the current position, e.g. `+3` or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.

#### Group numbering

Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:

```
( ((?:a)(?<b>b)c)(d)e)
^ ^     ^        ^
1 2     3        4
```

Non-capturing groups are skipped over when counting.
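
The same numbering can be observed in an existing engine. The sketch below uses Python's `re` module as a stand-in, with two adjustments that are ours rather than the document's: the named group is spelled `(?P<b>...)`, since `re` does not accept `(?<b>...)`, and the space after the first `(` is dropped.

```python
import re

pattern = re.compile(r"(((?:a)(?P<b>b)c)(d)e)")
m = pattern.fullmatch("abcde")

# Map each group number to the text it captured.
numbered = {n: m.group(n) for n in range(1, pattern.groups + 1)}

# Named groups are numbered too; `b` is the third opening capturing `(`.
named = dict(pattern.groupindex)
```

The non-capturing `(?:a)` does not receive a number, so the named group `b` ends up as group 3 and `(d)` as group 4.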

Branch reset groups can alter this numbering, as they reset the numbering in the branches of an alternation child. For example:

```
( ()(?|(a)(b)|(?:c)|(d)))()
^ ^    ^  ^         ^    ^
1 2    3  4         3    5
```
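
One way to read the rule illustrated above is as a single left-to-right pass that keeps a counter and, for each branch reset group, remembers the number it started at and the highest number any branch reached. The sketch below is a hypothetical helper (`number_groups`), not the actual parser: it handles only `(`, `(?:`, `(?<name>`, `(?|`, `|`, `)`, and backslash escapes, and it ignores custom character classes and every other construct.

```python
def number_groups(pattern: str) -> list[int]:
    """Return the number given to each capturing `(`, in source order."""
    numbers = []
    next_number = 1
    # One frame per open group: None for ordinary groups, or
    # [start, highest] for a branch reset group `(?|`.
    stack = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip an escaped character
            continue
        if ch == "(":
            if pattern.startswith("(?|", i):
                stack.append([next_number, next_number])
                i += 3  # step past `(?|` so its `|` is not read as alternation
                continue
            # `(?<name>` captures; `(?<=` and `(?<!` are lookbehinds.
            named = pattern.startswith("(?<", i) and pattern[i + 3 : i + 4] not in ("=", "!")
            if not pattern.startswith("(?", i) or named:
                numbers.append(next_number)
                next_number += 1
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            # Direct child alternation of a branch reset group:
            # remember the high-water mark, then restart the numbering.
            frame = stack[-1]
            frame[1] = max(frame[1], next_number)
            next_number = frame[0]
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame is not None:
                # Continue after the highest number used by any branch.
                next_number = max(frame[1], next_number)
        i += 1
    return numbers
```

Applied to the two patterns shown in this section, the sketch reproduces the numbers in the diagrams, including the repeated `3` and the final `5` after the branch reset group closes.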

#### Backreferences
@@ -690,6 +721,10 @@ Different regex engines also have different rules around what characters are con
This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additionally, in a custom character class, PCRE only considers the space and tab characters as whitespace. Other engines do not differentiate between whitespace characters inside and outside custom character classes, and appear to follow a subset of this list. Therefore we intend to support exactly the characters in this list for the purposes of non-semantic whitespace.

### Group numbering

**TODO: Discuss how .NET numbers its groups differently**

## Canonical representations

Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
@@ -717,3 +752,5 @@ We plan on choosing the canonical spelling **TODO: decide**.