Commit b7dfa1e

committed
Update RegexSyntax.md
1 parent dbdef62 commit b7dfa1e

1 file changed

+43
-6
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 43 additions & 6 deletions
@@ -138,14 +138,19 @@ GroupNameBody -> Identifier | BalancingGroupBody
138138
Identifier -> [\w--\d] \w*
139139
```
140140

141-
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced, some may capture the nested match, some may match against the input without advancing, some may change the matching options set in the new scope, etc.
141+
Groups define a new scope within which a recursive regular expression pattern may occur. Groups have different semantics depending on how they are introduced.
142142

143143
Groups may be named. A name may contain any letter or number characters, or `_`, but must not start with a number. This restriction follows the behavior of other regex engines and avoids ambiguity between named and numeric group references.
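
As a rough illustration of the `Identifier -> [\w--\d] \w*` rule, the following Python sketch checks candidate group names. The helper name `is_valid_group_name` is hypothetical, and Python's `\w` and `\d` classes are only assumed to approximate the word and digit classes this proposal has in mind.

```python
import re

# [^\W\d] is "a word character that is not a digit", i.e. [\w--\d].
GROUP_NAME = re.compile(r"[^\W\d]\w*")


def is_valid_group_name(name: str) -> bool:
    """Return True if `name` has the shape Identifier -> [\\w--\\d] \\w*."""
    return GROUP_NAME.fullmatch(name) is not None
```

Under these assumptions, `year` and `_tmp1` are accepted, while `1st` is rejected because it starts with a number.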
144144

145145
Groups may be used to change the matching options present within their scope, see the *Matching options* section.
146146

147147
Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
148148

149+
#### Basic group kinds
150+
151+
- `()`: A capturing group.
152+
- `(?:)`: A non-capturing group.
153+
- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
149154

150155
#### Lookahead and lookbehind
151156

@@ -154,6 +159,8 @@ Note there are additional constructs that may syntactically appear similar to gr
154159
- `(?<=` specifies a lookbehind that attempts to match the group body against the input before the current position. Does not advance the input.
155160
- `(?<!` specifies a negative lookbehind that ensures the group body does not match the input before the current position. Does not advance the input.
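
To make "does not advance the input" concrete, here is a small sketch using Python's `re` module, which accepts the same `(?<=` and `(?<!` spellings. It only illustrates the general semantics of lookbehind in an existing engine; it is not a statement about how this proposal's matcher behaves, and the sample text is made up.

```python
import re

text = "price: $42, qty: 7"

# Lookbehind: match digits that are preceded by "$".
# The "$" is checked but is not part of the match.
after_dollar = re.search(r"(?<=\$)\d+", text)

# Negative lookbehind: match digits not preceded by "$" or another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", text)
```

Here `after_dollar` starts one position after the `$`, since the lookbehind itself consumes nothing.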
156161

162+
**TODO: Non-atomic variants**
163+
157164
PCRE2 defines explicitly spelled out versions of the syntax, e.g `(*negative_lookbehind:)`.
158165

159166
#### Atomic groups
@@ -340,15 +347,23 @@ PropertyContents -> PropertyName ('=' PropertyName)?
340347
PropertyName -> [\s\w-]+
341348
```
342349

343-
A character property specifies a particular Unicode or POSIX property to match against. Fuzzy matching is used when parsing the property name, and is done according to rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:
350+
A character property specifies a particular Unicode or POSIX property to match against. We intend to parse:
351+
352+
- The full range of Unicode character properties.
353+
- The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
354+
- The UTS#18 special properties `any`, `assigned`, `ascii`.
355+
356+
We intend to follow [UTS#18][uts18]'s guidance for character properties. This includes fuzzy matching of property names according to the rules set out by [UAX44-LM3], which means that the following property names are considered equivalent:
344357

345358
- `whitespace`
346359
- `isWhitespace`
347360
- `is-White_Space`
348361
- `iSwHiTeSpaCe`
349362
- `i s w h i t e s p a c e`
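
The equivalence above can be sketched as a normalization step. The Python function below is a minimal reading of UAX44-LM3 (ignore case, whitespace, underscores, hyphens, and an initial `is` prefix). The name `loose_key` is hypothetical, and this is not the proposal's actual parser.

```python
import re


def loose_key(name: str) -> str:
    """Normalize a property name for loose comparison, UAX44-LM3 style."""
    # Drop whitespace, underscores, and hyphens, then ignore case.
    key = re.sub(r"[\s_-]", "", name).lower()
    # Ignore an initial "is" prefix.
    if key.startswith("is"):
        key = key[2:]
    return key
```

Two spellings are then treated as the same property name when their keys compare equal, so each spelling in the list above normalizes to `whitespace`.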
350363

351-
Unicode properties consist of both a key and a value, e.g `General_Category=Whitespace`. However there are some properties where the key or value may be inferred. These include:
364+
Unicode properties consist of both a key and a value, e.g `General_Category=Whitespace`. Each component follows the fuzzy matching rule, and additionally may have an alternative alias spelling, as defined by Unicode in [PropertyAliases.txt][unicode-prop-key-aliases] and [PropertyValueAliases.txt][unicode-prop-value-aliases].
365+
366+
There are some Unicode properties where the key or value may be inferred. These include:
352367

353368
- General category properties e.g `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
354369
- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.
@@ -361,8 +376,6 @@ For non-Unicode properties, only a value is required. These include:
361376
- The special properties `any`, `assigned`, `ascii`.
362377
- The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.
363378

364-
**TODO: Spell out the properties we recognize while parsing vs. those we just parse as String?**
365-
366379
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
367380

368381
### Named characters
@@ -412,7 +425,25 @@ RecursionLevel -> '+' <Int> | '-' <Int>
412425

413426
A reference is an abstract identifier for a particular capturing group in a regular expression. It can either be named or numbered, and in the latter case may be specified relative to the current group. For example `-2` refers to the capture group `N - 2` where `N` is the number of the next capture group. References may refer to groups ahead of the current position e.g `+3`, or the name of a future group. These may be useful in recursive cases where the group being referenced has been matched in a prior iteration.
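
As a sketch of how a relative reference might be resolved to an absolute group number, consider the Python function below. The negative case follows the `N - 2` rule stated above. The positive case is an assumption borrowed from the PCRE2 convention that `+1` names the next group to be opened, which the text above does not spell out. The function name `resolve_relative` is hypothetical.

```python
def resolve_relative(offset: int, next_group: int) -> int:
    """Resolve a relative reference given N, the number of the next capture group."""
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    if offset < 0:
        # Stated in the text: -2 refers to group N - 2.
        target = next_group + offset
    else:
        # Assumption (PCRE2 convention): +1 refers to group N itself.
        target = next_group + offset - 1
    if target < 1:
        raise ValueError("reference points before the first capture group")
    return target
```

For example, with `next_group` equal to 5, `-2` resolves to group 3, and under the stated assumption `+1` resolves to group 5.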
414427

415-
**TODO: Describe how capture groups are numbered? Including nesting & resets?**
428+
#### Group numbering
429+
430+
Capturing groups are implicitly numbered according to the position of their opening `(` in the regex. For example:
431+
432+
```
433+
( ((?:a)(?<b>b)c)(d)e)
434+
^ ^     ^        ^
435+
1 2     3        4
436+
```
437+
438+
Non-capturing groups are skipped over when counting.
439+
440+
Branch reset groups can alter this numbering: they reset the numbering at each branch of a direct child alternation. For example:
441+
442+
```
443+
( ()(?|(a)(b)|(?:c)|(d)))()
444+
^ ^    ^  ^         ^    ^
445+
1 2    3  4         3    5
446+
```
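
The numbering rules in this section can be sketched as a single left-to-right scan. The Python function below is an illustrative approximation, not the proposal's parser: `number_groups` is a hypothetical name, only `(`, `(?:`, `(?|`, and `(?<name>` are given special treatment, every other `(?` construct is assumed to be non-capturing, and custom character classes are not handled.

```python
def number_groups(pattern: str) -> list[tuple[int, int]]:
    """Return (offset, number) for the opening paren of each capturing group."""
    result = []
    next_num = 1
    stack = []  # one entry per currently open group
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == "(":
            if pattern.startswith("(?|", i):
                # Branch reset: remember where numbering starts and the
                # highest number reached by any branch so far.
                stack.append({"reset": True, "start": next_num, "max": next_num})
                i += 3
                continue
            named = pattern.startswith("(?<", i) and pattern[i + 3:i + 4] not in ("=", "!")
            if pattern[i + 1:i + 2] != "?" or named:
                result.append((i, next_num))
                next_num += 1
            stack.append({"reset": False})
        elif ch == "|" and stack and stack[-1]["reset"]:
            # A new branch of a direct child alternation restarts the count.
            top = stack[-1]
            top["max"] = max(top["max"], next_num)
            next_num = top["start"]
        elif ch == ")" and stack:
            top = stack.pop()
            if top["reset"]:
                # Continue after the highest number used by any branch.
                next_num = max(top["max"], next_num)
        i += 1
    return result
```

Under these assumptions, the scan numbers the two examples in this section as 1, 2, 3, 4 and 1, 2, 3, 4, 3, 5 respectively.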
416447

417448
#### Backreferences
418449

@@ -690,6 +721,10 @@ Different regex engines also have different rules around what characters are con
690721

691722
This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additionally, in a custom character class, PCRE only considers the space and tab characters as whitespace. Other engines do not differentiate between whitespace characters inside and outside custom character classes, and appear to follow a subset of this list. Therefore we intend to support exactly the characters in this list for the purposes of non-semantic whitespace.
692723

724+
### Group numbering
725+
726+
**TODO: Discuss how .NET numbers its groups differently**
727+
693728
## Canonical representations
694729

695730
Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
@@ -717,3 +752,5 @@ We plan on choosing the canonical spelling **TODO: decide**.
717752
[uts18]: https://www.unicode.org/reports/tr18/
718753
[.net-syntax]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expressions
719754
[UAX44-LM3]: https://www.unicode.org/reports/tr44/#UAX44-LM3
755+
[unicode-prop-key-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt
756+
[unicode-prop-value-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt
