Commit fb7f380

Update RegexSyntax.md
1 parent bbb7756 commit fb7f380

1 file changed: +67 -12 lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 67 additions & 12 deletions
@@ -461,14 +461,14 @@ Branch reset groups can alter this numbering, as they reset the numbering in the
461461
```
462462
Backreference -> '\g{' NameOrNumberRef '}'
463463
| '\g' NumberRef
464-
| '\k<' Identifier '>'
465-
| "\k'" Identifier "'"
464+
| '\k<' NameOrNumberRef '>'
465+
| "\k'" NameOrNumberRef "'"
466466
| '\k{' Identifier '}'
467467
| '\' [1-9] [0-9]+
468468
| '(?P=' Identifier ')'
469469
```
470470

471-
A backreference evaluates to the value last captured by the referenced capturing group.
471+
A backreference evaluates to the value last captured by the referenced capturing group. If the referenced capture has not been evaluated yet, the match fails.
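The "last captured" rule can be observed in existing engines. The following is a minimal sketch using Python's `re` module, chosen only as a convenient illustration and not as the engine this document describes. It assumes that Python's treatment of a group that never participated in the match is comparable to the "not been evaluated yet" case above.

```python
import re

# Group 1 is overwritten on each iteration of `+`, so `\1` only sees the
# value from the final iteration that the overall match settles on.
m = re.fullmatch(r"(a|b)+\1", "abb")
last_captured = m.group(1)

# Group 1 never participates here (the optional group matches nothing),
# so in Python the backreference cannot match and the whole match fails.
unset_reference = re.match(r"(a)?\1b", "b")

print(last_captured)
print(unset_reference)
```

Other engines differ on the second case (some treat an unset group as matching the empty string), which is why the behavior is worth stating explicitly.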
472472

473473
#### Subpatterns
474474

@@ -483,7 +483,7 @@ GroupLikeSubpatternBody -> 'P>' <String>
483483
| NumberRef
484484
```
485485

486-
A subpattern causes the referenced group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
486+
A subpattern causes the referenced capture group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
487487

488488
### Conditionals
489489

@@ -754,39 +754,94 @@ We intend on matching the PCRE behavior where groups are numbered purely based o
754754

755755
## Canonical representations
756756

757-
Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
757+
Many engines have different spellings for the same regex features, and we intend to support parsing them. However, for purposes such as printing, we need to decide on a canonical syntax for various constructs.
758758

759759
### Unicode scalars
760760

761+
```
762+
UniScalar -> '\u{' HexDigit{1...} '}'
763+
| '\u' HexDigit{4}
764+
| '\x{' HexDigit{1...} '}'
765+
| '\x' HexDigit{0...2}
766+
| '\U' HexDigit{8}
767+
| '\o{' OctalDigit{1...} '}'
768+
| '\0' OctalDigit{0...3}
769+
770+
HexDigit -> [0-9a-fA-F]
771+
OctalDigit -> [0-7]
772+
```
773+
774+
For consistency with String escape syntax, we intend to canonicalize to `\u{...}`.
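As a rough illustration of what this canonicalization could look like, here is a sketch in Python. The function name `canonicalize_scalar` is hypothetical, as is the choice to print uppercase hex without leading zeros. It assumes hex digits are limited to `0-9`, `a-f`, and `A-F`, treats an escape with no digits (`\x`, `\0`) as the value zero, and does not check that the value is a valid Unicode scalar.

```python
import re

# One (pattern, base) pair per spelling in the UniScalar grammar.
# Every pattern is applied with fullmatch, so order does not matter.
_SPELLINGS = [
    (re.compile(r"\\u\{([0-9a-fA-F]+)\}"), 16),
    (re.compile(r"\\u([0-9a-fA-F]{4})"), 16),
    (re.compile(r"\\x\{([0-9a-fA-F]+)\}"), 16),
    (re.compile(r"\\x([0-9a-fA-F]{0,2})"), 16),
    (re.compile(r"\\U([0-9a-fA-F]{8})"), 16),
    (re.compile(r"\\o\{([0-7]+)\}"), 8),
    (re.compile(r"\\0([0-7]{0,3})"), 8),
]


def canonicalize_scalar(escape):
    """Rewrite a single scalar escape to the `\\u{...}` spelling."""
    for pattern, base in _SPELLINGS:
        m = pattern.fullmatch(escape)
        if m:
            digits = m.group(1)
            # Assumption: an empty digit run denotes the value zero.
            value = int(digits, base) if digits else 0
            return "\\u{%X}" % value
    raise ValueError("not a recognized scalar escape: %r" % escape)


print(canonicalize_scalar(r"\x41"))
print(canonicalize_scalar(r"\o{101}"))
```

A real printer would work on a parsed AST rather than on escape strings, so this only demonstrates the many-to-one mapping.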
775+
761776
### Character properties
762777

778+
**TODO: Should we canonicalize on e.g. `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
779+
763780
### Groups
764781

765782
#### Named
766783

784+
```
785+
NamedGroup -> 'P<' GroupNameBody '>'
786+
| '<' GroupNameBody '>'
787+
| "'" GroupNameBody "'"
788+
```
789+
790+
We intend to canonicalize to the `(?<...>)` spelling.
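The three spellings differ only in the opening delimiter, so the mapping is mechanical. The sketch below is a deliberately naive, text-level illustration in Python. The name `canonicalize_named_groups` is hypothetical, group names are assumed to be plain word characters (`GroupNameBody` may allow more), and escaped parentheses and character classes are not handled.

```python
import re

# Matches the opening of a named group in any of the three spellings:
#   (?P<name>   (?<name>   (?'name'
# Lookbehinds such as (?<= and (?<! do not match, because a word
# character is required directly after the '<'.
_NAMED_GROUP_OPEN = re.compile(r"\(\?(?:P<(\w+)>|<(\w+)>|'(\w+)')")


def canonicalize_named_groups(pattern):
    """Rewrite every named-group opener to the `(?<name>` spelling."""
    def rewrite(m):
        name = m.group(1) or m.group(2) or m.group(3)
        return "(?<%s>" % name
    return _NAMED_GROUP_OPEN.sub(rewrite, pattern)


print(canonicalize_named_groups(r"(?P<year>\d+)-(?'mon'\d+)"))
```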
791+
767792
#### Lookaheads and lookbehinds
768793

769-
### Backreferences
794+
We intend to canonicalize to the short-form versions of these group kinds, e.g. `(?=`.
770795

771-
There are a variety of backreference spellings accepted by different engines
796+
### Backreferences
772797

773798
```
774799
Backreference -> '\g{' NameOrNumberRef '}'
775800
| '\g' NumberRef
776-
| '\k<' Identifier '>'
777-
| "\k'" Identifier "'"
801+
| '\k<' NameOrNumberRef '>'
802+
| "\k'" NameOrNumberRef "'"
778803
| '\k{' Identifier '}'
779804
| '\' [1-9] [0-9]+
780805
| '(?P=' Identifier ')'
781806
```
782807

783-
We plan on choosing the canonical spelling **TODO: decide**.
808+
For absolute numeric references, we plan on choosing the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend to canonicalize to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
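To make the proposed split concrete, the following Python sketch maps each backreference spelling to the canonical form described above: `\DDD` for absolute numbers, `\k<...>` for relative numbers and names. The function name `canonicalize_backreference` is hypothetical. The sketch accepts single-digit `\N` references, strips leading zeros from numbers, and does not validate the reference beyond its delimiters; all of these are assumptions rather than decisions made in this document.

```python
import re

# One pattern per spelling in the Backreference grammar; group 1 is the
# referenced name or number.
_REF_SPELLINGS = [
    re.compile(r"\\g\{([^}]+)\}"),
    re.compile(r"\\g([+-]?\d+)"),
    re.compile(r"\\k<([^>]+)>"),
    re.compile(r"\\k'([^']+)'"),
    re.compile(r"\\k\{([^}]+)\}"),
    re.compile(r"\\([1-9]\d*)"),
    re.compile(r"\(\?P=([^)]+)\)"),
]


def canonicalize_backreference(ref):
    """Rewrite one backreference to `\\DDD` or `\\k<...>`."""
    for pattern in _REF_SPELLINGS:
        m = pattern.fullmatch(ref)
        if m:
            target = m.group(1)
            if target.isascii() and target.isdigit():
                # Absolute numeric reference.
                return "\\%d" % int(target)
            # Relative (+N / -N) or named reference.
            return "\\k<%s>" % target
    raise ValueError("not a recognized backreference: %r" % ref)


print(canonicalize_backreference(r"\g{2}"))
print(canonicalize_backreference(r"\g-1"))
print(canonicalize_backreference(r"(?P=word)"))
```

If the TODO above is resolved in favor of `\k<...>` everywhere, the numeric branch simply disappears.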
809+
810+
### Subpatterns
811+
812+
```
813+
Subpattern -> '\g<' NameOrNumberRef '>'
814+
| "\g'" NameOrNumberRef "'"
815+
| '(?' GroupLikeSubpatternBody ')'
816+
817+
GroupLikeSubpatternBody -> 'P>' <String>
818+
| '&' <String>
819+
| 'R'
820+
| NumberRef
821+
```
784822

785-
### Subpattern
823+
We intend to canonicalize to the `\g<...>` spelling. **TODO: For `(?R)` too?**
786824

787825
### Conditional references
788826

789-
### Callouts
827+
**TODO: Decide**
828+
829+
### PCRE Callouts
830+
831+
```
832+
PCRECallout -> '(?C' PCRECalloutBody ')'
833+
PCRECalloutBody -> '' | <Number>
834+
| '`' <String> '`'
835+
| "'" <String> "'"
836+
| '"' <String> '"'
837+
| '^' <String> '^'
838+
| '%' <String> '%'
839+
| '#' <String> '#'
840+
| '$' <String> '$'
841+
| '{' <String> '}'
842+
```
843+
844+
PCRE accepts a number of alternative delimiters for callout string arguments. We intend to canonicalize to `(?C"...")`. **TODO: May want to alter if we choose `r"..."`, though lexing should be able to handle it by looking for the `(?C` prefix**.
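The delimiter rewrite could be sketched as follows in Python. The function name `canonicalize_callout` is hypothetical. The sketch assumes PCRE2's convention that a delimiter occurring inside the string argument is written doubled, so it undoubles the original closing delimiter and doubles any `"` in the result. It trusts that the input is well formed and leaves empty and numeric callouts unchanged.

```python
# Opening delimiter -> closing delimiter, per the PCRECalloutBody grammar.
_CLOSERS = {
    "`": "`", "'": "'", '"': '"', "^": "^",
    "%": "%", "#": "#", "$": "$", "{": "}",
}


def canonicalize_callout(callout):
    """Rewrite a string callout to the `(?C"...")` spelling."""
    if not (callout.startswith("(?C") and callout.endswith(")")):
        raise ValueError("not a callout: %r" % callout)
    body = callout[3:-1]
    if body == "" or (body.isascii() and body.isdigit()):
        return callout  # no delimiter to canonicalize
    closer = _CLOSERS.get(body[0])
    if closer is None or len(body) < 2 or body[-1] != closer:
        raise ValueError("malformed callout body: %r" % body)
    # Assumption: a doubled closing delimiter stands for one literal copy.
    text = body[1:-1].replace(closer * 2, closer)
    return '(?C"%s")' % text.replace('"', '""')


print(canonicalize_callout("(?C'ab ''c'' d')"))
print(canonicalize_callout('(?C{say "hi"})'))
```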
790845

791846

792847
[pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
