Documentation/Evolution/RegexSyntax.md
67 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -461,14 +461,14 @@ Branch reset groups can alter this numbering, as they reset the numbering in the
461
461
```
462
462
Backreference -> '\g{' NameOrNumberRef '}'
463
463
| '\g' NumberRef
464
-
| '\k<' Identifier '>'
465
-
| "\k'" Identifier "'"
464
+
| '\k<' NameOrNumberRef '>'
465
+
| "\k'" NameOrNumberRef "'"
466
466
| '\k{' Identifier '}'
467
467
| '\' [1-9] [0-9]+
468
468
| '(?P=' Identifier ')'
469
469
```
470
470
471
-
A backreference evaluates to the value last captured by the referenced capturing group.
471
+
A backreference evaluates to the value last captured by the referenced capturing group. If the referenced capture has not been evaluated yet, the match fails.
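As a rough illustration of the failure rule, the sketch below uses Python's `re` module as a stand-in engine. This is an assumption for demonstration only: it is not the engine this document describes, but it treats a backreference to a group that has not captured in the same way.

```python
import re

# Group 1 is optional; when it does not participate, \1 has nothing to refer to.
pattern = re.compile(r"(a)?\1b")

# Group 1 captured "a", so \1 must match another "a" before the "b".
set_case = pattern.fullmatch("aab")

# Group 1 never captured, so the backreference fails and so does the match.
unset_case = pattern.fullmatch("b")
```

Some engines instead treat an unset backreference as matching the empty string, so the behavior is worth stating explicitly.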
A subpattern causes the referenced group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
486
+
A subpattern causes the referenced capture group to be re-evaluated at the current position. The syntax `(?R)` is equivalent to `(?0)`, and causes the entire pattern to be recursed.
487
487
488
488
### Conditionals
489
489
@@ -754,39 +754,94 @@ We intend on matching the PCRE behavior where groups are numbered purely based o
754
754
755
755
## Canonical representations
756
756
757
-
Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.
757
+
Many engines have different spellings for the same regex features, which we intend to support parsing. However, for purposes such as printing, we need to decide on a canonical syntax for various constructs.
758
758
759
759
### Unicode scalars
760
760
761
+
```
762
+
UniScalar -> '\u{' HexDigit{1...} '}'
763
+
| '\u' HexDigit{4}
764
+
| '\x{' HexDigit{1...} '}'
765
+
| '\x' HexDigit{0...2}
766
+
| '\U' HexDigit{8}
767
+
| '\o{' OctalDigit{1...} '}'
768
+
| '\0' OctalDigit{0...3}
769
+
770
+
HexDigit -> [0-9a-fA-F]
771
+
OctalDigit -> [0-7]
772
+
```
773
+
774
+
For consistency with String escape syntax, we intend on canonicalizing to `\u{...}`.
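A minimal sketch of this canonicalization, written in Python purely for illustration. The function name `canonical_scalar` is hypothetical, `HexDigit` is assumed to mean `[0-9a-fA-F]`, and the input is assumed to be a single, already isolated escape sequence.

```python
import re

# One alternative per spelling in the UniScalar grammar. Braced forms are
# listed before the bare forms so that "\x{41}" is not read as a bare "\x".
_SCALAR = re.compile(
    r"""\\(?:
        u\{(?P<ub>[0-9a-fA-F]+)\}
      | u(?P<u4>[0-9a-fA-F]{4})
      | x\{(?P<xb>[0-9a-fA-F]+)\}
      | x(?P<x2>[0-9a-fA-F]{0,2})
      | U(?P<U8>[0-9a-fA-F]{8})
      | o\{(?P<ob>[0-7]+)\}
      | 0(?P<o3>[0-7]{0,3})
    )""",
    re.VERBOSE,
)


def canonical_scalar(escape: str) -> str:
    """Rewrite one scalar escape to the \\u{...} spelling (hypothetical helper)."""
    m = _SCALAR.fullmatch(escape)
    if m is None:
        raise ValueError(f"not a scalar escape: {escape!r}")
    kind = m.lastgroup                    # name of the alternative that matched
    digits = m.group(kind)
    base = 8 if kind in ("ob", "o3") else 16
    value = int(digits, base) if digits else 0   # bare "\x" and "\0" mean zero
    return "\\u{%X}" % value
```

For example, `canonical_scalar(r"\x41")` and `canonical_scalar(r"\o{101}")` both yield `\u{41}`. The sketch does not check that the value is a valid Unicode scalar.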
775
+
761
776
### Character properties
762
777
778
+
**TODO: Should we canonicalize on e.g. `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
779
+
763
780
### Groups
764
781
765
782
#### Named
766
783
784
+
```
785
+
NamedGroup -> 'P<' GroupNameBody '>'
786
+
| '<' GroupNameBody '>'
787
+
| "'" GroupNameBody "'"
788
+
```
789
+
790
+
We intend on canonicalizing to the `(?<...>)` spelling.
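The rewrite to `(?<...>)` can be sketched textually as below. This is a naive Python illustration under stated assumptions: `canonical_named_groups` is a hypothetical name, `GroupNameBody` is approximated as `\w+`, and escaped parentheses and character classes are not handled, which a real implementation working on a parsed AST would not need to worry about.

```python
import re

# Matches the three accepted openers after "(?": P<name>, <name>, 'name'.
# Lookbehinds such as "(?<=" are not matched because "=" is not a word character.
_NAMED_OPEN = re.compile(r"\(\?(?:P<(\w+)>|<(\w+)>|'(\w+)')")


def canonical_named_groups(pattern: str) -> str:
    """Rewrite every named-group opener to the (?<name> spelling."""
    def repl(m: re.Match) -> str:
        name = m.group(1) or m.group(2) or m.group(3)
        return "(?<%s>" % name
    return _NAMED_OPEN.sub(repl, pattern)
```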
791
+
767
792
#### Lookaheads and lookbehinds
768
793
769
-
### Backreferences
794
+
We intend on canonicalizing to the short-form versions of these group kinds, e.g. `(?=`.
770
795
771
-
There are a variety of backreference spellings accepted by different engines
796
+
### Backreferences
772
797
773
798
```
774
799
Backreference -> '\g{' NameOrNumberRef '}'
775
800
| '\g' NumberRef
776
-
| '\k<' Identifier '>'
777
-
| "\k'" Identifier "'"
801
+
| '\k<' NameOrNumberRef '>'
802
+
| "\k'" NameOrNumberRef "'"
778
803
| '\k{' Identifier '}'
779
804
| '\' [1-9] [0-9]+
780
805
| '(?P=' Identifier ')'
781
806
```
782
807
783
-
We plan on choosing the canonical spelling **TODO: decide**.
808
+
For absolute numeric references, we plan on choosing the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend on canonicalizing to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
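The proposed rule (absolute numbers to `\DDD`, everything else to `\k<...>`) could look roughly like the Python sketch below. The name `canonical_backreference` is hypothetical, the reference bodies are matched loosely rather than by the exact `NameOrNumberRef` grammar, and a reference to group `0` is simply passed through to `\k<0>` because the text does not say how it should be treated.

```python
import re

# One alternative per backreference spelling in the grammar above.
_BACKREF = re.compile(
    r"""\\g\{(?P<a>[^}]+)\}
      | \\g(?P<b>[+-]?\d+)
      | \\k<(?P<c>[^>]+)>
      | \\k'(?P<d>[^']+)'
      | \\k\{(?P<e>[^}]+)\}
      | \\(?P<f>[1-9]\d*)
      | \(\?P=(?P<g>[^)]+)\)
    """,
    re.VERBOSE,
)

_ABSOLUTE = re.compile(r"[1-9][0-9]*")


def canonical_backreference(ref: str) -> str:
    """Rewrite one backreference to its proposed canonical spelling."""
    m = _BACKREF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a backreference: {ref!r}")
    body = m.group(m.lastgroup)
    if _ABSOLUTE.fullmatch(body):
        return "\\" + body           # absolute number: \DDD
    return "\\k<%s>" % body          # relative (+N / -N) or named reference
```

If the TODO is resolved in favor of `\k<...>` for everything, the `_ABSOLUTE` branch would simply be dropped.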
809
+
810
+
### Subpatterns
811
+
812
+
```
813
+
Subpattern -> '\g<' NameOrNumberRef '>'
814
+
| "\g'" NameOrNumberRef "'"
815
+
| '(?' GroupLikeSubpatternBody ')'
816
+
817
+
GroupLikeSubpatternBody -> 'P>' <String>
818
+
| '&' <String>
819
+
| 'R'
820
+
| NumberRef
821
+
```
784
822
785
-
### Subpattern
823
+
We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**
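A small Python sketch of the subpattern rewrite, for illustration only. `canonical_subpattern` is a hypothetical name, reference bodies are matched loosely, and `(?R)` is deliberately left unchanged because the TODO above leaves that case open.

```python
import re

# One alternative per subpattern spelling, excluding (?R).
_SUBPATTERN = re.compile(
    r"""\\g<(?P<a>[^>]+)>
      | \\g'(?P<b>[^']+)'
      | \(\?P>(?P<c>[^)]+)\)
      | \(\?&(?P<d>[^)]+)\)
      | \(\?(?P<e>[+-]?\d+)\)
    """,
    re.VERBOSE,
)


def canonical_subpattern(call: str) -> str:
    """Rewrite one subpattern call to the \\g<...> spelling."""
    if call == "(?R)":
        return call  # undecided in the source text, so not rewritten here
    m = _SUBPATTERN.fullmatch(call)
    if m is None:
        raise ValueError(f"not a subpattern call: {call!r}")
    return "\\g<%s>" % m.group(m.lastgroup)
```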
786
824
787
825
### Conditional references
788
826
789
-
### Callouts
827
+
**TODO: Decide**
828
+
829
+
### PCRE Callouts
830
+
831
+
```
832
+
PCRECallout -> '(?C' PCRECalloutBody ')'
833
+
PCRECalloutBody -> '' | <Number>
834
+
| '`' <String> '`'
835
+
| "'" <String> "'"
836
+
| '"' <String> '"'
837
+
| '^' <String> '^'
838
+
| '%' <String> '%'
839
+
| '#' <String> '#'
840
+
| '$' <String> '$'
841
+
| '{' <String> '}'
842
+
```
843
+
844
+
PCRE accepts a number of alternative delimiters for callout string arguments. We intend to canonicalize to `(?C"...")`. **TODO: May want to alter if we choose `r"..."`, though lexing should be able to handle it by looking for the `(?C` prefix**.
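The delimiter normalization can be sketched in Python as follows. `canonical_callout` is a hypothetical name, and the sketch assumes the string argument contains neither its own delimiter nor a double quote, since the grammar above does not specify an escaping rule.

```python
# Opening delimiter -> closing delimiter, per the grammar above.
_DELIMS = {
    "`": "`", "'": "'", '"': '"', "^": "^",
    "%": "%", "#": "#", "$": "$", "{": "}",
}


def canonical_callout(callout: str) -> str:
    """Rewrite a string callout to the (?C"...") spelling."""
    if not (callout.startswith("(?C") and callout.endswith(")")):
        raise ValueError(f"not a callout: {callout!r}")
    body = callout[3:-1]
    if body == "" or (body.isascii() and body.isdigit()):
        return callout  # empty and numeric callouts have a single spelling
    close = _DELIMS.get(body[0])
    if close is None or len(body) < 2 or body[-1] != close:
        raise ValueError(f"malformed callout argument: {body!r}")
    text = body[1:-1]
    if '"' in text:
        # Assumption: escaping embedded quotes is out of scope for this sketch.
        raise ValueError("embedded double quote is not handled")
    return '(?C"%s")' % text
```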