You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: Documentation/Evolution/RegexSyntax.md
+10-9Lines changed: 10 additions & 9 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -814,23 +814,23 @@ There are multiple equivalent ways of spelling the same the Unicode scalar value
814
814
815
815
Character properties `\p{...}` have a variety of alternative spellings due to fuzzy matching, Unicode aliases, and shorthand syntax for common Unicode properties. They also may be written using POSIX syntax e.g `[:gc=Whitespace:]`.
816
816
817
-
**TODO: Should we canonicalize on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
817
+
**TODO: Should we suggest canonicalizing on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
818
818
819
819
### Groups
820
820
821
-
#### Named
821
+
Named groups may be specified with a few different delimiters:
822
822
823
823
```
824
824
NamedGroup -> 'P<' GroupNameBody '>'
825
825
| '<' GroupNameBody '>'
826
826
| "'" GroupNameBody "'"
827
827
```
828
828
829
-
We intend on canonicalizing to the `(?<...>)`spelling.
829
+
The preferable spelling here will likely be influenced by the regex literal delimiter choice. `(?'...')` seems a reasonable preferred spelling in isolation, however not so much if `re'...'` is chosen as the delimiter. To reduce possible confusion for the parser as well as the user, `(?<...>)`would seem the more preferable syntax in that case. This would also likely affect the preferred syntax for references.
830
830
831
831
#### Lookaheads and lookbehinds
832
832
833
-
We intend on canonicalizing to the short-form versions of these group kinds, e.g `(?=`.
833
+
These have both shorthand spellings as well as more explicit PCRE2 spellings. While the more explicit spellings are definitely clearer, they can feel quite verbose. The short-form spellings e.g `(?=` seem more preferable due to their familiarity.
For absolute numeric references, we plan on choosing the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend on canonicalizing to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
847
+
For absolute numeric references, `\DDD` seems to be a strong candidate for the preferred syntax due to its familiarity. For relative numbered references, as well as named references, `\k<...>` or `\k'...'` seem like the ideal choice (depending on the syntax chosen for named groups). This avoids the confusion between `\g{...}` and `\g<...>` referring to a backreference and subpattern respectively. It additionally avoids confusion with group syntax.
848
+
849
+
There may be value in choosing `\k` as the single unified syntax for backreferences (instead of `\DDD` for absolute numeric references), though there may be value in preserving the familiarity of `\DDD`.
We intend on canonicalizing to the `\g<...>`spelling. **TODO: For `(?R)`too?**
864
+
To avoid confusion with groups, `\g<...>`or `\g'...'` seem like the ideal preferred spellings (depending on the syntax chosen for named groups). There may however be value in preserving the `(?R)`spelling where it is used, instead of preferring e.g `\g<0>`.
863
865
864
866
### Conditional references
865
867
@@ -874,7 +876,7 @@ KnownCondition -> 'R'
874
876
| NumberRef
875
877
```
876
878
877
-
For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. We intend on canonicalizing to `(?(<name>))` to match the group name canonicalization.
879
+
For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. The preferred syntax in this case would likely reflect the syntax chosen for named groups.
PCRE accepts a number of alternative delimiters for callout string arguments. We intend to canonicalize to `(?C"...")`. **TODO: May want to alter if we choose `r"..."`, though lexing should be able to handle it by looking for the `(?C` prefix**.
895
-
896
+
PCRE accepts a number of alternative delimiters for callout string arguments. The `(?C"...")` syntax seems preferable due to its consistency with string literal syntax. However it may be necessary to prefer `(?C'...')` depending on whether the regex literal delimiter ends up involving double quotes e.g `re"..."`.
0 commit comments