Commit 11bb57d

Update RegexSyntax.md
1 parent cc314a2 commit 11bb57d

1 file changed: +10 -9 lines changed


Documentation/Evolution/RegexSyntax.md

Lines changed: 10 additions & 9 deletions
@@ -814,23 +814,23 @@ There are multiple equivalent ways of spelling the same the Unicode scalar value

Character properties `\p{...}` have a variety of alternative spellings due to fuzzy matching, Unicode aliases, and shorthand syntax for common Unicode properties. They may also be written using POSIX syntax, e.g. `[:gc=Whitespace:]`.

-**TODO: Should we canonicalize on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**
+**TODO: Should we suggest canonicalizing on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**

### Groups

-#### Named
+Named groups may be specified with a few different delimiters:

```
NamedGroup -> 'P<' GroupNameBody '>'
            | '<' GroupNameBody '>'
            | "'" GroupNameBody "'"
```

-We intend on canonicalizing to the `(?<...>)` spelling.
+The preferred spelling here will likely be influenced by the choice of regex literal delimiter. `(?'...')` seems a reasonable preferred spelling in isolation, but less so if `re'...'` is chosen as the delimiter. To reduce possible confusion for both the parser and the user, `(?<...>)` would seem the preferable syntax in that case. This would also likely affect the preferred syntax for references.
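
To make the idea of a preferred named-group spelling concrete, here is a minimal sketch in Python that rewrites the three `NamedGroup` openers from the grammar above to the `(?<...>)` form. The function name `prefer_angle_named_groups` is hypothetical, group names are assumed to be identifier-like (the grammar's `GroupNameBody` is not shown in this diff), and the rewrite is deliberately naive: it does not skip escaped parentheses or custom character classes.

```python
import re

# Matches the three named-group openers: (?P<name>, (?<name>, and (?'name'.
# Assumption: names look like identifiers. Lookbehind openers such as (?<= and
# (?<! are left alone because '=' and '!' cannot start an identifier.
_NAMED_GROUP_OPENER = re.compile(
    r"\(\?(?:P<(?P<p>[A-Za-z_]\w*)>"
    r"|<(?P<angle>[A-Za-z_]\w*)>"
    r"|'(?P<quote>[A-Za-z_]\w*)')"
)


def prefer_angle_named_groups(pattern: str) -> str:
    """Rewrite every named-group opener to the (?<name> spelling (naive sketch)."""

    def rewrite(match: re.Match) -> str:
        # Exactly one of the three alternatives captured the name.
        name = match.group("p") or match.group("angle") or match.group("quote")
        return f"(?<{name}>"

    return _NAMED_GROUP_OPENER.sub(rewrite, pattern)
```

Swapping the replacement string for `(?'{name}'` would give the other candidate spelling, which is the kind of switch the delimiter decision would drive.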

#### Lookaheads and lookbehinds

-We intend on canonicalizing to the short-form versions of these group kinds, e.g `(?=`.
+These have both shorthand spellings and more explicit PCRE2 spellings. While the explicit spellings are clearer, they can feel quite verbose. The short-form spellings, e.g. `(?=`, seem preferable due to their familiarity.
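
As an illustration of the two families of spellings, the following Python sketch maps the explicit PCRE2 lookaround openers onto their short forms. It assumes the PCRE2 alphabetic names `positive_lookahead`, `negative_lookahead`, `positive_lookbehind`, and `negative_lookbehind` together with their abbreviations `pla`, `nla`, `plb`, and `nlb`; the function name `prefer_short_lookarounds` is hypothetical, and escaped parentheses are not handled.

```python
import re

# Explicit PCRE2 lookaround names (and abbreviations) mapped to short-form openers.
_SHORT_FORMS = {
    "positive_lookahead": "(?=",
    "pla": "(?=",
    "negative_lookahead": "(?!",
    "nla": "(?!",
    "positive_lookbehind": "(?<=",
    "plb": "(?<=",
    "negative_lookbehind": "(?<!",
    "nlb": "(?<!",
}

# Matches openers of the form (*name: for any of the names above.
_EXPLICIT_OPENER = re.compile(r"\(\*(" + "|".join(_SHORT_FORMS) + r"):")


def prefer_short_lookarounds(pattern: str) -> str:
    """Rewrite explicit lookaround openers to their short forms (naive sketch)."""
    return _EXPLICIT_OPENER.sub(lambda m: _SHORT_FORMS[m.group(1)], pattern)
```

Only the opener changes; the group body and closing parenthesis are identical in both spellings, which is why a simple substitution is enough for this sketch.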

### Backreferences

@@ -844,7 +844,9 @@ Backreference -> '\g{' NamedOrNumberRef '}'
              | '(?P=' NamedRef ')'
```

-For absolute numeric references, we plan on choosing the canonical spelling `\DDD`, as it is unambiguous with octal sequences. For relative numbered references, as well as named references, we intend on canonicalizing to `\k<...>` to match the group name canonicalization `(?<...>)`. **TODO: How valuable is it to have canonical `\DDD`? Would it be better to just use `\k<...>` for everything?**
+For absolute numeric references, `\DDD` seems to be a strong candidate for the preferred syntax due to its familiarity. For relative numbered references, as well as named references, `\k<...>` or `\k'...'` seem like the ideal choice (depending on the syntax chosen for named groups). This avoids the confusion between `\g{...}` and `\g<...>`, which refer to a backreference and a subpattern respectively. It additionally avoids confusion with group syntax.
+
+There may be value in choosing `\k` as the single unified syntax for backreferences (instead of `\DDD` for absolute numeric references), though this must be weighed against preserving the familiarity of `\DDD`.
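
To show that the spellings under discussion differ only in syntax, here is a small Python sketch using the two backreference forms that Python's `re` module accepts: the numeric `\1` and the named `(?P=name)` form from the grammar above. Python's `re` does not accept `\k<...>`, so this only demonstrates the equivalence, not the proposed preferred spelling; the helper `finds_repeat` is a hypothetical name.

```python
import re

# The same "repeated word" pattern written with a numeric backreference
# and with a named backreference.
numbered = re.compile(r"(\w+) \1")
named = re.compile(r"(?P<word>\w+) (?P=word)")


def finds_repeat(regex: re.Pattern, text: str):
    """Return the first doubled word sequence matched by regex, or None."""
    match = regex.search(text)
    return match.group(0) if match else None
```

Both patterns match the same text, so a tool could translate between the spellings without changing behavior.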

### Subpatterns

@@ -859,7 +861,7 @@ GroupLikeSubpatternBody -> 'P>' NamedRef
                        | NumberRef
```

-We intend on canonicalizing to the `\g<...>` spelling. **TODO: For `(?R)` too?**
+To avoid confusion with groups, `\g<...>` or `\g'...'` seem like the ideal preferred spellings (depending on the syntax chosen for named groups). There may, however, be value in preserving the `(?R)` spelling where it is used, instead of preferring e.g. `\g<0>`.

### Conditional references

@@ -874,7 +876,7 @@ KnownCondition -> 'R'
               | NumberRef
```

-For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. We intend on canonicalizing to `(?(<name>))` to match the group name canonicalization.
+For named references in a group condition, there is a choice between `(?('name'))` and `(?(<name>))`. The preferred syntax in this case would likely reflect the syntax chosen for named groups.

### PCRE Callouts

@@ -891,8 +893,7 @@ PCRECalloutBody -> '' | <Number>
                | '{' <String> '}'
```

-PCRE accepts a number of alternative delimiters for callout string arguments. We intend to canonicalize to `(?C"...")`. **TODO: May want to alter if we choose `r"..."`, though lexing should be able to handle it by looking for the `(?C` prefix**.
-
+PCRE accepts a number of alternative delimiters for callout string arguments. The `(?C"...")` syntax seems preferable due to its consistency with string literal syntax. However, it may be necessary to prefer `(?C'...')`, depending on whether the regex literal delimiter ends up involving double quotes, e.g. `re"..."`.

## Alternatives Considered
