
Commit 5a2517f

Update RegexSyntax.md
1 parent fb7f380 commit 5a2517f

1 file changed: +10 -10 lines changed
Documentation/Evolution/RegexSyntax.md

Lines changed: 10 additions & 10 deletions
@@ -14,7 +14,7 @@ This proposal-component focuses on the interior syntax, which is large enough fo

 ## Motivation

-Swift aims to be a pragmatic programming language, balancing (TODO: prose). Rather than pursue a novel interior syntax, (TODO: prose).
+Swift aims to be a pragmatic programming language, balancing (**TODO(Michael)**: prose). Rather than pursue a novel interior syntax, (**TODO(Michael)**: prose).

 Regex interior syntax is part of a larger [proposal](https://forums.swift.org/t/pitch-regular-expression-literals/52820), which in turn is part of a larger [string processing effort](https://forums.swift.org/t/declarative-string-processing-overview/52459).

@@ -502,7 +502,6 @@ KnownCondition -> 'R'
 | 'DEFINE'
 | 'VERSION' VersionCheck
 | NumberRef
-| NameRef

 PCREVersionCheck -> '>'? '=' PCREVersionNumber
 PCREVersionNumber -> <Int> '.' <Int>
@@ -512,11 +511,12 @@ A conditional evaluates a particular condition, and chooses a branch to match ag

 A condition may be:

-- A reference to a capture group, which checks whether the group matched successfully.
+- A numeric or delimited named reference to a capture group, which checks whether the group matched successfully.
 - A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
-- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
 - A PCRE version check.

+If the condition does not syntactically match any of the above, it is treated as an arbitrary recursive regular expression. This will be matched against, and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
+
 The `DEFINE` keyword is not used as a condition, but rather a way in which to define a group which is not evaluated, but may be referenced by a subpattern.

 ### PCRE backtracking directives
@@ -616,9 +616,9 @@ An absent function is an Oniguruma feature that allows for the easy inversion of

 ## Syntactic differences between engines

-**TODO: Intro**
+**TODO(Michael, if you want): Intro**

-**TODO: Talk about compatibility modes for different engines being a possible future direction?**
+**TODO(Michael, if you want): Talk about compatibility modes for different engines being a possible future direction?**

 ### Character class set operations

@@ -696,15 +696,15 @@ We aim to support the PCRE behavior.

 ### Backreference condition kinds

-PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
+PCRE and .NET allow for conditional patterns to reference a group by its name without any form of delimiter, e.g:

 ```
 (?<group1>x)?(?(group1)y)
 ```

-where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against.
+where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against. Oniguruma on the other hand will always treat `group1` as an regex condition to match against.

-We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by PCRE. **TODO: Is the opposite more common?**
+We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by both PCRE and Oniguruma.

 ### `\N`

@@ -716,7 +716,7 @@ ICU unifies the character property syntax `\p{...}` with the syntax for POSIX ch

 ### Script properties

-Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These often get assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extension property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.
+Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These are often assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extension property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.

 As such we feel that the more desirable default behavior of shorthand script property syntax e.g `\p{Latin}` is for it to be treated as `\p{Script_Extension=Latin}`. This matches Perl's default behavior. Plain script properties may still be written using the more explicit syntax e.g `\p{Script=Latin}` and `\p{sc=Latin}`.

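
For readers who want to try the backreference condition from the "Backreference condition kinds" hunk, here is a minimal sketch using Python's `re` module. Python is not one of the engines the proposal compares, so this only illustrates the general behavior of a group-reference condition and says nothing about how Swift will parse it. It assumes Python's `(?P<name>...)` spelling for named groups, and the `accepts` helper is a hypothetical name introduced for the example.

```python
import re

# Assumption: Python's `re` is used purely for illustration. It spells named
# groups as (?P<name>...) and reads the bare name inside (?(...)...) as a
# reference to that group, similar to the PCRE behavior described in the diff.
NAMED = re.compile(r"(?P<group1>x)?(?(group1)y)")

# The numeric form of the same condition refers to the group by its index.
NUMBERED = re.compile(r"(x)?(?(1)y)")


def accepts(pattern, text):
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # `y` is only required (and only allowed) when the optional group matched `x`.
    for text in ["xy", "", "x", "y"]:
        print(repr(text), accepts(NAMED, text), accepts(NUMBERED, text))
```

Both patterns accept `xy` and the empty string, and reject a lone `x` or a lone `y`, which is the "`y` will only be matched if the group was matched" behavior the hunk describes.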
