Documentation/Evolution/RegexSyntax.md
10 additions & 10 deletions
@@ -14,7 +14,7 @@ This proposal-component focuses on the interior syntax, which is large enough fo
 
 ## Motivation
 
-Swift aims to be a pragmatic programming language, balancing (TODO: prose). Rather than pursue a novel interior syntax, (TODO: prose).
+Swift aims to be a pragmatic programming language, balancing (**TODO(Michael)**: prose). Rather than pursue a novel interior syntax, (**TODO(Michael)**: prose).
 
 Regex interior syntax is part of a larger [proposal](https://forums.swift.org/t/pitch-regular-expression-literals/52820), which in turn is part of a larger [string processing effort](https://forums.swift.org/t/declarative-string-processing-overview/52459).
 
@@ -502,7 +502,6 @@ KnownCondition -> 'R'
 | 'DEFINE'
 | 'VERSION' VersionCheck
 | NumberRef
-| NameRef
 
 PCREVersionCheck -> '>'? '=' PCREVersionNumber
 PCREVersionNumber -> <Int> '.' <Int>
@@ -512,11 +511,12 @@ A conditional evaluates a particular condition, and chooses a branch to match ag
 
 A condition may be:
 
-- A reference to a capture group, which checks whether the group matched successfully.
+- A numeric or delimited named reference to a capture group, which checks whether the group matched successfully.
 - A recursion check on either a particular group or the entire regex. In the former case, this checks to see if the last recursive call is through that group. In the latter case, it checks if the match is currently taking place in any kind of recursive call.
-- An arbitrary recursive regular expression, which is matched against, and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
 - A PCRE version check.
 
+If the condition does not syntactically match any of the above, it is treated as an arbitrary recursive regular expression. This will be matched against, and evaluates to true if the match is successful. It may contain capture groups that add captures to the match.
+
 The `DEFINE` keyword is not used as a condition, but rather a way in which to define a group which is not evaluated, but may be referenced by a subpattern.
 
 ### PCRE backtracking directives
### PCRE backtracking directives
@@ -616,9 +616,9 @@ An absent function is an Oniguruma feature that allows for the easy inversion of
 
 ## Syntactic differences between engines
 
-**TODO: Intro**
+**TODO(Michael, if you want): Intro**
 
-**TODO: Talk about compatibility modes for different engines being a possible future direction?**
+**TODO(Michael, if you want): Talk about compatibility modes for different engines being a possible future direction?**
 
 ### Character class set operations
 
@@ -696,15 +696,15 @@ We aim to support the PCRE behavior.
 
 ### Backreference condition kinds
 
-PCRE and .NET allow for conditional patterns to reference a group by its name, e.g:
+PCRE and .NET allow for conditional patterns to reference a group by its name without any form of delimiter, e.g:
 
 ```
 (?<group1>x)?(?(group1)y)
 ```
 
-where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against.
+where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against. Oniguruma on the other hand will always treat `group1` as an regex condition to match against.
 
-We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by PCRE. **TODO: Is the opposite more common?**
+We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by both PCRE and Oniguruma.
 
 ### `\N`
 
@@ -716,7 +716,7 @@ ICU unifies the character property syntax `\p{...}` with the syntax for POSIX ch
 
 ### Script properties
 
-Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These often get assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extension property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.
+Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These are often assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extension property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.
 
 As such we feel that the more desirable default behavior of shorthand script property syntax e.g `\p{Latin}` is for it to be treated as `\p{Script_Extension=Latin}`. This matches Perl's default behavior. Plain script properties may still be written using the more explicit syntax e.g `\p{Script=Latin}` and `\p{sc=Latin}`.
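
Note on the "Backreference condition kinds" hunk: the conditional semantics it describes can be tried out with a minimal sketch in Python's `re` module, used here only as a stand-in for the proposed Swift syntax. Two assumptions apply. Python spells named groups `(?P<group1>...)` rather than `(?<group1>...)`, and, like PCRE, it always reads the undelimited `(?(group1)...)` form as a reference to the group, so this sketch does not show the .NET or Oniguruma interpretations. The helper name `matches_fully` is hypothetical.

```python
import re

# Python analogue of the pattern `(?<group1>x)?(?(group1)y)` from the diff.
# Assumption: Python's `(?P<name>...)` spelling replaces `(?<name>...)`.
CONDITIONAL = re.compile(r"(?P<group1>x)?(?(group1)y)")


def matches_fully(text: str) -> bool:
    """Return True if the whole of `text` matches the conditional pattern."""
    return CONDITIONAL.fullmatch(text) is not None


if __name__ == "__main__":
    # Print whether each sample is matched in full by the pattern.
    for sample in ["xy", "", "x", "y"]:
        print(repr(sample), matches_fully(sample))
```

With this pattern, `xy` and the empty string match in full, while a lone `x` or a lone `y` does not, because `y` is only attempted when `group1` took part in the match.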