You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: Documentation/Evolution/RegexSyntax.md
+6-2Lines changed: 6 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -620,9 +620,11 @@ An absent function is an Oniguruma feature that allows for the easy inversion of
620
620
621
621
## Syntactic differences between engines
622
622
623
-
The above regular expression grammar covers a superset of the syntax accepted by PCRE, ICU, Oniguruma, .NET, and Java. However there are cases where the same syntax is parsed differently by these engines. This section provides a summary of these differences, and specifies the interpretation that will be made by the Swift regex parser.
623
+
The proposed "syntactic superset" introduces some minor ambiguities, as each engine supports a slightly different set of features. When a particular engine's parser sees a feature it doesn't support, it typically has a fall-back behavior, such as treating the unknown feature as literal contents.
624
624
625
-
These differences inherently mean that our default parser behavior cannot be fully compatible with these other engines. However, this would not preclude the potential future implementation of different compatibility modes for different engines in which we support their parsing behavior of certain syntax.
625
+
Explicit compatibility modes, i.e. precisely mimicking emergent behavior from a specific engine's parser, is deferred as future work from this proposal. Conversion from this "syntactic superset" to a particular engine's syntax (e.g. as an AST "pretty printer") is deferred as future work from this proposal.
626
+
627
+
Below is an exhaustive treatment of every syntactic ambiguity we have encountered.
626
628
627
629
### Character class set operations
628
630
@@ -632,6 +634,7 @@ In a custom character class, some engines allow for binary set operations that t
[UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though doesn't require either.
636
639
637
640
The differing support between engines is conflicting, as engines that don't support a particular operator treat them as literal, e.g `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
@@ -650,6 +653,7 @@ This allows e.g `[[a]b[c]]`, which is interpreted the same as `[abc]`. It also a
650
653
|------|-----|--------|-----------|------|------|
651
654
| ❌ | ✅ | 💡 | ✅ | ❌ | ✅ |
652
655
656
+
653
657
UTS#18 doesn't require this, though it does suggest it as a way to clarify precedence for chains of character class set operations e.g `[\w--\d&&\s]`, which the user could write as `[[\w--\d]&&\s]`.
654
658
655
659
PCRE does not support this feature, and as such treats `]` as the closing character of the custom character class. Therefore `[[a]b[c]]` is interpreted as the character class `["[", "a"]`, followed by literal `b`, and then the character class `["c"]`, followed by literal `]`.
0 commit comments