Commit b83c4e7

Update RegexSyntax.md
1 parent 74acaa8 commit b83c4e7

1 file changed: +6 -2 lines changed
Documentation/Evolution/RegexSyntax.md

Lines changed: 6 additions & 2 deletions
@@ -620,9 +620,11 @@ An absent function is an Oniguruma feature that allows for the easy inversion of
 
 ## Syntactic differences between engines
 
-The above regular expression grammar covers a superset of the syntax accepted by PCRE, ICU, Oniguruma, .NET, and Java. However there are cases where the same syntax is parsed differently by these engines. This section provides a summary of these differences, and specifies the interpretation that will be made by the Swift regex parser.
+The proposed "syntactic superset" introduces some minor ambiguities, as each engine supports a slightly different set of features. When a particular engine's parser sees a feature it doesn't support, it typically has a fall-back behavior, such as treating the unknown feature as literal contents.
 
-These differences inherently mean that our default parser behavior cannot be fully compatible with these other engines. However, this would not preclude the potential future implementation of different compatibility modes for different engines in which we support their parsing behavior of certain syntax.
+Explicit compatibility modes, i.e. precisely mimicking emergent behavior from a specific engine's parser, are deferred as future work from this proposal. Conversion from this "syntactic superset" to a particular engine's syntax (e.g. as an AST "pretty printer") is also deferred as future work.
+
+Below is an exhaustive treatment of every syntactic ambiguity we have encountered.
 
 ### Character class set operations
 
@@ -632,6 +634,7 @@ In a custom character class, some engines allow for binary set operations that t
 |------|-----|--------|-----------|------|------|
 || Intersection `&&`, Subtraction `--` | Intersection, Subtraction | Intersection `&&` | Subtraction via `-` | Intersection `&&` |
 
+
 [UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though it doesn't require either.
 
 The differing support between engines is conflicting, as engines that don't support a particular operator treat it as literal, e.g. `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
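
The literal fall-back described above can be observed in any engine that lacks set operations. The following is a minimal sketch that assumes Python's standard `re` module as a stand-in for such an engine; it is not PCRE itself, and the helper name `compile_quietly` is hypothetical. Python emits a `FutureWarning` for `&&` inside a class because the syntax may gain set-operation meaning later, so the sketch silences that warning.

```python
import re
import warnings


def compile_quietly(pattern):
    """Compile a pattern while ignoring Python's FutureWarning about
    possible future set-operation syntax (hypothetical helper)."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return re.compile(pattern)


# In an engine without set operations, `&&` is just two literal '&' characters,
# so the class contains 'x', '&', and 'y'.
literal_class = compile_quietly(r"[x&&y]")

# Test each character on its own against the class.
matches = [ch for ch in "x&yz" if literal_class.fullmatch(ch)]
print(matches)  # ['x', '&', 'y']
```

Under an intersection reading, `[x&&y]` would be the intersection of `x` and `y`, which is empty, so none of these characters would match.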
@@ -650,6 +653,7 @@ This allows e.g `[[a]b[c]]`, which is interpreted the same as `[abc]`. It also a
 |------|-----|--------|-----------|------|------|
 ||| 💡 ||||
 
+
 UTS#18 doesn't require this, though it does suggest it as a way to clarify precedence for chains of character class set operations, e.g. `[\w--\d&&\s]`, which the user could write as `[[\w--\d]&&\s]`.
 
 PCRE does not support this feature, and as such treats `]` as the closing character of the custom character class. Therefore `[[a]b[c]]` is interpreted as the character class `["[", "a"]`, followed by literal `b`, and then the character class `["c"]`, followed by literal `]`.
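
The non-nested interpretation described above can be reproduced with an engine that also lacks nested character classes. This is a minimal sketch that assumes Python's standard `re` module, which happens to parse `[[a]b[c]]` the same way as described for PCRE; it does not exercise PCRE itself. Python emits a `FutureWarning` for `[[`, which the sketch suppresses.

```python
import re
import warnings

# Python warns that `[[` may become a nested set in the future; ignore that here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(r"[[a]b[c]]")

# Parsed as: class {'[', 'a'}, literal 'b', class {'c'}, literal ']'.
candidates = ["abc]", "[bc]", "b", "a"]
results = {text: pattern.fullmatch(text) is not None for text in candidates}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
# 'abc]': True
# '[bc]': True
# 'b': False
# 'a': False
```

Under the nested interpretation, where `[[a]b[c]]` is equivalent to `[abc]`, the results would be reversed: `"a"` and `"b"` would match and the four-character inputs would not.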
