Documentation/Evolution/RegexSyntax.md
Lines changed: 19 additions & 10 deletions
Original file line number
Diff line number
Diff line change
@@ -301,23 +301,34 @@ Set -> Member+
301
301
Member -> CustomCharClass | Quote | Range | Atom
302
302
Range -> RangeElt `-` RangeElt
303
303
RangeElt -> <Char> | UniScalar | EscapeSequence
304
-
SetOp -> '&&' | '--' | '~~'
304
+
SetOp -> '&&' | '--' | '~~' | '-'
305
305
```
306
306
307
307
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
308
308
309
-
- Builtin character classes, except `.`, `\O`, and `\X`
309
+
- Builtin character classes, except `.`, `\O`, `\X`, and `\N`
310
310
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
311
311
- Unicode scalars
312
312
- Named characters
313
313
- Character properties
314
314
- Plain literal characters
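
The change of meaning for `\b` listed above can be observed in an existing engine. The following is a minimal sketch using Python's `re` module purely as a stand-in (it is not the engine this document describes, and the variable names are illustrative); Python happens to apply the same rule, treating `\b` as a word boundary outside a custom character class and as backspace inside one.

```python
import re

# Outside a custom character class, \b is a zero-width word boundary.
word_boundary = re.compile(r"\bcat\b")

# Inside a custom character class, \b is the backspace character (U+0008).
backspace_class = re.compile(r"[\b]")

boundary_hit = word_boundary.search("a cat sat") is not None
backspace_hit = backspace_class.fullmatch("\x08") is not None
backspace_in_text = backspace_class.search("a cat sat") is not None
```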
315
315
316
-
Ranges of characters may be specified with `-`, e.g. `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` does not appear in a valid position, it is interpreted as literal, e.g. `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
316
+
Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes such as `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g. `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and digit characters.
317
317
318
-
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
318
+
Quoted sequences may be used to escape the contained characters, e.g. `[\Q]\E]` is the character class of the literal character `]`.
319
319
320
-
Quoted sequences may appear with custom character classes, e.g `[\Q]\E]`, and escape the contained characters.
320
+
Ranges of characters may be specified with `-`, e.g. `[a-z]` matches against the letters from `a` to `z`. Only Unicode scalars and literal characters are valid range operands. If `-` appears at the start or end of a custom character class, it is interpreted as literal, e.g. `[-a-]` is the character class of `-` and `a`.
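
The range and literal-hyphen rules above can be checked against an existing engine. This is a small sketch using Python's `re` module as a stand-in (an assumption made for illustration only; the variable names are not from the proposal), which follows the same convention for a leading or trailing `-`.

```python
import re

# '-' between two operands forms a range from 'a' to 'z'.
letters = re.compile(r"[a-z]")

# '-' at the start or end of the class is a literal hyphen,
# so this class contains exactly '-' and 'a'.
hyphen_or_a = re.compile(r"[-a-]")

range_matches_m = letters.fullmatch("m") is not None
range_matches_hyphen = letters.fullmatch("-") is not None
literal_members = [ch for ch in "-ab" if hyphen_or_a.fullmatch(ch)]
```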
321
+
322
+
Operators may be used to apply set operations to character class members. The operators supported are:
323
+
324
+
- `&&`: Intersection of the LHS and RHS
325
+
- `--`: Subtraction of the RHS from the LHS
326
+
- `~~`: Symmetric difference of the RHS and LHS
327
+
- `-`: .NET's spelling of subtraction of the RHS from the LHS
328
+
329
+
These operators have a lower precedence than the implicit union of members, e.g. `[ac-d&&a[d]]` is an intersection of the character classes `[ac-d]` and `[ad]`.
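
The precedence rule can be made concrete by modelling each character class as the set of characters it matches. The sketch below is only a model, written with Python's built-in `set` type rather than a regex engine. It assumes ASCII digits for `\d`, and every variable name is illustrative.

```python
import string

# Simplifying assumption: \d is modelled as the ASCII digits only.
DIGITS = set(string.digits)

# Implicit union of adjacent members: [[ab]c\d]
union = {"a", "b"} | {"c"} | DIGITS

# [ac-d&&a[d]]: the implicit union on each side of the operator is
# formed first, giving [ac-d] on the left and [ad] on the right.
lhs = {"a"} | {"c", "d"}  # members: a, c-d
rhs = {"a"} | {"d"}       # members: a, [d]

intersection = lhs & rhs          # models &&
subtraction = lhs - rhs           # models -- (and .NET's -)
symmetric_difference = lhs ^ rhs  # models ~~
```

Had the operator bound more tightly than the implicit union, `[ac-d&&a[d]]` would instead read as `a`, then `c-d && a`, then `[d]`, which is a different class.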
330
+
331
+
To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction will only be parsed if the right-hand side is a nested custom character class. We intend to follow this behavior.
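
The disambiguation described above can be sketched as a single lookahead decision. The helper `classify_hyphen` below is hypothetical and deliberately simplified: it looks only at the position of `-` and the character after it, and it ignores escape sequences, quoted sequences, and the check that range operands are valid. It is not the proposed parser.

```python
def classify_hyphen(body: str, i: int) -> str:
    """Classify the '-' at index i of a custom character class body.

    `body` is the text between the outermost '[' and ']'.
    Hypothetical helper for illustration only.
    """
    if body[i] != "-":
        raise ValueError("index does not point at '-'")
    # A '-' at the very start or end of the class is a literal.
    if i == 0 or i == len(body) - 1:
        return "literal"
    # .NET rule: subtraction only when the RHS is a nested class.
    if body[i + 1] == "[":
        return "subtraction"
    # Otherwise treat it as a range between its neighbors.
    return "range"
```

For the body of `[a-z-[aeiou]]`, the first `-` is classified as a range and the second as a subtraction.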
321
332
322
333
### Character properties
323
334
@@ -583,7 +594,7 @@ Another conflict arises with .NET's support of using the `-` character in a cust
583
594
584
595
We intend to support the operators `&&`, `--`, and `~~`. This means that any regex literal that contains these sequences in a custom character class, but was written for an engine that does not support that operation, will have a different semantic meaning in our engine. However, this ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant.
585
596
586
-
We also intend on supporting the `-` operator (**TODO: Justify**).
597
+
In the interests of compatibility, we also intend to support the `-` operator, though we likely want to emit a warning and encourage users to switch to `--`.
587
598
588
599
### Nested custom character classes
589
600
@@ -639,7 +650,7 @@ PCRE and Oniguruma both support changing the active matching options through an
639
650
640
651
These sound similar, but have different semantics around alternations, e.g. for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However, in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
641
652
642
-
We aim to support the Oniguruma behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
653
+
We aim to support the PCRE behavior.
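
The difference between the two interpretations shows up once each is written out with explicit scoped groups. The sketch below uses Python's `re` module as a stand-in engine (an assumption for illustration; Python itself rejects a bare `(?i)` in the middle of a pattern on recent versions, so only the two desugared forms quoted above are compared).

```python
import re

# Oniguruma reading of a(?i)b|c|d: 'a' is a prefix of the whole alternation.
oniguruma_like = re.compile(r"a(?i:b|c|d)")

# PCRE reading of a(?i)b|c|d: 'a' belongs to the first branch only.
pcre_like = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")

def accepted(pattern, inputs):
    """Return the inputs that the pattern matches in full."""
    return [s for s in inputs if pattern.fullmatch(s)]

inputs = ["aB", "aC", "C", "AB"]
oniguruma_result = accepted(oniguruma_like, inputs)
pcre_result = accepted(pcre_like, inputs)
```

Under the Oniguruma reading, `aC` matches and `C` alone does not. Under the PCRE reading it is the other way around. In both readings `AB` is rejected, since `a` precedes the option change and stays case-sensitive.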
643
654
644
655
### Backreference condition kinds
645
656
@@ -659,9 +670,7 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
659
670
660
671
### Extended character property syntax
661
672
662
-
**TODO: Can this be conflicting?**
663
-
664
-
ICU (**TODO: any others?**) unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties.
673
+
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We intend to support this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
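
One way to picture the unification is that both spellings reduce to the same property expression before it is interpreted. The helper `property_expression` below is a hypothetical sketch, not ICU's or this proposal's implementation. It only strips the delimiters and does not handle negated forms or validate the property name.

```python
import re

# Either \p{...} or [:...:] wrapping a property expression.
_PROPERTY = re.compile(
    r"\\p\{(?P<braced>[^}]+)\}|\[:(?P<posix>[^:\]]+):\]"
)

def property_expression(text: str):
    """Return the inner property expression, or None if `text` is
    neither a \\p{...} nor a [:...:] form. Hypothetical helper."""
    m = _PROPERTY.fullmatch(text)
    if m is None:
        return None
    return m.group("braced") or m.group("posix")
```

With this model, `\p{script=Greek}` and `[:script=Greek:]` both yield `script=Greek`, and a plain POSIX name such as `[:alpha:]` yields `alpha`, so a single property grammar can serve both forms.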