Commit dbdef62

Update RegexSyntax.md
1 parent 93cc5ca commit dbdef62

1 file changed

+19
-10
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 19 additions & 10 deletions
@@ -301,23 +301,34 @@ Set -> Member+
301301
Member -> CustomCharClass | Quote | Range | Atom
302302
Range -> RangeElt `-` RangeElt
303303
RangeElt -> <Char> | UniScalar | EscapeSequence
304-
SetOp -> '&&' | '--' | '~~'
304+
SetOp -> '&&' | '--' | '~~' | '-'
305305
```
306306

307307
Custom character classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:
308308

309-
- Builtin character classes, except `.`, `\O`, and `\X`
309+
- Builtin character classes, except `.`, `\O`, `\X`, and `\N`
310310
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
311311
- Unicode scalars
312312
- Named characters
313313
- Character properties
314314
- Plain literal characters
315315

316-
Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` does not appear in a valid position, it is interpreted as literal, e.g `[-a]` is the character class of `-` and `a`. **TODO: .NET's use of it for subtraction**
316+
Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes such as `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g. `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and the digit characters.
317317

318-
Custom character classes may be nested within each other, and may be used with set operations. The supported set operations are intersection `&&`, subtraction `--`, and symmetric difference `~~`.
318+
Quoted sequences may be used to escape the contained characters, e.g. `[\Q]\E]` is the character class of the literal character `]`.
319319

320-
Quoted sequences may appear with custom character classes, e.g `[\Q]\E]`, and escape the contained characters.
320+
Ranges of characters may be specified with `-`, e.g. `[a-z]` matches the letters from `a` to `z`. Only Unicode scalars and literal characters are valid range operands. If `-` appears at the start or end of a custom character class, it is interpreted as a literal, e.g. `[-a-]` is the character class of `-` and `a`.
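
As a rough illustration of the range and literal `-` rules, the sketch below uses Python's `re` module. Python is assumed here only as a stand-in engine that happens to treat a leading or trailing `-` as a literal; it is not the engine this document proposes, and the variable names are illustrative.

```python
import re

# `[a-z]`: a `-` between two literal characters forms a range.
assert re.fullmatch(r"[a-z]", "m") is not None
assert re.fullmatch(r"[a-z]", "-") is None

# `[-a-]`: a `-` at the start or end of the class is a literal,
# so the class contains exactly `-` and `a`.
literal_dash = re.compile(r"[-a-]")
matched = [c for c in "-ab" if literal_dash.fullmatch(c)]
```

Here `matched` collects the characters of `"-ab"` that the class accepts, which makes it easy to check that `b` is excluded.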
321+
322+
Operators may be used to apply set operations to character class members. The operators supported are:
323+
324+
- `&&`: Intersection of the LHS and RHS
325+
- `--`: Subtraction of the RHS from the LHS
326+
- `~~`: Symmetric difference of the RHS and LHS
327+
- `-`: .NET's spelling of subtraction of the RHS from the LHS
328+
329+
These operators have a lower precedence than the implicit union of members, e.g. `[ac-d&&a[d]]` is the intersection of the character classes `[ac-d]` and `[ad]`.
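
The operator semantics and the precedence rule can be sketched by modelling each character class as a plain Python `set` of characters. This is only a model of the intended meaning, not a parser or the proposed implementation, and `char_range` is a hypothetical helper introduced for the example.

```python
def char_range(lo, hi):
    """Characters from lo to hi inclusive, modelling a Range such as `c-d`."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

# Implicit union binds tighter than the operators, so `[ac-d&&a[d]]`
# groups as `[ac-d]` && `[a[d]]`.
lhs = {"a"} | char_range("c", "d")  # [ac-d]
rhs = {"a"} | {"d"}                 # [a[d]], i.e. [ad]

intersection = lhs & rhs  # `&&`
subtraction = lhs - rhs   # `--` (and .NET's `-` with a nested class as the RHS)

# `~~` keeps the characters that are in exactly one operand,
# e.g. `[ab~~bc]` under this model.
symmetric_difference = {"a", "b"} ^ {"b", "c"}
```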
330+
331+
To avoid ambiguity between .NET's subtraction syntax and range syntax, .NET specifies that a subtraction is only parsed if the right-hand side is a nested custom character class. We intend to follow this behavior.
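
A minimal sketch of that disambiguation rule, assuming a hypothetical helper `classify_dash` that looks at a single `-` inside the body of a custom character class. It deliberately ignores escapes, quoted sequences, and the multi-character operators such as `--`, so it shows only the decision described above rather than a real parser.

```python
def classify_dash(class_body: str, index: int) -> str:
    """Classify the `-` at class_body[index] (hypothetical helper).

    class_body is the text between the outer `[` and `]`.
    """
    if class_body[index] != "-":
        raise ValueError("expected '-' at index")
    if index == 0 or index == len(class_body) - 1:
        return "literal"      # `-` at the start or end is a literal
    if class_body[index + 1] == "[":
        return "subtraction"  # RHS is a nested custom character class
    return "range"            # otherwise treat it as a range separator
```

For example, under these assumptions `classify_dash("a-z", 1)` yields `"range"`, while `classify_dash("a-[b]", 1)` yields `"subtraction"`.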
321332

322333
### Character properties
323334

@@ -583,7 +594,7 @@ Another conflict arises with .NET's support of using the `-` character in a cust
583594

584595
We intend to support the operators `&&`, `--`, and `~~`. This means that a regex literal that contains one of these sequences in a custom character class, but was written for an engine that does not support the operation, will have a different meaning in our engine. However, this ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant.
585596

586-
We also intend on supporting the `-` operator (**TODO: Justify**).
597+
In the interests of compatibility, we also intend to support the `-` operator, though we likely want to emit a warning and encourage users to switch to `--`.
587598

588599
### Nested custom character classes
589600

@@ -639,7 +650,7 @@ PCRE and Oniguruma both support changing the active matching options through an
639650

640651
These sound similar, but have different semantics around alternations. For example, given `a(?i)b|c|d`, Oniguruma treats it as `a(?i:b|c|d)`, where `a` is no longer part of the alternation, whereas PCRE treats it as `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
641652

642-
We aim to support the Oniguruma behavior. **TODO: The PCRE behavior is more complex for the parser, but seems less surprising, maybe that should become the default?**
653+
We aim to support the PCRE behavior.
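
The difference between the two scopings can be observed by spelling each one out with explicit scoped groups. The sketch below does this with Python's `re` module, which is assumed here only as a convenient engine that supports `(?i:...)`; the names `oniguruma_like`, `pcre_like`, and `accepts` are illustrative.

```python
import re

# Explicit spellings of how each engine is described as scoping
# `a(?i)b|c|d`, taken from the expansions given in the text.
oniguruma_like = re.compile(r"a(?i:b|c|d)")
pcre_like = re.compile(r"a(?i:b)|(?i:c)|(?i:d)")

def accepts(pattern, text):
    """True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None

# Each entry maps an input to (Oniguruma-style result, PCRE-style result).
results = {
    text: (accepts(oniguruma_like, text), accepts(pcre_like, text))
    for text in ["aB", "aC", "C", "A"]
}
```

The inputs `aC` and `C` are the ones that separate the two readings: the first is accepted only when `a` prefixes the whole alternation, the second only when `c` is a top-level alternative.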
643654

644655
### Backreference condition kinds
645656

@@ -659,9 +670,7 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
659670

660671
### Extended character property syntax
661672

662-
**TODO: Can this be conflicting?**
663-
664-
ICU (**TODO: any others?**) unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties.
673+
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar. This allows referencing any Unicode character property in addition to the POSIX properties. We intend to support this. It is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
665674

666675
### Extended syntax modes
667676
