Commit 31c5cf5

Update RegexSyntax.md
1 parent b83c4e7 commit 31c5cf5

1 file changed

+12
-10
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 12 additions & 10 deletions
@@ -637,13 +637,13 @@ In a custom character class, some engines allow for binary set operations that t
637637

638638
[UTS#18][uts18] requires intersection and subtraction, and uses the operation spellings `&&` and `--` in its examples, though it doesn't mandate a particular spelling. In particular, conforming implementations could spell the subtraction `[[x]--[y]]` as `[[x]&&[^y]]`. UTS#18 also suggests a symmetric difference operator `~~`, and uses an explicit `||` operator in examples, though doesn't require either.
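The equivalence between the two spellings of subtraction can be checked with a toy model in which a character class is a plain set of characters. This is only a sketch: the `UNIVERSE` set and the `complement` helper are made up for illustration, and real engines operate over all of Unicode.

```python
# Toy model: a custom character class is a set of characters drawn from a
# small, made-up universe (an assumption; real engines cover all of Unicode).
UNIVERSE = set("abcdefxyz&-")


def complement(cls):
    """Model a negated class such as [^y] relative to the toy universe."""
    return UNIVERSE - cls


x = set("abcx")
y = set("bcy")

intersection = x & y          # models [[x]&&[y]]
subtraction = x - y           # models [[x]--[y]]
symmetric_difference = x ^ y  # models [[x]~~[y]]
union = x | y                 # models [[x]||[y]]

# Subtraction spelled as intersection with the complement: [[x]&&[^y]]
subtraction_via_intersection = x & complement(y)
```

In this model `subtraction` and `subtraction_via_intersection` hold the same characters, which is why a conforming implementation could offer either spelling.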
639639

640-
The differing support between engines is conflicting, as engines that don't support a particular operator treat them as literal, e.g `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
640+
Engines that don't support a particular operator fall back to treating it as literal, e.g. `[x&&y]` in PCRE is the character class of `["x", "&", "y"]` rather than an intersection.
641641

642-
Another conflict arises with .NET's support of using the `-` character in a custom character class to denote both a range as well as a set subtraction. .NET disambiguates this by only permitting its use as a subtraction if the right hand operand is a nested custom character class, otherwise it is a range. This conflicts with e.g ICU where `[x-[y]]`, in which the `-` is treated as literal.
642+
Unlike other engines, .NET supports the use of `-` to denote both a range and a set subtraction. .NET disambiguates this by only permitting its use as a subtraction if the right-hand operand is a nested custom character class; otherwise it is a range operator. This conflicts with e.g. ICU, where the `-` in `[x-[y]]` is treated as literal.
643643

644644
We intend to support the operators `&&`, `--`, and `~~`. This means that any regex literal containing these sequences in a custom character class while being written for an engine not supporting that operation will have a different semantic meaning in our engine. However this ought not to be a common occurrence, as specifying a character multiple times in a custom character class is redundant.
645645

646-
In the interests of compatibility, we also intend on supporting the `-` operator, though we likely want to emit a warning and encourage users to switch to `--`.
646+
In the interests of compatibility, we also intend on supporting the `-` operator, though we will likely want to emit a warning and encourage users to switch to `--`.
647647

648648
### Nested custom character classes
649649

@@ -660,6 +660,8 @@ PCRE does not support this feature, and as such treats `]` as the closing charac
660660

661661
.NET does not support nested character classes in general, although allows them as the right-hand side of a subtraction operation.
662662

663+
We intend on permitting nested custom character classes.
664+
663665
### `\U`
664666

665667
In PCRE, if `PCRE2_ALT_BSUX` or `PCRE2_EXTRA_ALT_BSUX` are specified, `\U` matches literal `U`. However in ICU, `\Uhhhhhhhh` matches a hex sequence. We intend on following the ICU behavior.
@@ -672,9 +674,9 @@ This quantifier is supported by Oniguruma, but in PCRE it matches the literal ch
672674

673675
This syntax is implemented in a variety of different ways depending on the engine. In ICU and Java, it is always a backreference unless prefixed with `0`, in which case it is an octal sequence.
674676

675-
In PCRE, Oniguruma, and .NET, it is also always an octal sequence if prefixed with `0`, however there are also other conditions where it may be treated as octal. These conditions vary slightly been the engines. In PCRE, it will be treated as backreference if any of the following hold:
677+
In PCRE, Oniguruma, and .NET, it is also always an octal sequence if prefixed with `0`; however, there are other cases where it may be treated as octal. These cases vary slightly between the engines. In PCRE, it will be treated as a backreference if any of the following hold:
676678

677-
- Its `0 < n < 10`.
679+
- Its value is `0 < n < 10`.
678680
- Its first digit is `8` or `9`.
679681
- Its value corresponds to a valid *prior* group number.
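The three PCRE conditions listed above can be sketched as a small predicate. The function name and the `prior_group_count` parameter are hypothetical, and the sketch models only the rules as stated here, not PCRE's actual parser.

```python
def pcre_is_backreference(digits, prior_group_count):
    """Return True if the digits of a \\DDD escape form a backreference.

    `digits` is the text after the backslash, e.g. "12" for \\12.
    `prior_group_count` (hypothetical parameter) is the number of capture
    groups opened before the escape.
    """
    if digits.startswith("0"):
        return False  # a leading 0 always means an octal sequence
    n = int(digits)
    return (
        0 < n < 10                   # single-digit value
        or digits[0] in "89"         # cannot be octal
        or n <= prior_group_count    # refers to a valid prior group
    )
```

For example, `\12` with only three prior groups falls through all three conditions and is read as octal, while the same escape after twelve groups is a backreference.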
680682

@@ -684,23 +686,23 @@ Oniguruma follows all of these except the second. If the first digit is `8` or `
684686

685687
We intend to implement a simpler behavior more in line with ICU and Java. A `\DDD` sequence that does not start with a `0` will be treated as a backreference; otherwise it will be treated as an octal sequence. If an invalid backreference is formed with this syntax, we will suggest prefixing with a `0` if an octal sequence is desired.
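A minimal sketch of the intended rule, including the suggested diagnostic, might look as follows. The function name, the `group_count` parameter, and the wording of the error message are assumptions made for illustration, and the digits are assumed to be valid octal when they start with `0`.

```python
def interpret_numeric_escape(digits, group_count):
    """Classify the digits of a \\DDD escape under the simpler rule.

    `group_count` (hypothetical parameter) is the number of capture groups
    that a backreference may refer to.
    """
    if digits.startswith("0"):
        # Leading 0: always an octal sequence.
        return ("octal", int(digits, 8))
    n = int(digits)
    if n > group_count:
        # Invalid backreference: point the user at the octal spelling.
        raise ValueError(
            f"no capture group {n}; prefix with 0 (\\0{digits}) "
            "if an octal sequence was intended"
        )
    return ("backreference", n)
```

Unlike the PCRE rule, the result never depends on how many groups happen to precede the escape; the group count only decides whether a backreference is valid.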
686688

687-
One further difference between engines exists with this syntax in the octal sequence case. In ICU, up to 3 additional digits are read after the `0`. In PCRE, only 2 additional digits may be interpreted as octal, the last is literal. We intend to follow the ICU behavior, as it is necessary when requiring a `0` prefix.
689+
One further difference exists between engines in the octal sequence case. In ICU, up to 3 additional digits are read after the `0`. In PCRE, only 2 additional digits may be interpreted as octal, the last is literal. We intend to follow the ICU behavior, as it is necessary when requiring a `0` prefix.
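The difference in digit counts can be illustrated with a small reader that takes the maximum number of additional digits as a parameter. This is a sketch under stated assumptions: `read_octal` and `max_extra_digits` are hypothetical names, and any per-engine limit on the resulting value is ignored.

```python
OCTAL_DIGITS = "01234567"


def read_octal(after_backslash, max_extra_digits):
    """Read an octal escape that starts with the 0 prefix.

    Returns (value, remaining_text). `max_extra_digits` is how many digits
    may follow the leading 0 (3 models ICU, 2 models PCRE as described).
    """
    if not after_backslash.startswith("0"):
        raise ValueError("expected a 0 prefix")
    end = 1
    while (
        end < len(after_backslash)
        and end - 1 < max_extra_digits
        and after_backslash[end] in OCTAL_DIGITS
    ):
        end += 1
    return int(after_backslash[:end], 8), after_backslash[end:]


icu_like = read_octal("0123", max_extra_digits=3)   # consumes all of "0123"
pcre_like = read_octal("0123", max_extra_digits=2)  # leaves "3" as a literal
```

With the input `\0123`, the three-digit reading yields a single scalar, while the two-digit reading yields a different scalar followed by a literal `3`.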
688690

689691
### `\x`
690692

691693
In PCRE, a bare `\x` denotes the NUL character (`U+00`). In Oniguruma, it denotes literal `x`. We intend on following the PCRE behavior.
692694

693695
### Whitespace in ranges
694696

695-
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range.
697+
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces e.g `x{2, 4}`, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range.
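A whitespace-tolerant range parser could be sketched as below. The function name is hypothetical, only the `{lower,upper}` form is handled, and `str.strip()` removes a broader set of whitespace than an engine would necessarily accept, so treat this as an illustration rather than the intended implementation.

```python
def parse_range_quantifier(text):
    """Parse '{lower,upper}', allowing whitespace around the bounds.

    Returns (lower, upper), or None if the text is not a range quantifier.
    """
    if not (text.startswith("{") and text.endswith("}")):
        return None
    parts = text[1:-1].split(",")
    if len(parts) != 2:
        return None
    lower, upper = (part.strip() for part in parts)
    # Only plain ASCII digits count as bounds.
    for bound in (lower, upper):
        if not (bound.isascii() and bound.isdigit()):
            return None
    return int(lower), int(upper)
```

Under this sketch `{2,4}`, `{2, 4}`, and `{ 2 , 4 }` all parse to the same pair of bounds instead of the spaced forms degrading to literal characters.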
696698

697699
### Implicitly-scoped matching option scopes
698700

699701
PCRE and Oniguruma both support changing the active matching options through an isolated group e.g `(?i)`. However, they have differing semantics when it comes to their scoping. In Oniguruma, it is treated as an implicit new scope that wraps everything until the end of the current group. In PCRE, it is treated as changing the matching option for all the following expressions until the end of the group.
700702

701703
These sound similar, but have different semantics around alternations, e.g for `a(?i)b|c|d`, in Oniguruma this becomes `a(?i:b|c|d)`, where `a` is no longer part of the alternation. However in PCRE it becomes `a(?i:b)|(?i:c)|(?i:d)`, where `a` remains a child of the alternation.
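The two readings can be made concrete with a toy rewrite that turns the isolated group into explicit scoped groups. The helper names are hypothetical, and the rewrite assumes a flat pattern with no nested groups and no `|` before the option. Python's `re` module, which supports scoped `(?i:...)` groups, is used here only to check the rewritten patterns.

```python
import re


def oniguruma_reading(pattern, option="i"):
    """Wrap everything after the isolated group in one scoped group."""
    prefix, rest = pattern.split(f"(?{option})", 1)
    return f"{prefix}(?{option}:{rest})"


def pcre_reading(pattern, option="i"):
    """Apply the option to each following branch of the alternation."""
    prefix, rest = pattern.split(f"(?{option})", 1)
    first, *others = rest.split("|")
    branches = [f"{prefix}(?{option}:{first})"]
    branches += [f"(?{option}:{branch})" for branch in others]
    return "|".join(branches)


onig = oniguruma_reading("a(?i)b|c|d")  # "a(?i:b|c|d)"
pcre = pcre_reading("a(?i)b|c|d")       # "a(?i:b)|(?i:c)|(?i:d)"

# Under the PCRE reading "C" matches on its own; under the Oniguruma
# reading the leading "a" is required before any branch.
pcre_matches_c = re.fullmatch(pcre, "C") is not None
onig_matches_c = re.fullmatch(onig, "C") is not None
onig_matches_ac = re.fullmatch(onig, "aC") is not None
```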
702704

703-
We aim to support the PCRE behavior.
705+
We intend on matching the PCRE behavior.
704706

705707
### Backreference condition kinds
706708

@@ -730,7 +732,7 @@ As such we feel that the more desirable default behavior of shorthand script pro
730732

731733
### Extended syntax modes
732734

733-
Various regex engines offer an "extended syntax" where whitespace is treated as non-semantic (e.g `a b c` is equivalent to `abc`), in addition to allowing end-of-line comments `# comment`. In PCRE, this enabled through the `(?x)` and `(?xx)` matching options, where the former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.
735+
Various regex engines offer an "extended syntax" where whitespace is treated as non-semantic (e.g `a b c` is equivalent to `abc`), in addition to allowing end-of-line comments `# comment`. In PCRE, this is enabled through the `(?x)` and `(?xx)` matching options, where the former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.
734736

735737
Oniguruma, Java, and ICU however enable the more broad behavior under `(?x)`. We therefore intend to follow this behavior, with `(?x)` and `(?xx)` being treated the same.
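The narrower of the two behaviors can be observed in Python's `re` module, which is not one of the engines discussed here and is used only as a convenient stand-in: its verbose mode ignores whitespace and end-of-line comments outside character classes, but keeps whitespace inside a custom character class significant.

```python
import re

# Whitespace and the trailing comment are non-semantic outside a class.
spaced = re.compile(r"(?x) a b c  # comment")
matches_abc = spaced.fullmatch("abc") is not None

# Inside a custom character class the space is still a literal member.
in_class = re.compile(r"(?x)[a b]")
class_matches_space = in_class.fullmatch(" ") is not None
```

Under the broader behavior described above, the space inside `[a b]` would also be non-semantic, so the class would contain only `a` and `b`.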
736738

@@ -744,7 +746,7 @@ Different regex engines also have different rules around what characters are con
744746
- Line separator `U+2028`
745747
- Paragraph separator `U+2029`
746748

747-
This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additionally, in a custom character class, PCRE only considers the space and tab characters as whitespace. Other engines do not differentiate between whitespace characters inside and outside custom character classes, and appear to follow a subset of this list. Therefore we intend to support exactly the characters in this list for the purposes of non-semantic whitespace.
749+
This is a subset of the scalars matched by `UnicodeScalar.isWhitespace`. Additionally, in a custom character class, PCRE only considers the space and tab characters as whitespace. Other engines do not differentiate between whitespace characters inside and outside custom character classes, and appear to follow a subset of this list. Therefore we intend to support exactly the characters in this list for the purposes of non-semantic whitespace parsing.
748750

749751
### Group numbering
750752
