Documentation/Evolution/RegexSyntax.md
16 additions & 10 deletions
@@ -100,7 +100,7 @@ Atom -> Anchor
100
100
| Callout
101
101
| CharacterProperty
102
102
| EscapeSequence
103
-
| NamedCharacter
103
+
| NamedScalar
104
104
| Subpattern
105
105
| UniScalar
106
106
| '\K'
@@ -266,7 +266,7 @@ HexDigit -> [0-9a-zA-Z]
266
266
OctalDigit -> [0-7]
267
267
```
268
268
269
-
These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x` that is not followed by any hexadecimal digit characters is treated as `\0`, which matches PCRE's behavior.
269
+
These sequences define a Unicode scalar value to be matched against. Syntax is provided for specifying the scalar value in both hex and octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.
270
270
271
271
### Escape sequences
272
272
@@ -325,7 +325,7 @@ Custom characters classes introduce their own language, in which most regular ex
325
325
- Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
326
326
- Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
327
327
- Unicode scalars
328
-
- Named characters
328
+
- Named scalars
329
329
- Character properties
330
330
- Plain literal characters
331
331
@@ -378,7 +378,7 @@ Unicode properties consist of both a key and a value, e.g `General_Category=Whit
378
378
There are some Unicode properties where the key or value may be inferred. These include:
379
379
380
380
- General category properties e.g `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
381
-
- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.**TODO: Infer as `\p{scx=Greek}` instead?**
381
+
- Script properties e.g `\p{Greek}` is inferred as `\p{Script_Extensions=Greek}`.
382
382
- Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
383
383
- Block properties that begin with the prefix `in`, e.g `\p{inBasicLatin}` is inferred to be `\p{Block=Basic_Latin}`.
384
384
@@ -391,17 +391,15 @@ For non-Unicode properties, only a value is required. These include:
391
391
392
392
Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.
393
393
394
-
### Named characters
394
+
### Named scalars
395
395
396
396
```
397
-
NamedCharacter -> '\N{' CharName '}'
398
-
CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
397
+
NamedScalar -> '\N{' ScalarName '}'
398
+
ScalarName -> 'U+' HexDigit{1...8} | [\s\w-]+
399
399
```
400
400
401
401
Allows a specific Unicode scalar to be specified by name or hexadecimal code point.
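
As a rough illustration of the `ScalarName` grammar above, the following Python sketch resolves the text between `\N{` and `}` to a single scalar. The helper name `resolve_scalar_name` is hypothetical, the sketch restricts the `U+` form to true hexadecimal digits, and it relies on Python's `unicodedata` name table, which may differ from the table a given regex engine uses.

```python
import re
import unicodedata

# Hypothetical helper, not part of the proposal: maps a ScalarName to a
# one-character string holding the named Unicode scalar.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{1,8})")
_NAME = re.compile(r"[\s\w-]+")


def resolve_scalar_name(name: str) -> str:
    code_point = _CODE_POINT.fullmatch(name)
    if code_point:
        value = int(code_point.group(1), 16)
        # Unicode scalar values exclude surrogates and stop at U+10FFFF.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"U+{value:X} is not a Unicode scalar value")
        return chr(value)
    if _NAME.fullmatch(name):
        # unicodedata.lookup raises KeyError for names it does not know.
        return unicodedata.lookup(name)
    raise ValueError(f"invalid scalar name: {name!r}")
```

Under these assumptions, `resolve_scalar_name("U+E9")` and `resolve_scalar_name("LATIN SMALL LETTER E WITH ACUTE")` both yield the same scalar, U+00E9.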
402
402
403
-
**TODO: Should this be called "named scalar" or similar?**
404
-
405
403
### Trivia
406
404
407
405
```
@@ -686,7 +684,7 @@ In PCRE, a bare `\x` denotes the NUL character (`U+00`). In Oniguruma, it denote
686
684
687
685
### Whitespace in ranges
688
686
689
-
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
687
+
In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range.
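
The literal-fallback behavior described above can be observed in Python's `re` module, which is used here only as a stand-in for PCRE because it handles a malformed range the same way. The `normalize_range_whitespace` function is a hypothetical pre-pass showing one way whitespace inside a range could be tolerated. It is a sketch only: it ignores escaped braces and custom character classes, and it is not the parser the proposal describes.

```python
import re

# Matches `{n}`, `{n,}` and `{n,m}` with optional whitespace around each part.
_RANGE = re.compile(r"\{\s*(\d+)\s*(?:,\s*(\d*)\s*)?\}")


def _tighten(match: re.Match) -> str:
    lower, upper = match.group(1), match.group(2)
    if upper is None:
        # No comma was present: an exact count such as `{3}`.
        return "{" + lower + "}"
    return "{" + lower + "," + upper + "}"


def normalize_range_whitespace(pattern: str) -> str:
    """Hypothetical pre-pass: drop whitespace inside range quantifiers."""
    return _RANGE.sub(_tighten, pattern)


# With whitespace, Python's `re` treats the braces as literal characters.
literal_fallback = re.fullmatch(r"x{2, 4}", "x{2, 4}") is not None
# After normalization the same text behaves as a range quantifier.
range_match = re.fullmatch(normalize_range_whitespace("x{2, 4}"), "xxx") is not None
```

Both `literal_fallback` and `range_match` are true here, which shows the two interpretations side by side.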
690
688
691
689
### Implicitly-scoped matching option scopes
692
690
@@ -716,6 +714,12 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat
716
714
717
715
ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We intend to support this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.
718
716
717
+
### Script properties
718
+
719
+
Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These often get assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extensions property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.
720
+
721
+
As such, we feel that the more desirable default behavior of shorthand script property syntax e.g `\p{Latin}` is for it to be treated as `\p{Script_Extensions=Latin}`. This matches Perl's default behavior. Plain script properties may still be written using the more explicit syntax e.g `\p{Script=Latin}` and `\p{sc=Latin}`.
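
The difference between the two interpretations can be sketched in Python with a tiny hand-written table. Everything here is an assumption for illustration: the function names are hypothetical, and the table entries (in particular the script set listed for U+3003) are a hand-copied excerpt that should be checked against the Unicode Character Database files `Scripts.txt` and `ScriptExtensions.txt` before being relied on.

```python
# Illustrative excerpts only; the authoritative data is in the Unicode
# Character Database (Scripts.txt and ScriptExtensions.txt).
SCRIPT = {
    "A": "Latin",
    "\u3042": "Hiragana",  # HIRAGANA LETTER A
    "\u3003": "Common",    # DITTO MARK
}
SCRIPT_EXTENSIONS = {
    # Assumed set for DITTO MARK, copied by hand for this sketch.
    "\u3003": {"Bopomofo", "Hangul", "Han", "Hiragana", "Katakana"},
}


def script_extensions(scalar: str) -> set:
    # A scalar without an explicit entry falls back to its Script value.
    return SCRIPT_EXTENSIONS.get(scalar, {SCRIPT[scalar]})


def matches_script_property(prop: str, scalar: str) -> bool:
    """Hypothetical matcher for the contents of a script property."""
    key, separator, value = prop.partition("=")
    if not separator:
        # Shorthand form: resolved against Script_Extensions.
        return key in script_extensions(scalar)
    if key in ("Script", "sc"):
        return SCRIPT[scalar] == value
    if key in ("Script_Extensions", "scx"):
        return value in script_extensions(scalar)
    raise ValueError(f"unsupported property key: {key!r}")
```

With this table, the shorthand `Hiragana` matches U+3003 through its script extensions, while the explicit `Script=Hiragana` does not, because the scalar's `Script` value is `Common`.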
722
+
719
723
### Extended syntax modes
720
724
721
725
Various regex engines offer an "extended syntax" where whitespace is treated as non-semantic (e.g `a b c` is equivalent to `abc`), in addition to allowing end-of-line comments `# comment`. In PCRE, this is enabled through the `(?x)` and `(?xx)` matching options, where the former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.
@@ -793,3 +797,5 @@ We plan on choosing the canonical spelling **TODO: decide**.