Commit bbb7756

Update RegexSyntax.md
1 parent c913eea commit bbb7756

1 file changed: 16 additions and 10 deletions

Documentation/Evolution/RegexSyntax.md

@@ -100,7 +100,7 @@ Atom -> Anchor
 | Callout
 | CharacterProperty
 | EscapeSequence
-| NamedCharacter
+| NamedScalar
 | Subpattern
 | UniScalar
 | '\K'
@@ -266,7 +266,7 @@ HexDigit -> [0-9a-zA-Z]
 OctalDigit -> [0-7]
 ```

-These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x` that is not followed by any hexadecimal digit characters is treated as `\0`, which matches PCRE's behavior.
+These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x`, when not followed by any hexadecimal digit characters, is treated as `\0`, matching PCRE's behavior.

 ### Escape sequences

@@ -325,7 +325,7 @@ Custom characters classes introduce their own language, in which most regular ex
 - Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
 - Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
 - Unicode scalars
-- Named characters
+- Named scalars
 - Character properties
 - Plain literal characters

@@ -378,7 +378,7 @@ Unicode properties consist of both a key and a value, e.g `General_Category=Whit
 There are some Unicode properties where the key or value may be inferred. These include:

 - General category properties e.g `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
-- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`. **TODO: Infer as `\p{scx=Greek}` instead?**
+- Script properties e.g `\p{Greek}` is inferred as `\p{Script_Extensions=Greek}`.
 - Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
 - Block properties that begin with the prefix `in`, e.g `\p{inBasicLatin}` is inferred to be `\p{Block=Basic_Latin}`.

@@ -391,17 +391,15 @@ For non-Unicode properties, only a value is required. These include:

 Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.

-### Named characters
+### Named scalars

 ```
-NamedCharacter -> '\N{' CharName '}'
-CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
+NamedScalar -> '\N{' ScalarName '}'
+ScalarName -> 'U+' HexDigit{1...8} | [\s\w-]+
 ```

 Allows a specific Unicode scalar to be specified by name or hexadecimal code point.

-**TODO: Should this be called "named scalar" or similar?**
-
 ### Trivia

 ```
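
Note on the `NamedScalar` rule in the hunk above: the two alternatives of `ScalarName` can be exercised with a small Python sketch. This is not the proposal's parser. The function name `parse_named_scalar` is hypothetical, `HexDigit` is assumed to mean `[0-9a-fA-F]`, and name lookup is delegated to Python's `unicodedata` module, which may differ from the name matching the proposal intends.

```python
import re
import unicodedata

# Mirrors the grammar: '\N{' ScalarName '}' where ScalarName is either
# 'U+' followed by 1 to 8 hex digits, or a name made of [\s\w-]+.
# Assumption: HexDigit means [0-9a-fA-F].
_NAMED_SCALAR = re.compile(r"\\N\{(?:U\+([0-9a-fA-F]{1,8})|([\s\w-]+))\}")


def parse_named_scalar(text):
    """Return the scalar denoted by a complete named-scalar atom, or None."""
    m = _NAMED_SCALAR.fullmatch(text)
    if m is None:
        return None
    hex_digits, name = m.group(1), m.group(2)
    if hex_digits is not None:
        value = int(hex_digits, 16)
        # Surrogates and values above U+10FFFF are not Unicode scalar values.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return None
        return chr(value)
    try:
        # Hypothetical choice: use Python's own name table for the lookup.
        return unicodedata.lookup(name.strip())
    except KeyError:
        return None
```

For example, `parse_named_scalar(r"\N{U+41}")` and `parse_named_scalar(r"\N{LATIN CAPITAL LETTER A}")` both yield `"A"`, while a surrogate such as `\N{U+D800}` is rejected because it is a code point but not a scalar.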
@@ -686,7 +684,7 @@ In PCRE, a bare `\x` denotes the NUL character (`U+00`). In Oniguruma, it denote

 ### Whitespace in ranges

-In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range, but will emit a warning telling users that we're doing so (**TODO: how would they silence? move to modern syntax?**).
+In PCRE, `x{2,4}` is a range quantifier meaning that `x` can be matched from 2 to 4 times. However if any whitespace is introduced within the braces, it becomes an invalid range and is then treated as the literal characters instead. We find this behavior to be unintuitive, and therefore intend to parse any intermixed whitespace in the range.

 ### Implicitly-scoped matching option scopes
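
Note on the "Whitespace in ranges" hunk above: the intended behavior (accepting whitespace inside the braces instead of falling back to literal characters) could look roughly like the Python sketch below. The function name `parse_range_quantifier` is hypothetical, and only the `{n}`, `{n,}` and `{n,m}` forms are handled. The hunk does not say exactly where whitespace is allowed, so allowing it around each number and the comma is an assumption.

```python
import re

# Whitespace is tolerated around the bounds and the comma (assumption).
# Whitespace in place of the comma, as in "{2 4}", is still rejected.
_RANGE = re.compile(r"\{\s*([0-9]+)\s*(?:(,)\s*([0-9]*)\s*)?\}")


def parse_range_quantifier(text):
    """Return (lower, upper) for a range quantifier, or None if invalid.

    upper is None when the range is unbounded, as in "{2,}".
    """
    m = _RANGE.fullmatch(text)
    if m is None:
        return None
    low, comma, high = m.groups()
    lower = int(low)
    if comma is None:
        return (lower, lower)   # "{n}": exactly n times
    if not high:
        return (lower, None)    # "{n,}": at least n times
    upper = int(high)
    if upper < lower:
        return None             # "{4,2}" is not a usable range
    return (lower, upper)
```

With this sketch, `{2,4}` and `{ 2 , 4 }` both parse to `(2, 4)`, whereas an engine following the PCRE behavior described in the hunk would treat the second spelling as literal text.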

@@ -716,6 +714,12 @@ PCRE supports `\N` meaning "not a newline", however there are engines that treat

 ICU unifies the character property syntax `\p{...}` with the syntax for POSIX character classes `[:...:]`, such that they follow the same internal grammar, which allows referencing any Unicode character property in addition to the POSIX properties. We intend to support this, though it is a purely additive feature, and therefore should not conflict with regex engines that implement a more limited POSIX syntax.

+### Script properties
+
+Shorthand script property syntax e.g `\p{Latin}` is treated as `\p{Script=Latin}` by PCRE, ICU, Oniguruma, and Java. These use [the Unicode Script property][unicode-scripts], which assigns each scalar a particular script value. However, there are scalars that may appear in multiple scripts, e.g U+3003 DITTO MARK. These often get assigned to the `Common` script to reflect this fact, which is not particularly useful for matching purposes. To provide more fine-grained script matching, Unicode provides [the Script Extension property][unicode-script-extensions], which exposes the set of scripts that a scalar appears in.
+
+As such we feel that the more desirable default behavior of shorthand script property syntax e.g `\p{Latin}` is for it to be treated as `\p{Script_Extension=Latin}`. This matches Perl's default behavior. Plain script properties may still be written using the more explicit syntax e.g `\p{Script=Latin}` and `\p{sc=Latin}`.
+
 ### Extended syntax modes

 Various regex engines offer an "extended syntax" where whitespace is treated as non-semantic (e.g `a b c` is equivalent to `abc`), in addition to allowing end-of-line comments `# comment`. In PCRE, this enabled through the `(?x)` and `(?xx)` matching options, where the former allows non-semantic whitespace outside of character classes, and the latter also allows non-semantic whitespace in custom character classes.
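
Note on the "Script properties" hunk above: the difference between matching on `Script` and on the script extensions set can be illustrated with a small Python sketch. The two tables are hand-written, abbreviated sample data, not values read from the Unicode Character Database, and the helper names (`resolve`, `matches`) are hypothetical. The Unicode property name is spelled `Script_Extensions` (alias `scx`), which is the spelling the sketch uses even though the added paragraph writes `Script_Extension`.

```python
# Hand-written, abbreviated sample data for illustration only. Real values
# come from the Unicode Character Database.
SCRIPT = {
    "A": "Latin",
    "\u3003": "Common",  # U+3003 DITTO MARK
}
SCRIPT_EXTENSIONS = {
    "\u3003": {"Han", "Hiragana", "Katakana"},  # abbreviated sample set
}

# Short aliases for the two property keys.
KEY_ALIASES = {"sc": "Script", "scx": "Script_Extensions"}


def resolve(contents):
    """Split property contents into (key, value).

    Assumption: a bare value is a script name, which this commit infers as
    Script_Extensions rather than Script.
    """
    if "=" in contents:
        key, value = contents.split("=", 1)
        return KEY_ALIASES.get(key, key), value
    return "Script_Extensions", contents


def matches(ch, key, value):
    """Check one scalar against a resolved script property."""
    if key == "Script":
        return SCRIPT[ch] == value
    if key == "Script_Extensions":
        # Scalars without an explicit entry fall back to their Script value.
        return value in SCRIPT_EXTENSIONS.get(ch, {SCRIPT[ch]})
    raise ValueError(f"unsupported property key: {key}")
```

Under the sample data, the shorthand `Han` matches U+3003 because it resolves to the extensions set, while the explicit `Script=Han` does not, since the scalar's single `Script` value is `Common`.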
@@ -793,3 +797,5 @@ We plan on choosing the canonical spelling **TODO: decide**.
 [UAX44-LM3]: https://www.unicode.org/reports/tr44/#UAX44-LM3
 [unicode-prop-key-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt
 [unicode-prop-value-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt
+[unicode-scripts]: https://www.unicode.org/reports/tr24/#Script
+[unicode-script-extensions]: https://www.unicode.org/reports/tr24/#Script_Extensions
