Commit c913eea

Update RegexSyntax.md
1 parent 8dc2b4b commit c913eea

1 file changed: +35 -9 lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 35 additions & 9 deletions
@@ -103,10 +103,15 @@ Atom -> Anchor
 | NamedCharacter
 | Subpattern
 | UniScalar
+| '\K'
 | '\'? <Character>
 ```

-Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect, e.g `\I` is literal `I`.
+Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
+
+#### `\K`
+
+The `\K` escape sequence is used to drop any previously matched characters from the final matching result. It does not however interfere with captures, e.g `a(b)\Kc` when matching against `abc` will return a match of `c`, but with a capture of `b`.

 ### Groups

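The `\K` behavior described in the hunk above can be approximated in an engine that lacks `\K`. The sketch below is only an illustration in Python's standard `re` module (which does not support `\K`), not the proposal's implementation. The helper name `search_with_keep` and the group name `kept` are hypothetical; the pattern is assumed to be split by hand at the `\K` position.

```python
import re

def search_with_keep(before_k, after_k, text):
    """Emulate `before_k\\Kafter_k`: match both parts, but report only
    the text matched by the part after \\K as the overall result.
    Hypothetical helper; captures in `before_k` keep their numbering."""
    m = re.search(f"(?:{before_k})(?P<kept>{after_k})", text)
    if m is None:
        return None
    # The reported match is the "kept" group; earlier captures stay available.
    return m.group("kept"), m.span("kept"), m

# The example from the text: a(b)\Kc matched against "abc".
kept, span, m = search_with_keep(r"a(b)", r"c", "abc")
print(kept, span, m.group(1))
```

Here the reported match covers only `c`, while capture group 1 still holds `b`, mirroring the two properties the paragraph states.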
@@ -261,7 +266,7 @@ HexDigit -> [0-9a-zA-Z]
 OctalDigit -> [0-7]
 ```

-These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation.
+These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x` that is not followed by any hexadecimal digit characters is treated as `\0`, which matches PCRE's behavior.

 ### Escape sequences

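The `\x` rule added in the hunk above can be sketched as a tiny parsing routine. This is a minimal sketch in Python, not the proposal's parser: the function name `parse_backslash_x` is hypothetical, and the limit of two hex digits for the brace-less form is an assumption borrowed from PCRE's `\xhh`, since the relevant grammar rule is not visible in this hunk.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_backslash_x(pattern, i):
    """Parse a brace-less `\\x` escape starting at index i.
    Returns (scalar_value, index_after_escape).
    Assumption: at most two hex digits are consumed, as in PCRE."""
    if not pattern.startswith("\\x", i):
        raise ValueError("expected \\x at index %d" % i)
    j = i + 2
    digits = ""
    while j < len(pattern) and len(digits) < 2 and pattern[j] in HEX_DIGITS:
        digits += pattern[j]
        j += 1
    # With no hex digits at all, the escape denotes scalar 0, i.e. `\0`.
    value = int(digits, 16) if digits else 0
    return value, j

print(parse_backslash_x(r"\x41z", 0))
print(parse_backslash_x(r"\xg", 0))
```

In the second call no hex digit follows `\x`, so the result is the scalar value 0 and parsing resumes at `g`.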
@@ -283,10 +288,11 @@ These escape sequences denote a specific character.
 ### Builtin character classes

 ```
-BuiltinCharClass -> '.' | '\d' | '\D' | '\h' | '\H' | '\N' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
+BuiltinCharClass -> '.' | '\C' | '\d' | '\D' | '\h' | '\H' | '\N' | '\O' | '\R' | '\s' | '\S' | '\v' | '\V' | '\w' | '\W' | '\X'
 ```

 - `.`: Any character excluding newlines
+- `\C`: A single UTF code unit
 - `\d`: Digit character
 - `\D`: Non-digit character
 - `\h`: Horizontal space character
@@ -316,7 +322,7 @@ SetOp -> '&&' | '--' | '~~' | '-'

 Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:

-- Builtin character classes, except `.`, `\O`, `\X`, and `\N`.
+- Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
 - Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
 - Unicode scalars
 - Named characters
@@ -325,9 +331,11 @@ Custom characters classes introduce their own language, in which most regular ex

 Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and digit characters.

+Custom character classes may not be empty, e.g `[]` is forbidden. A custom character class may begin with the `]` character, in which case it is treated as literal, e.g `[]a]` is the custom character class of `]` and `a`.
+
 Quoted sequences may be used to escape the contained characters, e.g `[\Q]\E]` is the character class of the literal character `[`.

-Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` appears at the start or end of a custom character class, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`.
+Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` cannot be used to form a range, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`. `[a-c-d]` is the character class of `a`...`c`, `-`, and `d`.

 Operators may be used to apply set operations to character class members. The operators supported are:

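The literal-`]` and literal-`-` rules added in the hunk above happen to coincide with how Python's standard `re` module reads the same examples, so they can be checked there. This is only an illustration using a different engine, not the proposal's parser; the helper `members` and its probe alphabet are hypothetical.

```python
import re

def members(char_class, alphabet="abcd-]"):
    """Return which characters of `alphabet` the character class matches.
    Hypothetical helper for probing a class over a small alphabet."""
    pattern = re.compile(char_class)
    return {ch for ch in alphabet if pattern.fullmatch(ch)}

# `-` at the start or end cannot form a range, so it is literal.
print(sorted(members("[-a-]")))
# After the range a-c, the next `-` cannot form a range either.
print(sorted(members("[a-c-d]")))
# A leading `]` is literal rather than closing the class.
print(sorted(members("[]a]")))

# An empty class is rejected.
try:
    re.compile("[]")
    empty_class_rejected = False
except re.error:
    empty_class_rejected = True
print(empty_class_rejected)
```

Set operators such as `--` and nested classes are deliberately left out of this check, since Python's `re` does not implement them.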
@@ -350,11 +358,12 @@ PropertyContents -> PropertyName ('=' PropertyName)?
 PropertyName -> [\s\w-]+
 ```

-A character property specifies a particular Unicode or POSIX property to match against. We intend on parsing:
+A character property specifies a particular Unicode, POSIX, or PCRE property to match against. We intend on parsing:

 - The full range of Unicode character properties.
 - The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
 - The UTS#18 special properties `any`, `assigned`, `ascii`.
+- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.

 We intend on following [UTS#18][uts18]'s guidance for character properties. This includes the use of fuzzy matching for property name parsing. This is done according to rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:

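The fuzzy matching mentioned in the hunk above can be sketched as a key-normalization step. The sketch below follows the UAX44-LM3 rule as I understand it (ignore case, whitespace, underscores, hyphens, and an initial `is` prefix); the function name `loose_property_key` and the sample names are illustrative assumptions and are not taken from the document's own list of equivalent names.

```python
import re

def loose_property_key(name):
    """Normalize a property name for loose comparison (UAX44-LM3 style).
    Hypothetical helper: two names compare equal if their keys are equal."""
    # Drop whitespace, underscores, and hyphens, then fold case.
    key = re.sub(r"[\s_-]", "", name).lower()
    # Ignore an initial "is" prefix, e.g. "IsLowercase" vs "Lowercase".
    if key.startswith("is"):
        key = key[2:]
    return key

# Illustrative spellings only.
print(loose_property_key("General_Category"))
print(loose_property_key("general category"))
print(loose_property_key("Is-Lowercase"))
```

A real implementation would compare these keys against a table of known property names and aliases rather than against each other.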
@@ -369,8 +378,9 @@ Unicode properties consist of both a key and a value, e.g `General_Category=Whit
 There are some Unicode properties where the key or value may be inferred. These include:

 - General category properties e.g `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
-- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.
+- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`. **TODO: Infer as `\p{scx=Greek}` instead?**
 - Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
+- Block properties that begin with the prefix `in`, e.g `\p{inBasicLatin}` is inferred to be `\p{Block=Basic_Latin}`.

 Other Unicode properties however must specify both a key and value.

@@ -388,7 +398,7 @@ NamedCharacter -> '\N{' CharName '}'
 CharName -> 'U+' HexDigit{1...8} | [\s\w-]+
 ```

-Allows a specific Unicode scalar to be specified by name or code point.
+Allows a specific Unicode scalar to be specified by name or hexadecimal code point.

 **TODO: Should this be called "named scalar" or similar?**

@@ -696,7 +706,7 @@ PCRE and .NET allow for conditional patterns to reference a group by its name, e

 where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against.

-We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
+We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by PCRE. **TODO: Is the opposite more common?**

 ### `\N`

@@ -742,6 +752,16 @@ We intend on matching the PCRE behavior where groups are numbered purely based o

 Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.

+### Unicode scalars
+
+### Character properties
+
+### Groups
+
+#### Named
+
+#### Lookaheads and lookbehinds
+
 ### Backreferences

 There are a variety of backreference spellings accepted by different engines
@@ -758,6 +778,12 @@ Backreference -> '\g{' NameOrNumberRef '}'

 We plan on choosing the canonical spelling **TODO: decide**.

+### Subpattern
+
+### Conditional references
+
+### Callouts
+

 [pcre2-syntax]: https://www.pcre.org/current/doc/html/pcre2syntax.html
 [oniguruma-syntax]: https://github.com/kkos/oniguruma/blob/master/doc/RE
