Documentation/Evolution/RegexSyntax.md
+35 -9 (35 additions & 9 deletions)
@@ -103,10 +103,15 @@ Atom -> Anchor
      | NamedCharacter
      | Subpattern
      | UniScalar
+     | '\K'
      | '\'? <Character>
 ```

-Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect, e.g `\I` is literal `I`.
+Atoms are the smallest units of regular expression syntax. They include escape sequences e.g `\b`, `\d`, as well as meta-characters such as `.` and `$`. They also include some larger syntactic constructs such as backreferences and callouts. The most basic form of atom is a literal character. A meta-character may be treated as literal by preceding it with a backslash. Other characters may also be preceded with a backslash, but it has no effect if they are unknown escape sequences, e.g `\I` is literal `I`.
+
+#### `\K`
+
+The `\K` escape sequence is used to drop any previously matched characters from the final matching result. It does not however interfere with captures, e.g `a(b)\Kc` when matching against `abc` will return a match of `c`, but with a capture of `b`.

 ### Groups
@@ -261,7 +266,7 @@ HexDigit -> [0-9a-zA-Z]
 OctalDigit -> [0-7]
 ```

-These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation.
+These sequences define a unicode scalar value to be matched against. There is both syntax for specifying the scalar value in hex notation, as well as octal notation. Note that `\x` that is not followed by any hexadecimal digit characters is treated as `\0`, which matches PCRE's behavior.
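As a rough illustration of the `\x` rule noted above, the following Python sketch decodes a leading `\x` escape and falls back to U+0000 when no hexadecimal digits follow. The helper name `decode_x_escape` is made up for this example, and the limit of two hex digits is an assumption borrowed from PCRE's plain `\xhh` form; it is not the proposal's parser.

```python
import re

# Hypothetical helper, not part of the proposal.
# Assumes at most two hex digits may follow `\x`, as in PCRE's `\xhh` form.
_X_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{0,2})")

def decode_x_escape(text):
    """Return (scalar, chars_consumed) for a leading `\\x` escape in `text`."""
    m = _X_ESCAPE.match(text)
    if m is None:
        raise ValueError("text does not start with \\x")
    digits = m.group(1)
    # With no digits at all, `\x` is treated like `\0`, i.e. U+0000.
    value = int(digits, 16) if digits else 0
    return chr(value), m.end()
```

Under these assumptions, `decode_x_escape(r"\xg")` consumes only the two characters of `\x` and yields U+0000, leaving `g` to be parsed as an ordinary literal.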

 ### Escape sequences
@@ -283,10 +288,11 @@ These escape sequences denote a specific character.
 Custom characters classes introduce their own language, in which most regular expression metacharacters become literal. The basic element in a custom character class is an `Atom`, though only a few atoms are considered valid:

-- Builtin character classes, except `.`, `\O`, `\X`, and `\N`.
+- Builtin character classes except `.`, `\R`, `\O`, `\X`, `\C`, and `\N`.
 - Escape sequences, including `\b` which becomes the backspace character (rather than a word boundary)
 - Unicode scalars
 - Named characters
@@ -325,9 +331,11 @@ Custom characters classes introduce their own language, in which most regular ex
 Atoms may be used to compose other character class members, including ranges, quoted sequences, and even nested custom character classes `[[ab]c\d]`. Adjacent members form an implicit union of character classes, e.g `[[ab]c\d]` is the union of the characters `a`, `b`, `c`, and digit characters.

+Custom character classes may not be empty, e.g `[]` is forbidden. A custom character class may begin with the `]` character, in which case it is treated as literal, e.g `[]a]` is the custom character class of `]` and `a`.
+
 Quoted sequences may be used to escape the contained characters, e.g `[\Q]\E]` is the character class of the literal character `]`.

-Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` appears at the start or end of a custom character class, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`.
+Ranges of characters may be specified with `-`, e.g `[a-z]` matches against the letters from `a` to `z`. Only unicode scalars and literal characters are valid range operands. If `-` cannot be used to form a range, it is interpreted as literal, e.g `[-a-]` is the character class of `-` and `a`. `[a-c-d]` is the character class of `a`...`c`, `-`, and `d`.
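The literal-versus-range rules above can be checked against an existing engine. Python's `re` module happens to treat these particular cases the same way, so the sketch below uses it as a stand-in. This is an observation about Python only, not a description of how the proposed parser is implemented, and the helper names `members` and `is_valid` are made up for the example.

```python
import re

def members(pattern, candidates="abcd-]"):
    """Return the characters from `candidates` that the class `pattern` matches."""
    cls = re.compile(pattern)
    return {ch for ch in candidates if cls.fullmatch(ch)}

def is_valid(pattern):
    """Report whether Python's `re` accepts `pattern`."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# `-` at the start or end cannot form a range, so it is literal.
edge_dashes = members(r"[-a-]")

# After the range `a-c`, the following `-` has no left operand, so it is literal.
mixed = members(r"[a-c-d]")

# A leading `]` is literal rather than closing the class.
leading_bracket = members(r"[]a]")

# An empty class is rejected.
empty_ok = is_valid(r"[]")
```

Here `edge_dashes` holds `-` and `a`, `mixed` holds `a`, `b`, `c`, `-`, and `d`, `leading_bracket` holds `]` and `a`, and `empty_ok` is false because Python reads `[]` as an unterminated class.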

 Operators may be used to apply set operations to character class members. The operators supported are:
-A character property specifies a particular Unicode or POSIX property to match against. We intend on parsing:
+A character property specifies a particular Unicode, POSIX, or PCRE property to match against. We intend on parsing:

 - The full range of Unicode character properties.
 - The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
 - The UTS#18 special properties `any`, `assigned`, `ascii`.
+- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.

 We intend on following [UTS#18][uts18]'s guidance for character properties. This includes the use of fuzzy matching for property name parsing. This is done according to rules set out by [UAX44-LM3]. This means that the following property names are considered equivalent:
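The loose matching rule referenced above can be sketched in Python. The sketch assumes the reading of UAX44-LM3 in which case, whitespace, underscores, hyphens, and an initial `is` prefix are ignored; the function name `loose_key` is hypothetical and the proposal's actual implementation may differ.

```python
def loose_key(name):
    """Normalize a property name or value for loose comparison (UAX44-LM3 style)."""
    # Ignore case, whitespace, underscores, and hyphens.
    key = "".join(
        ch for ch in name if not ch.isspace() and ch not in "_-"
    ).lower()
    # Ignore an initial "is" prefix.
    if key.startswith("is"):
        key = key[2:]
    return key

def loosely_equal(lhs, rhs):
    """Two spellings are equivalent when their normalized keys agree."""
    return loose_key(lhs) == loose_key(rhs)
```

With this normalization, spellings such as `White_Space`, `whitespace`, `isWhiteSpace`, and `white-space` all compare equal.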
@@ -369,8 +378,9 @@ Unicode properties consist of both a key and a value, e.g `General_Category=Whit
 There are some Unicode properties where the key or value may be inferred. These include:

 - General category properties e.g `\p{Whitespace}` is inferred as `\p{General_Category=Whitespace}`.
-- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`.
+- Script properties e.g `\p{Greek}` is inferred as `\p{Script=Greek}`. **TODO: Infer as `\p{scx=Greek}` instead?**
 - Boolean properties that are inferred to have a `True` value, e.g `\p{Lowercase}` is inferred as `\p{Lowercase=True}`.
+- Block properties that begin with the prefix `in`, e.g `\p{inBasicLatin}` is inferred to be `\p{Block=Basic_Latin}`.

 Other Unicode properties however must specify both a key and value.
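The inference rules listed above can be sketched as a small lookup. Everything here is illustrative: the tables contain only the examples from the list (real data would come from the Unicode Character Database), the function name `infer_property` is hypothetical, and checking the rules in the order they are listed is an assumption, since the text does not specify a precedence.

```python
# Tiny illustrative tables that mirror the examples in the list above.
GENERAL_CATEGORIES = {"Whitespace"}
SCRIPTS = {"Greek"}
BOOLEAN_PROPERTIES = {"Lowercase"}
BLOCKS = {"BasicLatin": "Basic_Latin"}

def infer_property(name):
    """Expand a bare `\\p{name}` into an explicit (key, value) pair."""
    if name in GENERAL_CATEGORIES:
        return ("General_Category", name)
    if name in SCRIPTS:
        return ("Script", name)
    if name in BOOLEAN_PROPERTIES:
        # Boolean properties are inferred to have a True value.
        return (name, "True")
    # Block properties are recognized by their `in` prefix.
    if name[:2].lower() == "in" and name[2:] in BLOCKS:
        return ("Block", BLOCKS[name[2:]])
    # Anything else must spell out both key and value.
    raise ValueError(f"{name!r} must be written as key=value")
```

For instance, `infer_property("inBasicLatin")` yields the pair `("Block", "Basic_Latin")`, while an unrecognized bare name raises an error rather than guessing a key.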
-Allows a specific Unicode scalar to be specified by name or code point.
+Allows a specific Unicode scalar to be specified by name or hexadecimal code point.

 **TODO: Should this be called "named scalar" or similar?**
@@ -696,7 +706,7 @@ PCRE and .NET allow for conditional patterns to reference a group by its name, e
 where `y` will only be matched if `(?<group1>x)` was matched. PCRE will always treat such syntax as a backreference condition, however .NET will only treat it as such if a group with that name exists somewhere in the regex (including after the conditional). Otherwise, .NET interprets `group1` as an arbitrary regular expression condition to try match against.

-We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?('group1')y)` if they want a backreference condition. This more explicit syntax is supported by PCRE.
+We intend to always parse such conditions as an arbitrary regular expression condition, and will emit a warning asking users to explicitly use the syntax `(?(<group1>)y)` if they want a backreference condition. This more explicit syntax is supported by PCRE. **TODO: Is the opposite more common?**

 ### `\N`
@@ -742,6 +752,16 @@ We intend on matching the PCRE behavior where groups are numbered purely based o
 Many engines have different spellings for the same regex features, and as such we need to decide on a preferred canonical syntax.

+### Unicode scalars
+
+### Character properties
+
+### Groups
+
+#### Named
+
+#### Lookaheads and lookbehinds
+

 ### Backreferences

There are a variety of backreference spellings accepted by different engines