Commit cc314a2

Update RegexSyntax.md
1 parent ccc2a7b commit cc314a2

1 file changed

+39
-18
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 39 additions & 18 deletions
@@ -10,8 +10,8 @@ Hello, we want to issue an update to [Regular Expression Literals](https://forum

 Regex literals declare a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. Formalizing regex literals in Swift requires:

-- Choosing a delimiter (e.g. `#/.../#` or `re'...'`)
-- Detailing the "interior syntax" accepted in between delimiters
+- Choosing a delimiter (e.g. `#/.../#` or `re'...'`).
+- Detailing the "interior syntax" accepted in between delimiters.
 - Specifying actual types and relevant protocols for the literal.

 We present a detailed and comprehensive treatment of regex literal interior syntax. The syntax we're proposing is large enough for its own dedicated discussion ahead of a full regex literal proposal.
@@ -48,18 +48,32 @@ Regex literal interior syntax will be part of Swift's source-compatibility story

 We propose the following syntax for use inside Swift regex literals.

-*TODO:* Disclosure triangle explaining the grammar conventions?
+<details><summary>Grammar Conventions</summary>
+
+Elements of the grammar are defined using the syntax `Element -> <Definition>`.
+
+Quoted characters, e.g. `'abc'` or `"abc"`, in the grammar match against the literal characters. Unquoted names, e.g. `Concatenation`, refer to other definitions in the grammar.
+
+The `|` operator is used to specify that the grammar can match against either branch of the operator, similar to a regular expression. Similarly, `*`, `+`, and `?` are used to quantify an element of the grammar, with the same meaning as in regular expressions. Range quantifiers `{...}` may also be used, though we adopt a more explicit syntax that uses the Swift `..<` and `...` operators, e.g. `{1...4}`.
+
+Basic custom character classes may appear in the grammar, and have the same meaning as in a regular expression. For example, `[0-9a-zA-Z]` expresses the digits `0` to `9` and the letters `a` to `z` (both upper and lowercase).
+
+The `!` prefix operator is used to specify that the following grammar element must not appear at that position.
+
+Grammar elements may be surrounded by parentheses for the purposes of quantification.
+
+</details>

 ### Top-level regular expression

 ```
-Regex -> GlobalMatchingOptionSequence? RegexNode
-RegexNode -> '' | Alternation
-Alternation -> Concatenation ('|' Concatenation)*
-Concatenation -> (!'|' !')' ConcatComponent)*
+Regex -> GlobalMatchingOptionSequence? RegexNode
+RegexNode -> '' | Alternation
+Alternation -> Concatenation ('|' Concatenation)*
+Concatenation -> (!'|' !')' ConcatComponent)*
 ```

-A regex literal may be prefixed with a sequence of global matching options(*TODO*: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by `|`.
+A regex literal may be prefixed with a sequence of [global matching options](#pcre-global-matching-options). A literal's contents can be empty or a sequence of alternatives separated by `|`.

 Alternatives are a series of expressions concatenated together. The concatenation ends with either a `|` denoting the end of the alternative or a `)` denoting the end of a recursively parsed group.
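
The `Alternation` and `Concatenation` rules above can be read as a small recursive-descent parser. The following is only a sketch of that reading, not the proposal's parser: `Node` and `SketchParser` are hypothetical names, every component other than a parenthesized group is treated as a single literal character, and escapes, quantifiers, and error reporting are left out.

```swift
// Hypothetical AST for illustration only; not the proposal's actual types.
indirect enum Node: Equatable {
    case literal(Character)
    case group(Node)
    case concatenation([Node])
    case alternation([Node])
}

// Hypothetical parser covering only the Alternation/Concatenation rules.
struct SketchParser {
    private let chars: [Character]
    private var index = 0

    init(_ source: String) {
        chars = Array(source)
    }

    // The next unconsumed character, or nil at the end of input.
    private var peek: Character? {
        index < chars.count ? chars[index] : nil
    }

    // Alternation -> Concatenation ('|' Concatenation)*
    mutating func parseAlternation() -> Node {
        var branches = [parseConcatenation()]
        while peek == "|" {
            index += 1 // consume '|'
            branches.append(parseConcatenation())
        }
        return branches.count == 1 ? branches[0] : .alternation(branches)
    }

    // Concatenation -> (!'|' !')' ConcatComponent)*
    // Stops at '|' (end of the alternative) or ')' (end of the enclosing group).
    mutating func parseConcatenation() -> Node {
        var components: [Node] = []
        while let c = peek, c != "|", c != ")" {
            index += 1
            if c == "(" {
                components.append(.group(parseAlternation()))
                if peek == ")" { index += 1 } // consume the group's ')'
            } else {
                components.append(.literal(c))
            }
        }
        return .concatenation(components)
    }
}

// Usage: two top-level alternatives, the second containing a nested group.
var parser = SketchParser("ab|(c|d)e")
let tree = parser.parseAlternation()
```

Because the concatenation loop stops at `)` without consuming it, the same routine serves both the top level and nested groups; the caller that consumed the opening `(` is the one that consumes the closing `)`.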

@@ -192,7 +206,7 @@ BuiltinCharClass -> '.' | '\C' | '\d' | '\D' | '\h' | '\H' | '\N' | '\O' | '\R'
 - `\W`: Non-word character.
 - `\X`: Any extended grapheme cluster.

-Precise definitions of character classes is discussed in (Character Classes for String Processing)[https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920].
+Precise definitions of character classes are discussed in [Character Classes for String Processing](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920).

 #### Unicode scalars

@@ -234,7 +248,7 @@ A character property specifies a particular Unicode, POSIX, or PCRE property to
 - The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
 - The UTS#18 special properties `any`, `assigned`, `ascii`.
 - The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
-- The special Java property `javaLowerCase`
+- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

 We follow [UTS#18][uts18]'s guidance for character properties, including fuzzy matching for property name parsing, according to rules set out by [UAX44-LM3]. The following property names are equivalent:

@@ -257,12 +271,14 @@ Other Unicode properties however must specify both a key and value.

 For non-Unicode properties, only a value is required. These include:

-- The special properties `any`, `assigned`, `ascii`.
+- The UTS#18 special properties `any`, `assigned`, `ascii`.
 - The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.
+- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
+- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

 Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g. `[:script=Latin:]` as well as `\p{alnum}`.

-#### `\K`
+### `\K`

 The `\K` escape sequence is used to drop any previously matched characters from the final matching result. It does not affect captures; e.g. matching `a(b)\Kc` against `abc` returns a match of `c`, but with a capture of `b`.

@@ -298,13 +314,13 @@ Identifier -> [\w--\d] \w*

 Groups define a new scope that contains a recursively nested regex. Groups have different semantics depending on how they are introduced.

-Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
+Note there are additional constructs that may syntactically appear similar to groups, such as backreferences and conditionals, but are distinct.

 #### Basic group kinds

 - `()`: A capturing group.
 - `(?:)`: A non-capturing group.
-- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
+- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See [Group Numbering](#group-numbering).

 Capturing groups produce captures, which remember the range of input matched for the scope of that group.

@@ -427,7 +443,7 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add
 - `n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead.
 - `s`: Changes `.` to match any character, including newlines.
 - `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy.
-- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
+- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See [Extended Syntax Modes](#extended-syntax-modes) for more info.

 #### ICU options

@@ -513,7 +529,7 @@ PCREVersionCheck -> '>'? '=' PCREVersionNumber
 PCREVersionNumber -> <Int> '.' <Int>
 ```

-A conditional evaluates a particular condition, and chooses a branch to match against accordingly. 1 or 2 branches may be specified. If 1 branch is specified e.g `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g `(?(...))` which is the null pattern as described in *top-level regular expression*. If 2 branches are specified, e.g `(?(...)x|y)`, the first is treated as the true branch, the second being the false branch.
+A conditional evaluates a particular condition and chooses a branch to match against accordingly. One or two branches may be specified. If one branch is specified, e.g. `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g. `(?(...))`, which is the null pattern as described in the [Top-Level Regular Expression](#top-level-regular-expression) section. If two branches are specified, e.g. `(?(...)x|y)`, the first is treated as the true branch and the second as the false branch.

 A condition may be:

@@ -787,12 +803,17 @@ UnicodeScalar -> '\u{' HexDigit{1...} '}'

 HexDigit -> [0-9a-fA-F]
 OctalDigit -> [0-7]
+
+NamedScalar -> '\N{' ScalarName '}'
+ScalarName -> 'U+' HexDigit{1...8} | [\s\w-]+
 ```
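
As an illustration of the `\u{...}` spelling in the grammar above, here is a minimal sketch that turns such an escape into a `Unicode.Scalar`. The function name `parseUnicodeScalarEscape` is hypothetical, it handles only this one spelling, and it assumes hex digits are limited to `0-9`, `a-f`, and `A-F`.

```swift
// Hypothetical helper for illustration: parses only the `\u{...}` spelling.
// Returns nil for malformed input or for values that are not valid scalars.
func parseUnicodeScalarEscape(_ text: String) -> Unicode.Scalar? {
    guard text.hasPrefix("\\u{"), text.hasSuffix("}") else { return nil }

    // Strip the leading `\u{` (three characters) and the trailing `}`.
    let digits = text.dropFirst(3).dropLast()

    // HexDigit{1...}: require at least one ASCII hex digit and nothing else.
    guard !digits.isEmpty,
          digits.allSatisfy({ $0.isASCII && $0.isHexDigit }) else { return nil }

    // UInt32(_:radix:) returns nil on overflow; Unicode.Scalar(_:) returns nil
    // for surrogates and for values above U+10FFFF.
    guard let value = UInt32(digits, radix: 16) else { return nil }
    return Unicode.Scalar(value)
}
```

For example, `parseUnicodeScalarEscape("\\u{65}")` yields the scalar for `e`, while a surrogate such as `\u{D800}` is rejected because it is not a valid Unicode scalar value.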

-For consistency with String escape syntax, we intend on canonicalizing to `\u{...}`.
+There are multiple equivalent ways of spelling the same Unicode scalar value: in hex, in octal, or by spelling the name explicitly. String literals already provide a `\u{...}` syntax that allows a hex sequence for a Unicode scalar. As this is Swift's existing preferred spelling for such a sequence, we consider it to be the preferred spelling in this case too. There may however be value in preserving scalars that are explicitly spelled by name with `\N{...}` for clarity.

 ### Character properties

+Character properties `\p{...}` have a variety of alternative spellings due to fuzzy matching, Unicode aliases, and shorthand syntax for common Unicode properties. They may also be written using POSIX syntax, e.g. `[:gc=Whitespace:]`.
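
To make the fuzzy matching mentioned above concrete, the sketch below normalizes a property name before comparison. It assumes the UAX44-LM3 rule amounts to ignoring case, whitespace, underscores, hyphens, and an initial `is` prefix; `looseMatchingKey` is a hypothetical name, and a real implementation would also need to resolve Unicode aliases.

```swift
// Hypothetical helper: reduces a property name to a loose-matching key,
// assuming UAX44-LM3 ignores case, whitespace, '_', '-', and a leading "is".
func looseMatchingKey(_ name: String) -> String {
    var key = name.lowercased().filter {
        !$0.isWhitespace && $0 != "_" && $0 != "-"
    }
    if key.hasPrefix("is") {
        key.removeFirst(2)
    }
    return key
}
```

Two spellings are then treated as the same property name when their keys are equal; for instance `Script_Extensions` and `script extensions` both reduce to `scriptextensions`.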
+
 **TODO: Should we canonicalize on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**

 ### Groups
@@ -906,4 +927,4 @@ Note that this proposal regards _syntactic_ support, and does not necessarily me
 [unicode-prop-key-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt
 [unicode-prop-value-aliases]: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt
 [unicode-scripts]: https://www.unicode.org/reports/tr24/#Script
-[unicode-script-extensions]: https://www.unicode.org/reports/tr24/#Script_Extensions
+[unicode-script-extensions]: https://www.unicode.org/reports/tr24/#Script_Extensions
