Documentation/Evolution/RegexSyntax.md
39 additions & 18 deletions
@@ -10,8 +10,8 @@ Hello, we want to issue an update to [Regular Expression Literals](https://forum

 Regex literals declare a string processing algorithm using syntax familiar across a variety of languages and tools throughout programming history. Formalizing regex literals in Swift requires:

-- Choosing a delimiter (e.g. `#/.../#` or `re'...'`)
-- Detailing the "interior syntax" accepted in between delimiters
+- Choosing a delimiter (e.g. `#/.../#` or `re'...'`).
+- Detailing the "interior syntax" accepted in between delimiters.
 - Specifying actual types and relevant protocols for the literal.

 We present a detailed and comprehensive treatment of regex literal interior syntax. The syntax we're proposing is large enough for its own dedicated discussion ahead of a full regex literal proposal.
@@ -48,18 +48,32 @@ Regex literal interior syntax will be part of Swift's source-compatibility story

 We propose the following syntax for use inside Swift regex literals.

-*TODO:* Disclosure triangle explaining the grammar conventions?
+<details><summary>Grammar Conventions</summary>
+
+Elements of the grammar are defined using the syntax `Element -> <Definition>`.
+
+Quoted characters e.g `'abc'`, `"abc"` in the grammar match against the literal characters. Unquoted names e.g `Concatenation` refer to other definitions in the grammar.
+
+The `|` operator is used to specify that the grammar can match against either branch of the operator, similar to a regular expression. Similarly, `*`, `+`, and `?` are used to quantify an element of the grammar, with the same meaning as in regular expressions. Range quantifiers `{...}` may also be used, though we adopt a more explicit syntax that uses the Swift `..<` & `...` operators, e.g `{1...4}`.
+
+Basic custom character classes may appear in the grammar, and have the same meaning as in a regular expression. For example `[0-9a-zA-Z]` expresses the digits `0` to `9` and the letters `a` to `z` (both upper and lowercase).
+
+The `!` prefix operator is used to specify that the following grammar element must not appear at that position.
+
+Grammar elements may be surrounded by parentheses for the purposes of quantification.
+
+</details>

 ### Top-level regular expression

 ```
-Regex -> GlobalMatchingOptionSequence? RegexNode
-RegexNode -> '' | Alternation
-Alternation -> Concatenation ('|' Concatenation)*
-Concatenation -> (!'|' !')' ConcatComponent)*
+Regex -> GlobalMatchingOptionSequence? RegexNode
+RegexNode -> '' | Alternation
+Alternation -> Concatenation ('|' Concatenation)*
+Concatenation -> (!'|' !')' ConcatComponent)*
 ```

-A regex literal may be prefixed with a sequence of global matching options(*TODO*: intra-doc link). A literal's contents can be empty or a sequence of alternatives separated by `|`.
+A regex literal may be prefixed with a sequence of [global matching options](#pcre-global-matching-options). A literal's contents can be empty or a sequence of alternatives separated by `|`.

 Alternatives are a series of expressions concatenated together. The concatenation ends with either a `|` denoting the end of the alternative or a `)` denoting the end of a recursively parsed group.
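
The top-level structure above (an empty pattern, or alternatives separated by `|`, each a concatenation) can be illustrated with a small sketch. It uses Python's `re` module purely as a stand-in engine that shares this part of the syntax; it is an assumption for illustration and says nothing about how the Swift parser is implemented.

```python
import re

# Python's `re` is only a stand-in engine here; it is not the Swift implementation.

# RegexNode -> '' : an empty pattern is valid and matches the empty string.
empty = re.fullmatch(r"", "")

# Alternation -> Concatenation ('|' Concatenation)*
# Two alternatives, each a concatenation of two literal characters.
alt = re.compile(r"ab|cd")
first = alt.fullmatch("ab")
second = alt.fullmatch("cd")
mixed = alt.fullmatch("abcd")  # neither alternative covers the whole input

# Inside a group, `)` ends the last alternative of the nested regex.
grouped = re.fullmatch(r"x(ab|cd)y", "xcdy")
```

Here `mixed` is `None`, while `grouped.group(1)` holds the alternative that matched inside the group.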
-Precise definitions of character classes is discussed in (Character Classes for String Processing)[https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920].
+Precise definitions of character classes are discussed in [Character Classes for String Processing](https://forums.swift.org/t/pitch-character-classes-for-string-processing/52920).

 #### Unicode scalars
@@ -234,7 +248,7 @@ A character property specifies a particular Unicode, POSIX, or PCRE property to
 - The POSIX properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit` (note that `alpha`, `lower`, `upper`, `space`, `punct`, `digit`, and `cntrl` are covered by Unicode properties).
 - The UTS#18 special properties `any`, `assigned`, `ascii`.
 - The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
-- The special Java property`javaLowerCase`
+- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

 We follow [UTS#18][uts18]'s guidance for character properties, including fuzzy matching for property name parsing, according to rules set out by [UAX44-LM3]. The following property names are equivalent:
@@ -257,12 +271,14 @@ Other Unicode properties however must specify both a key and value.

 For non-Unicode properties, only a value is required. These include:

-- The special properties `any`, `assigned`, `ascii`.
+- The UTS#18 special properties `any`, `assigned`, `ascii`.
 - The POSIX compatibility properties `alnum`, `blank`, `graph`, `print`, `word`, `xdigit`. The remaining POSIX properties are already covered by boolean Unicode property spellings.
+- The special PCRE2 properties `Xan`, `Xps`, `Xsp`, `Xuc`, `Xwd`.
+- The special Java properties `javaLowerCase`, `javaUpperCase`, `javaWhitespace`, `javaMirrored`.

 Note that the internal `PropertyContents` syntax is shared by both the `\p{...}` and POSIX-style `[:...:]` syntax, allowing e.g `[:script=Latin:]` as well as `\p{alnum}`.

-####`\K`
+### `\K`

 The `\K` escape sequence is used to drop any previously matched characters from the final matching result. It does not affect captures, e.g `a(b)\Kc` when matching against `abc` will return a match of `c`, but with a capture of `b`.
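
The `a(b)\Kc` example can be approximated in an engine that lacks `\K`. The sketch below is an assumption-laden stand-in: Python's `re` module does not implement `\K`, so a fixed-width lookbehind is swapped in to get the same observable result (overall match `c`, capture `b`). It is not how `\K` itself is implemented, and it only works here because `a(b)` has a fixed width.

```python
import re

# Assumption: Python's `re` has no `\K`, so a fixed-width lookbehind stands in for it.
# `a(b)\Kc` is approximated as `(?<=a(b))c`: "ab" must precede `c`,
# but is excluded from the overall match.
m = re.search(r"(?<=a(b))c", "abc")

matched_text = m.group(0)  # overall match excludes the text before the reset point
captured = m.group(1)      # the capture made before the reset point is still available
```

With the input `abc`, `matched_text` is `c` and `captured` is `b`, mirroring the behaviour described for `\K`.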
@@ -298,13 +314,13 @@ Identifier -> [\w--\d] \w*

 Groups define a new scope that contains a recursively nested regex. Groups have different semantics depending on how they are introduced.

-Note there are additional constructs that may syntactically appear similar to groups, but are distinct. See the *group-like atoms* section.
+Note there are additional constructs that may syntactically appear similar to groups, such as backreferences and conditionals, but are distinct.

 #### Basic group kinds

 - `()`: A capturing group.
 - `(?:)`: A non-capturing group.
-- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See *group numbering*.
+- `(?|)`: A group that, for a direct child alternation, resets the numbering of groups at each branch of that alternation. See [Group Numbering](#group-numbering).

 Capturing groups produce captures, which remember the range of input matched for the scope of that group.
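
The difference between capturing and non-capturing groups can be sketched as follows. Python's `re` module is used as a stand-in engine that shares the `()` and `(?:)` spellings; the sample pattern and input are made up for illustration, and the branch-reset group `(?|)` is not shown because `re` does not support it.

```python
import re

# Python's `re` is only a stand-in engine; pattern and input are illustrative.
# Groups 1 and 2 capture; the middle `(?:...)` group matches but records nothing.
m = re.search(r"(\d+)-(?:\d+)-(\d+)", "id 12-34-56")

captures = m.groups()      # only the capturing groups appear, in order
first_range = m.span(1)    # range of input remembered for group 1
second_range = m.span(2)   # range of input remembered for group 2
```

Each capture records both the matched text and its range in the input, which is what later constructs such as backreferences rely on.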
@@ -427,7 +443,7 @@ We support all the matching options accepted by PCRE, ICU, and Oniguruma. In add
 - `n`: Disables the capturing behavior of `(...)` groups. Named capture groups must be used instead.
 - `s`: Changes `.` to match any character, including newlines.
 - `U`: Changes quantifiers to be reluctant by default, with the `?` specifier changing to mean greedy.
-- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See the *trivia* section for more info.
+- `x`, `xx`: Enables extended syntax mode, which allows non-semantic whitespace and end-of-line comments. See [Extended Syntax Modes](#extended-syntax-modes) for more info.
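
Two of the options listed above, `s` and `x`, can be sketched with Python's `re` module, which accepts them as inline flags with the same letters. This is an illustrative stand-in only: `re` does not support `n`, `U`, or `xx`, and the date-like pattern is a made-up example.

```python
import re

text = "a\nb"

# Without `s`, `.` does not match a newline.
without_s = re.fullmatch(r"a.b", text)
# With `s`, `.` matches any character, including newlines.
with_s = re.fullmatch(r"(?s)a.b", text)

# With `x`, whitespace in the pattern is non-semantic and `#` starts
# an end-of-line comment.
dated = re.compile(r"""(?x)
    \d{4}  # year
    -      # separator
    \d{2}  # month
""")
extended = dated.fullmatch("2024-05")
```

Here `without_s` is `None`, while `with_s` and `extended` both match.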
-A conditional evaluates a particular condition, and chooses a branch to match against accordingly. 1 or 2 branches may be specified. If 1 branch is specified e.g `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g `(?(...))` which is the null pattern as described in *top-levelregularexpression*. If 2 branches are specified, e.g `(?(...)x|y)`, the first is treated as the true branch, the second being the false branch.
+A conditional evaluates a particular condition, and chooses a branch to match against accordingly. 1 or 2 branches may be specified. If 1 branch is specified e.g `(?(...)x)`, it is treated as the true branch. Note this includes an empty true branch, e.g `(?(...))` which is the null pattern as described in the [Top-Level Regular Expression](#top-level-regular-expression) section. If 2 branches are specified, e.g `(?(...)x|y)`, the first is treated as the true branch, the second being the false branch.
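
The one-branch and two-branch forms can be sketched with Python's `re` module, which supports conditionals whose condition is a group number or name. Treat this as an assumption for illustration: the condition kinds accepted by the proposed syntax may be broader, and the patterns below are made up.

```python
import re

# One branch: `(?(1)>)` requires `>` only if group 1 (the `<`) participated.
# When group 1 did not participate, the missing false branch matches nothing.
one_branch = re.compile(r"(<)?\w+(?(1)>)")
bracketed = one_branch.fullmatch("<tag>")
bare = one_branch.fullmatch("tag")
unclosed = one_branch.fullmatch("<tag")

# Two branches: `b` is the true branch, `c` is the false branch.
two_branch = re.compile(r"(a)?(?(1)b|c)")
took_true = two_branch.fullmatch("ab")
took_false = two_branch.fullmatch("c")
wrong_branch = two_branch.fullmatch("ac")
```

`unclosed` and `wrong_branch` are `None`, since the branch selected by the condition fails to match; the other four inputs match.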
-For consistency with String escape syntax, we intend on canonicalizing to `\u{...}`.
+There are multiple equivalent ways of spelling the same Unicode scalar value, in either hex, octal, or by spelling the name explicitly. String literals already provide a `\u{...}` syntax that allows a hex sequence for a Unicode scalar. As this is Swift's existing preferred spelling for such a sequence, we consider it to be the preferred spelling in this case too. There may however be value in preserving scalars that are explicitly spelled by name with `\N{...}` for clarity.
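
To make the equivalence concrete, the sketch below spells the scalar U+0041 four ways and checks that each matches `A`. It uses Python's `re` module as a stand-in (`\N{...}` requires Python 3.8 or later); note that `re` accepts `\uXXXX` rather than the braced `\u{...}` form preferred here, so the exact spellings are an assumption of this illustration.

```python
import re

# Four spellings of the same scalar, U+0041 LATIN CAPITAL LETTER A.
spellings = [
    r"\x41",                        # two-digit hex
    r"\u0041",                      # four-digit hex
    r"\101",                        # three-digit octal
    r"\N{LATIN CAPITAL LETTER A}",  # explicit Unicode name
]

# Each spelling should match exactly the single character "A".
results = [re.fullmatch(p, "A") is not None for p in spellings]
```

Because every spelling denotes the same scalar, a canonical form can be chosen without changing what the pattern matches.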

 ### Character properties

+Character properties `\p{...}` have a variety of alternative spellings due to fuzzy matching, Unicode aliases, and shorthand syntax for common Unicode properties. They also may be written using POSIX syntax e.g `[:gc=Whitespace:]`.
+
 **TODO: Should we canonicalize on e.g `\p{Script_Extensions=Greek}`? Or prefer the shorthand where we can? Or just avoid canonicalizing?**

 ### Groups
@@ -906,4 +927,4 @@ Note that this proposal regards _syntactic_ support, and does not necessarily me