Commit f64e6ae

Update RegexSyntax.md - grammar convention
1 parent 11bb57d commit f64e6ae

1 file changed

+25
-19
lines changed

Documentation/Evolution/RegexSyntax.md

Lines changed: 25 additions & 19 deletions
@@ -38,7 +38,7 @@ We propose accepting a syntactic "superset" of the following existing regular ex

 To our knowledge, all other popular regex engines support a subset of the above syntaxes.

-We also support [UTS#18][uts18]'s full set of character class operators (to our knowledge no other engine does). Beyond that, UTS#18 deals with semantics rather than syntax, and what syntax it uses is covered by the above list. We also parse `\p{javaLowerCase}`, meaning we support a superset of Java 8 as well.
+We also support [UTS#18][uts18]'s full set of character class operators (to our knowledge no other engine does). Beyond that, UTS#18 deals with semantics rather than syntax, and what syntax it uses is covered by the above list. We also parse Java's properties (e.g. `\p{javaLowerCase}`), meaning we support a superset of Java 8 as well.

 Note that there are minor syntactic incompatibilities and ambiguities involved in this approach. Each is addressed in the relevant sections below

@@ -48,19 +48,25 @@ Regex literal interior syntax will be part of Swift's source-compatibility story

 We propose the following syntax for use inside Swift regex literals.

-<details><summary>Grammar Conventions</summary>
+<details><summary>Grammar Notation</summary>

-Elements of the grammar are defined using the syntax `Element -> <Definition>`.
+For the grammar sections, we use a modified PEG-like notation, in which the grammar also describes an unambiguous top-down parsing algorithm.

-Quoted characters e.g `'abc'`, `"abc"` in the grammar match against the literal characters. Unquoted names e.g `Concatenation` refer to other definitions in the grammar.
+- `<Element> -> <Definition>` gives the definition of `Element`
+- The `|` operator specifies a choice of alternatives
+- `'x'` is the literal character `x`, otherwise it's a reference to x
+  + A literal `'` is spelled `"'"`
+- Postfix `*` `+` and `?` denote zero-or-more, one-or-more, and zero-or-one
+- Range quantifiers, like `{1...4}`, use Swift range syntax as convention.
+- Basic custom character classes are written like `[0-9a-zA-Z]`
+- Prefix `!` operator means the next element must not appear (a zero-width assertion)
+- Parenthesis group for the purposes of quantification
+- Builtins use angle brackets:
+  - `<Int>` refers to an integer, `<Char>` a character, etc.
+  - `<Space>` is any whitespace character
+  - `<EOL>` is the end-of-line anchor (e.g. `$` in regex).

-The `|` operator is used to specify that the grammar can match against either branch of the operator, similar to a regular expression. Similarly, `*`, `+`, and `?` are used to quantify an element of the grammar, with the same meaning as in regular expressions. Range quantifiers `{...}` may also be used, though we adopt a more explicit syntax that uses the Swift `..<` & `...` operators, e.g `{1...4}`.
-
-Basic custom character classes may appear in the grammar, and have the same meaning as in a regular expression. For example `[0-9a-zA-Z]` expresses the digits `0` to `9` and the letters `a` to `z` (both upper and lowercase).
-
-The `!` prefix operator is used to specify that the following grammar element must not appear at that position.
-
-Grammar elements may be surrounded by parentheses for the purposes of quantification.
+For example, `(!'|' !')' ConcatComponent)*` means any number (zero or more) occurrences of `ConcatComponent` so long as the initial character is neither a literal `|` nor a literal `)`.

 </details>

@@ -87,8 +93,8 @@ ConcatComponent -> Trivia | Quote | Quantification
 Trivia -> Comment | NonSemanticWhitespace
 Comment -> '(?#' (!')')* ')' | EndOfLineComment

-(extended syntax only) EndOfLineComment -> '#' .*$
-(extended syntax only) NonSemanticWhitespace -> \s+
+(extended syntax only) EndOfLineComment -> '#' (!<EOL> .)* <EOL>
+(extended syntax only) NonSemanticWhitespace -> <Space>+

 Quote -> '\Q' (!'\E' .)* '\E'

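The `UnicodeScalar` rule in the hunk below lists its alternatives with `|`. A minimal Python sketch of one way to read it, assuming the alternatives are tried in the listed order and the first match wins (the usual PEG reading; the diff itself only calls `|` a choice of alternatives). `parse_unicode_scalar` and `_SCALAR_FORMS` are hypothetical names. Two further assumptions: hex digits are restricted to `0-9a-fA-F`, which is narrower than the `HexDigit -> [0-9a-zA-Z]` rule as written, and an empty digit run after `\x` or `\0` is treated as the value 0.

```python
import re

# Ordered alternatives mirroring the UnicodeScalar rule: (pattern, numeric base).
# Assumption: earlier entries take priority, as in a PEG ordered choice.
_SCALAR_FORMS = [
    (re.compile(r"\\u\{([0-9a-fA-F]+)\}"), 16),   # '\u{' HexDigit{1...} '}'
    (re.compile(r"\\u([0-9a-fA-F]{4})"), 16),     # '\u' HexDigit{4}
    (re.compile(r"\\x\{([0-9a-fA-F]+)\}"), 16),   # '\x{' HexDigit{1...} '}'
    (re.compile(r"\\x([0-9a-fA-F]{0,2})"), 16),   # '\x' HexDigit{0...2}
    (re.compile(r"\\U([0-9a-fA-F]{8})"), 16),     # '\U' HexDigit{8}
    (re.compile(r"\\o\{([0-7]+)\}"), 8),          # '\o{' OctalDigit{1...} '}'
    (re.compile(r"\\0([0-7]{0,3})"), 8),          # '\0' OctalDigit{0...3}
]


def parse_unicode_scalar(src, pos=0):
    """Return (scalar value, new position), or None if no alternative matches at pos."""
    for pattern, base in _SCALAR_FORMS:
        match = pattern.match(src, pos)
        if match:
            digits = match.group(1)
            # Assumption: an empty digit run (bare '\x' or '\0') means value 0.
            value = int(digits, base) if digits else 0
            return value, match.end()
    return None
```

Under this reading the order matters: `'\x{'` has to be tried before `'\x'`, because `'\x' HexDigit{0...2}` also succeeds with zero digits and would otherwise stop right before the `{`.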
@@ -212,12 +218,12 @@ Precise definitions of character classes is discussed in [Character Classes for

 ```
 UnicodeScalar -> '\u{' HexDigit{1...} '}'
-| '\u' HexDigit{4}
-| '\x{' HexDigit{1...} '}'
-| '\x' HexDigit{0...2}
-| '\U' HexDigit{8}
-| '\o{' OctalDigit{1...} '}'
-| '\0' OctalDigit{0...3}
+| '\u' HexDigit{4}
+| '\x{' HexDigit{1...} '}'
+| '\x' HexDigit{0...2}
+| '\U' HexDigit{8}
+| '\o{' OctalDigit{1...} '}'
+| '\0' OctalDigit{0...3}

 HexDigit -> [0-9a-zA-Z]
 OctalDigit -> [0-7]
