Commit c7d556c

Elaborate on starting character limitations
1 parent 1049276 commit c7d556c

1 file changed

+38
-12
lines changed

Documentation/Evolution/DelimiterSyntax.md

Lines changed: 38 additions & 12 deletions
@@ -25,7 +25,7 @@ The use of a two letter prefix allows for easy future extensibility of such lite
2525

2626
There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g. `(?<name>)`, `\k<name>`, and `(?C"arg")`.
2727

28-
As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax.
28+
As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax.
2929

3030
## Future Directions
3131

@@ -35,7 +35,7 @@ The `re'...'` syntax could be naturally extended to supporting "raw text" throug
3535

3636
In particular:
3737

38-
- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes).
38+
- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). **TODO: Do we really want to treat backslash as literal? Seems consistent, but escape sequences are frequently used in regex.**
3939
- Any number of `#` characters may surround the literal.
4040
- Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence.
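The rules in this list resemble how Swift's existing raw string literals behave, so they can be previewed today with ordinary strings. The sketch below uses `#"..."#` string literals as a stand-in, on the assumption that the proposed `re#'...'#` syntax (which does not exist yet) would follow the same escaping rules; the constant names are illustrative only.

```swift
// Existing Swift raw *string* literals, used here as a stand-in for the
// proposed raw regex syntax (which is hypothetical at this point).

// Inside #"..."#, the backslash is literal: these are the four characters ' \ n '
let literalBackslash = #"'\n'"#

// An escape needs as many `#` characters as the delimiter to be treated specially.
let oneHashNewline = #"\#n"#      // a single newline character
let twoHashNewline = ##"\##n"##   // also a single newline character

print(literalBackslash.count)     // 4
print(oneHashNewline == "\n")     // true
print(twoHashNewline == "\n")     // true
```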
4141

@@ -70,17 +70,17 @@ Forward slashes are a regex term of art, and are used as the delimiters for rege
7070

7171
The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes.
7272

73-
- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and could be disallowed.
73+
- An empty regex literal would conflict with line comment syntax `//`. But this isn't a particularly useful thing to express, and can therefore be disallowed without significant impact.
7474

7575
- The obvious choice for a multi-line regular expression literal would be to use `///` delimiters, in accordance with the precedent set by multi-line string literals `"""`. However, `///` already introduces a documentation comment, so a different multi-line delimiter would be needed, and there is no obvious choice.
7676

77-
- There is also a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example:
77+
- There is a conflict with block comment syntax, when surrounding a regex literal ending with `*`, for example:
7878

79-
```swift
80-
/*
81-
let regex = /x*/
82-
*/
83-
```
79+
```swift
80+
/*
81+
let regex = /x*/
82+
*/
83+
```
8484

8585
In this case, the block comment would prematurely end on the second line, rather than extending all the way to the third line as the user would expect. This is already an issue today with `*/` in a string literal; however, it is much more likely to occur in a regular expression given the prevalence of the `*` quantifier.
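To make the premature termination concrete, the following sketch holds the commented-out snippet in a string and looks for the first line containing `*/`. It is a deliberately simplified stand-in for the lexer (it ignores nested block comments, which is safe here because the snippet contains only one `/*`), and the names `source`, `lines`, and `closingLine` are illustrative.

```swift
import Foundation

// The commented-out code from the example above, held as plain text.
let source = """
/*
let regex = /x*/
*/
"""

// Simplified model of the lexer: the block comment ends at the first `*/`.
let lines = source.components(separatedBy: "\n")
let closingLine = lines.firstIndex { $0.contains("*/") }

// Zero-based index 1, i.e. the second line, not the third.
print(closingLine.map { $0 + 1 } ?? 0)   // 2
```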
8686

@@ -90,7 +90,11 @@ let regex = /x*/
9090

9191
#### Regex limitations
9292

93-
Another ambiguity with `/.../` arises when it is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example:
93+
In order to help avoid parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character, although the latter is already invalid regex syntax.
94+
95+
<details><summary>Rationale</summary>
96+
97+
This is due to two main ambiguities. The first arises when a `/.../` regex literal is used to start a new line. This is particularly problematic for result builders, where we expect it to be frequently used, for example:
9498

9599
```swift
96100
Builder {
@@ -100,7 +104,7 @@ Builder {
100104
}
101105
```
102106

103-
This is parsed as a single operator chain, however it is likely the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing.
107+
This is parsed as a single operator chain; however, it is likely that the user is expecting a regex literal. To resolve this ambiguity, a regex literal may not start with a space or tab character. This takes advantage of the fact that infix operators require consistent spacing on either side.
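As a minimal sketch of the operator-chain reading, shown outside a result builder for brevity, a line that begins with `/` followed by a space continues the expression from the previous line. The names `total` and `halved` are illustrative.

```swift
let total = 8

// The second line starts with `/ `, so it continues the expression above
// as the infix chain `total / 2 / 1` rather than starting a new statement.
let halved = total
    / 2 / 1

print(halved)   // 4
```

Because the `/` is followed by a space, the rule described above keeps this as division rather than treating it as the start of a regex literal.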
104108

105109
If a space or tab is needed as the first character, it must be escaped, e.g:
106110

@@ -112,7 +116,27 @@ Builder {
112116
}
113117
```
114118

115-
**TODO: Regex starting with `)`**
119+
The second ambiguity arises with Swift's ability to pass an unapplied operator reference as an argument to a function, for example:
120+
121+
```swift
122+
let arr: [Double] = [2, 3, 4]
123+
let x = arr.reduce(1, /) / 5
124+
```
125+
126+
The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regular expression literal. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. Note this would not be valid regex syntax anyway.
127+
128+
This is also applicable to unapplied operator references in parentheses and tuples.
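The following sketch shows the unapplied operator forms that the `)` rule is meant to keep working: as a function argument and in parentheses. It reuses `arr` and `x` from the example above; `divide` is an illustrative name.

```swift
let arr: [Double] = [2, 3, 4]

// `/` as an argument: computes ((1 / 2) / 3) / 4, then divides by 5.
let x = arr.reduce(1, /) / 5

// `/` in parentheses, bound to a function-typed constant.
let divide: (Double, Double) -> Double = (/)

print(abs(x - 1.0 / 120.0) < 1e-12)   // true
print(divide(9, 3))                   // 3.0
```

In both cases the character following the `/` is `)`, so under the rule above neither is considered as the start of a regex literal.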
129+
130+
It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma:
131+
132+
```swift
133+
func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {}
134+
foo(/, /)
135+
```
136+
137+
However, we feel that starting a regex with a comma is likely to be a common case, and as such we intend to change the parser such that the above becomes a regex literal.
138+
139+
</details>
116140

117141
#### Language changes required
118142

@@ -161,6 +185,8 @@ foo(/, /)
161185

162186
**TODO: Or do we want to ban it as the starting character?**
163187

188+
</details>
189+
164190
#### Editor Considerations
165191

166192
As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need to encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift.
