Commit 1c2b7ad

Flip pitch to /.../ as the main syntax
A quick pass to flip `/.../` out of the alternatives and into the main syntax. Still needs a bunch of work. Also add some commentary on a regex with `]` as the starting character.
1 parent bcebfc6 commit 1c2b7ad

1 file changed

+81
-72
lines changed

Documentation/Evolution/DelimiterSyntax.md

Lines changed: 81 additions & 72 deletions
Original file line number | Diff line number | Diff line change
@@ -12,61 +12,22 @@
1212

1313
## Detailed Design
1414

15-
A regular expression literal will be introduced using `re'...'` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
15+
**TODO: Say that this is Swift 6 syntax only, `#/.../#` would be 5.7 syntax**
16+
17+
A regular expression literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
1618

1719
```
1820
// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
19-
let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'
21+
let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/
2022
```
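
For a sense of what this pattern matches, here is a sketch that runs it through Foundation's string-based `NSRegularExpression`, which exists today. This is not the API being pitched; the input string and the variable names are illustrative assumptions.

```swift
import Foundation

// Same pattern as above, spelled as a raw string so the backslash needs no escaping.
let pattern = #"([[:alpha:]]\w*) = ([0-9A-F]+)"#
let regex = try NSRegularExpression(pattern: pattern)

let input = "color = FF00AA"  // illustrative input, not from the pitch
let whole = NSRange(input.startIndex..., in: input)

var identifier: String?
var hexValue: String?
if let match = regex.firstMatch(in: input, range: whole),
   let idRange = Range(match.range(at: 1), in: input),
   let hexRange = Range(match.range(at: 2), in: input) {
    // Capture group 1 is the identifier, group 2 is the hex number.
    identifier = String(input[idRange])
    hexValue = String(input[hexRange])
}
```

The two capture groups have to be pulled out by index and converted back from `NSRange`, which is the kind of ceremony a literal with compiler-parsed contents could avoid.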
2123

22-
The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples**
23-
24-
### Regex syntax limitations
25-
26-
There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`.
27-
28-
As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax.
29-
30-
## Future Directions
31-
32-
### Raw literals
24+
Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). Because `/` is already used in Swift's comment syntax and operators, there are some syntactic ambiguities to consider. Although there are quite a few such cases, we do not feel that the impact of any individual case is particularly high.
3325

34-
The `re'...'` syntax could be naturally extended to supporting "raw text" through allowing additional `#` characters to surround the quote characters e.g `re#'...'#`. Such literals would follow the same rules as the string literals introduced in [SE-0200].
26+
**TODO: Do we want to present a stronger argument for `/.../`?**
3527

36-
In particular:
28+
**TODO: Anything else we want to say here before segueing into the massive list?**
3729

38-
- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). **TODO: Do we really want to treat backslash as literal? Seems consistent, but escape sequences are frequently used in regex.**
39-
- Any number of `#` characters may surround the literal.
40-
- Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence.
41-
42-
### Multi-line literals
43-
44-
A natural extension to the `re'...'` syntax to support multi-line regex literals would be to allow triple quote syntax:
45-
46-
```
47-
re'''
48-
abc
49-
def
50-
'''
51-
```
52-
53-
This would follow the precedent set by [SE-0168] for multi-line string literals, and obey the same rules, in particular with the stripping of any leading whitespace prior to the position of the closing delimiter.
54-
55-
## Alternatives Considered
56-
57-
### Double quoted `re"...."`
58-
59-
We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference.
60-
61-
### Single letter `r'...'`
62-
63-
We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals.
64-
65-
### Forward slashes `/.../`
66-
67-
Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. Here we give an extensive list of drawbacks to the choice. While no individual issue is terribly bad and each could be overcome, the list of issues is quite long.
68-
69-
#### Parsing ambiguities
30+
### Parsing ambiguities
7031

7132
The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes.
7233

@@ -88,7 +49,7 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes.
8849

8950
- Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required, as in `x + /y/`, for the regex literal interpretation.
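
The infix case in the last bullet can be seen with an ordinary custom operator. In this sketch the `+/` operator and its averaging behavior are hypothetical, declared only to show that the unspaced form lexes as a single operator token:

```swift
// Hypothetical operator, declared only to show how `x+/y` lexes.
infix operator +/ : AdditionPrecedence
func +/ (lhs: Int, rhs: Int) -> Int { (lhs + rhs) / 2 }

let x = 6, y = 2

// No whitespace: `+/` is lexed as one infix operator token.
let unspaced = x+/y

// With whitespace, `+` and `/` are the two built-in operators.
let spaced = x + y / 2
```

Because operator characters are grouped greedily, a `/` that directly follows another operator character never gets the chance to open a literal, which is why the pitch asks for whitespace.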
9051

91-
#### Regex syntax limitations
52+
### Regex syntax limitations
9253

9354
In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character, though the last of these is already invalid regex syntax.
9455

@@ -125,27 +86,20 @@ let x = arr.reduce(1, /) / 5
12586

12687
The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. This should have minimal impact, as this would not be valid regex syntax anyway.
12788

128-
It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma:
129-
130-
```swift
131-
func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {}
132-
foo(/, /)
133-
```
134-
135-
However we feel that starting a regex with a comma is likely to be a common case, and as such we intend to change the parser such that the above becomes a regex literal.
89+
It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section.
13690

13791
</details>
13892

139-
#### Language changes required
93+
### Language changes required
14094

141-
In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes:
95+
In addition to the ambiguities listed above, some parsing ambiguities would require the following language changes in Swift 6 mode:
14296

14397
- Deprecation of prefix operators containing the `/` character.
144-
- Parsing `/,` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case**
98+
- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments.
14599

146100
<details><summary>Rationale</summary>
147101

148-
##### Prefix operators starting with `/`
102+
#### Prefix operators starting with `/`
149103

150104
We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as:
151105

@@ -156,7 +110,7 @@ let z = /^x^/
156110

157111
Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal.
158112

159-
##### Prefix operators containing `/`
113+
#### Prefix operators containing `/`
160114

161115
Prefix operators *containing* `/` (not just at the start) would likely need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g:
162116

@@ -166,49 +120,104 @@ let x = !/y / .foo()
166120

167121
Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing.
168122

169-
##### Comma as the starting character of a regex literal
123+
#### `/,` and `/]` as regex literal openings
170124

171-
As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,`, i.e `/` is used in an argument list before another argument.
125+
As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However, this does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex.
172126

173127
For example:
174128

175129
```swift
130+
// Ambiguity with comma:
176131
func foo(_ x: (Int, Int) -> Int, _ y: (Int, Int) -> Int) {}
177132
foo(/, /)
178-
```
179-
180-
This is currently parsed as 2 unapplied operator arguments. However, given the fact that a regex starting with a comma is not an uncommon case, this will become a regex literal.
181133

182-
The above case seems uncommon, however note this may also occur when the closing `/` appears outside of the argument list, e.g:
134+
// Also affects cases where the closing '/' is outside the argument list.
135+
func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int { 0 }
136+
bar(/, 2) + bar(/, 3)
183137

184-
```swift
185-
foo(/, 2) + foo(/, 3)
138+
// Ambiguity with right square bracket:
139+
struct S {
140+
subscript(_ fn: (Int, Int) -> Int) -> Int { 0 }
141+
}
142+
func baz(_ x: S) -> Int {
143+
x[/] + x[/]
144+
}
186145
```
187146

147+
`foo(/, /)` is currently parsed as two unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literal arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error).
148+
188149
**TODO: Do we want to talk about a heuristic that looks for unbalanced parens? I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.**
189150

190-
This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. If users wish to disambiguate, they will need to surround at least the opening `/` with parentheses, e.g:
151+
To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g:
191152

192153
```swift
193-
foo((/), 2) + foo(/, 3)
154+
foo((/), /)
155+
bar((/), 2) + bar(/, 3)
156+
157+
func baz(_ x: S) -> Int {
158+
x[(/)] + x[/]
159+
}
194160
```
195161

196162
This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`.
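
A minimal sketch of the parenthesized spelling, reusing the `bar` and `S` shapes from the example above. The function and subscript bodies are illustrative assumptions, added only so the results are observable. Every operator reference is wrapped here, which goes beyond the "at least the opening `/`" requirement, so that the snippet reads the same whether or not `/.../` literals are enabled:

```swift
// Hypothetical bodies, not part of the pitch.
func bar(_ fn: (Int, Int) -> Int, _ x: Int) -> Int { fn(12, x) }

struct S {
    subscript(_ fn: (Int, Int) -> Int) -> Int { fn(20, 4) }
}

// In `(/)` the character after `/` is `)`, so no regex literal is
// attempted and `/` stays an unapplied operator reference.
let total = bar((/), 2) + bar((/), 3)   // 12 / 2 + 12 / 3

let s = S()
let viaSubscript = s[(/)] + s[(/)]      // 20 / 4 + 20 / 4
```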
197163

198164
</details>
199165

200-
#### Editor Considerations
166+
### Editor Considerations
167+
168+
**TODO: Rewrite now that `/.../` is the syntax being pitched?**
201169

202170
As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need to encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift.
203171

172+
204173
### Pound slash `#/.../#`
205174

175+
**TODO: This needs to be rewritten to say that it's a transition syntax**
176+
206177
This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade.
207178

208179
However, this option would also have the same block comment issue as `/.../`, where e.g `#/x*/#` nested inside a block comment would prematurely end the comment. Similarly, it's not clear how a multi-line version of the literal would be spelled.
209180

210181
Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`.
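
The raw string half of this comparison is existing [SE-0200] behavior and can be checked directly. The regex half is left out of the sketch because it depends on syntax that is still being pitched. Variable names are illustrative:

```swift
// One `#` in the delimiter: a plain backslash is literal text.
let literalBackslash = #"\n"#   // two characters: a backslash and "n"

// The escape must repeat the delimiter's `#` count to be recognized.
let newline = #"\#n"#           // a single newline character
let doubled = ##"\##n"##        // also a single newline character
```

Under the raw regex spelling described above, the first line's counterpart would still treat `\n` as an escape, which is the inconsistency being pointed out.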
211182

183+
## Future Directions
184+
185+
**TODO: What do we want to say here?**
186+
187+
## Alternatives Considered
188+
189+
### Prefixed quote `re'...'`
190+
191+
**TODO: Do a pass over this to make sure it sounds correct now that it's an alternative**
192+
193+
We could choose to use `re'...'` delimiters, for example:
194+
195+
```
196+
// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
197+
let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'
198+
```
199+
200+
**TODO: Fill in reasons why not to pick this**
201+
202+
**TODO: Mention that it nicely extends to raw and multiline?**
203+
204+
#### Regex syntax limitations
205+
206+
There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`.
207+
208+
As such, the single quote variants of the syntax would be considered invalid in a `re'...'` literal, and users would need to use the alternative syntax instead. If a raw variant of the syntax, `re#'...'#`, is later added, that could also be used. In order to improve diagnostic behavior, the compiler would attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This would enable a more accurate error to be emitted that suggests the alternative syntax.
209+
210+
**TODO: Do we actually want to include the below? They're less relevant if `re'...'` is itself the alternative**
211+
212+
### Double quoted `re"...."`
213+
214+
We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference.
215+
216+
### Single letter `r'...'`
217+
218+
We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals.
219+
220+
**TODO: Add the other alternatives e.g `#regex(...)`**
212221

213222
[SE-0168]: https://github.com/apple/swift-evolution/blob/main/proposals/0168-multi-line-string-literals.md
214223
[SE-0200]: https://github.com/apple/swift-evolution/blob/main/proposals/0200-raw-string-escaping.md
