A quick pass to flip `/.../` out of the alternatives
and into the main syntax. Still needs a bunch of
work.
Also add some commentary on a regex with `]` as the
starting character.
Documentation/Evolution/DelimiterSyntax.md
81 additions & 72 deletions
@@ -12,61 +12,22 @@
 
 ## Detailed Design
 
-A regular expression literal will be introduced using `re'...'` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
+**TODO: Say that this is Swift 6 syntax only, `#/.../#` would be 5.7 syntax**
+
+A regular expression literal will be introduced using `/.../` delimiters, within which the compiler will parse a regular expression (the details of which are outlined in [the Regex Syntax pitch][internal-syntax]):
 
 ```
 // Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
-let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'
+let regex = /([[:alpha:]]\w*) = ([0-9A-F]+)/
 ```
 
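To see what the pattern in the snippet above is meant to match, here is a minimal sketch that compiles the same pattern at run time with Foundation's `NSRegularExpression`, as a stand-in for the pitched literal. The helper name `parseAssignment` and the sample input are assumptions made for illustration and do not appear in the pitch.

```swift
import Foundation

// Same pattern as the pitched literal, written as a raw string so that
// the backslash in `\w` does not need escaping.
let pattern = #"([[:alpha:]]\w*) = ([0-9A-F]+)"#

/// Hypothetical helper: returns the identifier and the hex digits if
/// `line` contains "<identifier> = <hexadecimal value>", otherwise nil.
func parseAssignment(_ line: String) -> (identifier: String, hex: String)? {
    let fullRange = NSRange(line.startIndex..., in: line)
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []),
          let match = regex.firstMatch(in: line, options: [], range: fullRange),
          let idRange = Range(match.range(at: 1), in: line),
          let hexRange = Range(match.range(at: 2), in: line)
    else { return nil }
    return (String(line[idRange]), String(line[hexRange]))
}

if let result = parseAssignment("count = 1F") {
    print(result.identifier, result.hex)
}
```

With the pitched syntax the pattern would be checked at compile time, whereas this stand-in only reports a malformed pattern when it runs.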
-The use of a two letter prefix allows for easy future extensibility of such literals, by allowing different prefixes to indicate different types of literal. **TODO: examples**
-
-### Regex syntax limitations
-
-There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`.
-
-As such, the single quote variants of the syntax will be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax `re#'...'#` of the syntax is later added, that may also be used. In order to improve diagnostic behavior, the compiler will attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This will enable a more accurate error to be emitted that suggests the alternative syntax.
-
-## Future Directions
-
-### Raw literals
+Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). Due to its existing use in comment syntax and operators, there are some syntactic ambiguities to consider. While there are quite a few cases to consider, we do not feel that the impact of any individual case is particularly high.
 
-The `re'...'` syntax could be naturally extended to supporting "raw text" through allowing additional `#` characters to surround the quote characters e.g `re#'...'#`. Such literals would follow the same rules as the string literals introduced in [SE-0200].
+**TODO: Do we want to present a stronger argument for `/.../`?**
 
-In particular:
+**TODO: Anything else we want to say here before segueing into the massive list?**
 
-- `\` and `'` characters would become literal, e.g `re#''\n''#` expresses a regular expression pattern that literally matches against the characters `'\n'` (including the quotes). **TODO: Do we really want to treat backslash as literal? Seems consistent, but escape sequences are frequently used in regex.**
-- Any number of `#` characters may surround the literal.
-- Escape sequences would require the same number of `#` characters as in the delimiter to be treated specially. For example, `re##'\##n'##` would be required for a newline character sequence.
-
-### Multi-line literals
-
-A natural extension to the `re'...'` syntax to support multi-line regex literals would be to allow triple quote syntax:
-
-```
-re'''
-abc
-def
-'''
-```
-
-This would follow the precedent set by [SE-0168] for multi-line string literals, and obey the same rules, in particular with the stripping of any leading whitespace prior to the position of the closing delimiter.
-
-## Alternatives Considered
-
-### Double quoted `re"...."`
-
-We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference.
-
-### Single letter `r'...'`
-
-We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals.
-
-### Forward slashes `/.../`
-
-Forward slashes are a regex term of art, and are used as the delimiters for regex literals in Perl, JavaScript and Ruby (though Perl and Ruby also provide alternative choices). However, they would be an awkward fit in Swift's language grammar, and would not provide a path for extensibility. Here we give an extensive list of drawbacks to the choice. While no individual issue is terribly bad and each could be overcome, the list of issues is quite long.
-
-#### Parsing ambiguities
+### Parsing ambiguities
 
 The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes.
 
@@ -88,7 +49,7 @@ The obvious parsing ambiguity with `/.../` delimiters is with comment syntaxes.
 
 - Finally, there would be a minor ambiguity with infix operators used with regex literals. When used without whitespace, e.g `x+/y/`, the expression will be treated as using an infix operator `+/`. Whitespace is therefore required `x + /y/` for regex literal interpretation.
 
-#### Regex syntax limitations
+### Regex syntax limitations
 
 In order to help avoid further parsing ambiguities, a regex literal will not be parsed if it starts with a space, tab, or `)` character. Though the latter is already invalid regex syntax.
 
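As a small illustration of the infix case in this hunk: Swift lexes a run of operator characters as a single token, so `x+/y` resolves to one `+/` operator rather than `+` followed by `/`. The sketch below declares a hypothetical `+/` operator purely to make that visible; the operator and its meaning are assumptions and not part of the pitch.

```swift
// Hypothetical operator, declared only to show how `+/` is lexed.
infix operator +/ : AdditionPrecedence

func +/ (lhs: Int, rhs: Int) -> Int {
    // Arbitrary meaning chosen for the demonstration.
    lhs + rhs / 2
}

let x = 10
let y = 4

// Without whitespace the lexer sees a single `+/` token, so this calls
// the custom operator: 10 + 4 / 2.
let fused = x+/y

// With whitespace around `+`, the ordinary addition operator is used.
let spaced = x + y

print(fused, spaced)
```

This is why the text requires whitespace, as in `x + /y/`, before a `/` can be considered the start of a regex literal.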
@@ -125,27 +86,20 @@ let x = arr.reduce(1, /) / 5
 
 The `/` in the call to `reduce` is in a valid expression context, and as such could be parsed as a regex literal. This is also applicable to operators in tuples and parentheses. To help mitigate this ambiguity, a regex literal will not be parsed if the first character is `)`. This should have minimal impact, as this would not be valid regex syntax anyway.
 
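For reference, a minimal sketch of the kind of code this paragraph is about, assuming a toolchain on which bare `/.../` literals are not enabled, so that `/` is always an operator. The array contents and variable names are made up for the example.

```swift
let values: [Double] = [2, 4]

// Unapplied `/` passed to `reduce`, followed by an infix `/`.
// The first `/` is immediately followed by `)`, which is the case the
// mitigation above keeps as an operator reference.
let bare = values.reduce(64, /) / 2

// Parenthesized operator reference; here too the `/` is followed by `)`.
let wrapped = values.reduce(64, (/)) / 2

print(bare, wrapped)
```

Both spellings compute `((64 / 2) / 4) / 2`, so the mitigation would leave this common pattern unchanged.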
-It should be noted that this only mitigates the issue, as another ambiguity arises if the next character is a comma:
-However we feel that starting a regex with a comma is likely to be a common case, and as such we intend to change the parser such that the above becomes a regex literal.
+It should be noted that this only mitigates the issue, as it does not handle the case where the next character is a comma or right square bracket. These cases are explored further in the following section.
 
 </details>
 
-#### Language changes required
+### Language changes required
 
-In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes:
+In addition to ambiguities listed above, there are also some parsing ambiguities that would require the following language changes in Swift 6 mode:
 
 - Deprecation of prefix operators containing the `/` character.
-- Parsing `/,` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments. **TODO: Or do we want to ban it as the starting character? Seems like a common regex case**
+- Parsing `/,` and `/]` as the start of a regex literal if a closing `/` is found, rather than an unapplied operator in an argument list. For example, `fn(/, /)` becomes a regex literal rather than 2 unapplied operator arguments.
 
 <details><summary>Rationale</summary>
 
-##### Prefix operators starting with `/`
+#### Prefix operators starting with `/`
 
 We'd need to ban prefix operators starting with `/`, to avoid ambiguity with cases such as:
 
@@ -156,7 +110,7 @@ let z = /^x^/
 
 Postfix `/` operators would be okay, as they'd only be treated as regex literal delimiters if we were already trying to lex as a regex literal.
 
-##### Prefix operators containing `/`
+#### Prefix operators containing `/`
 
 Prefix operators *containing* `/` (not just at the start) would likely need banning too, in order to allow prefix operators to be used with regex literals in an unambiguous way, e.g:
 
@@ -166,49 +120,104 @@ let x = !/y / .foo()
 
 Otherwise it would be interpreted as the prefix operator `!/` by default, and require parens `!(/y /)` for regex parsing.
 
-##### Comma as the starting character of a regex literal
+#### `/,` and `/]` as regex literal openings
 
-As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,`, i.e `/` is used in an argument list before another argument.
+As stated previously, there is a parsing ambiguity with unapplied operators in argument lists, tuples, and parentheses. Some of these cases can be mitigated by not parsing a regex literal if the starting character is `)`. However it does not solve the issue when the next character is `,` or `]`. Both of these are valid regex starting characters, and comma in particular may be a fairly common case for a regex.
-This is currently parsed as 2 unapplied operator arguments. However, given the fact that a regex starting with a comma is not an uncommon case, this will become a regex literal.
 
-The above case seems uncommon, however note this may also occur when the closing `/` appears outside of the argument list, e.g:
+// Also affects cases where the closing '/' is outside the argument list.
+`foo(/, /)` is currently parsed as 2 unapplied operator arguments. `bar(/, 2) + bar(/, 3)` is currently parsed as two independent calls that each take an unapplied `/` operator reference. Both of these would become regex literal arguments, `/, /` and `/, 2) + bar(/` respectively (though the latter would produce a regex error).
+
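The spans named above can be reproduced with a toy scan, sketched here under heavy assumptions: it works on a single line of text, ignores escapes and character classes, ignores the context before the opening `/`, and the function name `regexLiteralContents` is hypothetical. It is not the compiler's lexer, only the two stated rules applied literally.

```swift
/// Toy scan: returns the text between the first `/` that may open a
/// literal and the next `/`, or nil if no such pair exists.
func regexLiteralContents(in source: String) -> String? {
    let chars = Array(source)
    for (i, c) in chars.enumerated() where c == "/" {
        let next = i + 1
        guard next < chars.count else { return nil }
        // Rule from the text: no literal if it starts with a space,
        // a tab, or `)`.
        if chars[next] == " " || chars[next] == "\t" || chars[next] == ")" {
            continue
        }
        // Rule from the text: `/,` and `/]` open a literal only if a
        // closing `/` is found.
        if let close = chars[next...].firstIndex(of: "/") {
            return String(chars[next..<close])
        }
        return nil
    }
    return nil
}

print(regexLiteralContents(in: "foo(/, /)") ?? "no literal")
print(regexLiteralContents(in: "bar(/, 2) + bar(/, 3)") ?? "no literal")
print(regexLiteralContents(in: "foo((/), /)") ?? "no literal")
```

Under these assumptions the parenthesized spelling `(/)` yields no literal, because each remaining `/` is either followed by `)` or has no closing `/` after it.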
 **TODO: Do we want to talk about a heuristic that looks for unbalanced parens? I'm kind of hesitant to implement that, as it would have edge cases and might screw with regex errors that should be diagnosed as invalid regex, rather than some cryptic Swift syntactic error. Which would also make it harder to explain to users.**
 
-This would also become a regex literal, i.e it would be parsed as the argument `/, 2) + foo(/`. If users wish to disambiguate, they will need to surround at least the opening `/` with parentheses, e.g:
+To disambiguate these cases, users will need to surround at least the opening `/` with parentheses, e.g:
 
 ```swift
-foo((/), 2) + foo(/, 3)
+foo((/), /)
+bar((/), 2) + bar(/, 3)
+
+func baz(_ x: S) -> Int {
+  x[(/)] + x[/]
+}
 ```
 
 This takes advantage of the fact that a regex literal will not be parsed if the first character is `)`.
 
 </details>
 
-
#### Editor Considerations
166
+
### Editor Considerations
167
+
168
+
**TODO: Rewrite now that `/.../` is the syntax being pitched?**
201
169
202
170
As described above, there would be a lot involved in handling the parsing ambiguities with `/.../` delimiters. It's one thing to do this in the compiler. But the language also has to be understood by a plethora of source code editors. Those editors either need encode all those ambiguities, or they need to provide a "best effort" at handling the most common cases. It's all too common for editors to take the "best effort" route. There's a long history of complaints with editors that don't completely support a language's features. And indeed, there's plenty of history of editors that don't correctly support regular expression literals in other languages. By choosing a literal that is easily parsed, we should avoid seeing those complaints regarding Swift.
203
171
172
+
204
173
### Pound slash `#/.../#`
205
174
175
+
**TODO: This needs to be rewritten to say that it's a transition syntax**
176
+
206
177
This would be less syntactically ambiguous than `/.../`, while retaining some of the term-of-art familiarity. It would also provide a natural path through which to introduce `/.../` in a new language mode, as users could drop the `#` characters once they upgrade.
207
178
208
179
However this option would also have the same block comment issue as `/.../` where e.g `#/x*/#` nested inside a block comment would prematurely end. Similarly, it's not clear how a multi-line version of the literal would be spelled.
209
180
210
181
Additionally, introducing this syntax would introduce an inconsistency with raw string literal syntax, as `#/.../#` on its own would not treat backslashes as literal, unlike `#"..."#`. If raw regex syntax were implemented, it would start at `##/.../##`. With raw strings, escape sequences must use the same number of `#`s as the delimiter, e.g `#"\#n"#` for a newline. However for raw regex literals it would be one fewer `#` than the delimiter e.g `##/\#n/##`.
211
182
+## Future Directions
+
+**TODO: What do we want to say here?**
+
+## Alternatives Considered
+
+### Prefixed quote `re'...'`
+
+**TODO: Do a pass over this to make sure it sounds correct now that it's an alternative**
+
+We could choose to use `re'...'` delimiters, for example:
+
+```
+// Matches "<identifier> = <hexadecimal value>", extracting the identifier and hex number
+let regex = re'([[:alpha:]]\w*) = ([0-9A-F]+)'
+```
+
+**TODO: Fill in reasons why not to pick this**
+
+**TODO: Mention that it nicely extends to raw and multiline?**
+
+#### Regex syntax limitations
+
+There are a few items of regex grammar that use the single quote character as a metacharacter. These include named group definitions and references such as `(?'name')`, `(?('name'))`, `\g'name'`, `\k'name'`, as well as callout syntax `(?C'arg')`. The use of a single quote conflicts with the `re'...'` delimiter as it will be considered the end of the literal. Fortunately, alternative syntax exists for all of these constructs, e.g `(?<name>)`, `\k<name>`, and `(?C"arg")`.
+
+As such, the single quote variants of the syntax would be considered invalid in a `re'...'` literal, and users must use the alternative syntax instead. If a raw variant of the syntax, `re#'...'#`, is later added, that may also be used. In order to improve diagnostic behavior, the compiler would attempt to scan ahead when encountering the ending sequences `(?`, `(?(`, `\g`, `\k` and `(?C`. This would enable a more accurate error to be emitted that suggests the alternative syntax.
+
+**TODO: Do we actually want to include the below? They're less relevant if `re'...'` is itself the alternative**
+
+### Double quoted `re"...."`
+
+We could choose to use double quotes instead of single quotes. This would be similar in appearance to string literals, however it could be argued that regex literals are distinct from string literals in that they introduce their own specific language to parse. As such, regex literals are more like "program literals" than "data literals", and the use of single quote instead of double quote could express this difference.
+
+### Single letter `r'...'`
+
+We could choose to shorten the literal prefix to just `r`. However this could potentially be confused to mean "raw", especially as Python uses this syntax for raw strings. The syntax `re'...'` could also set the precedent for a 2 letter namespace for future literals.
+
+**TODO: Add the other alternatives e.g `#regex(...)`**