Commit f4e3da2

Fix and clarify CR LF normalization and CR in string literals
This was slightly incorrect before. Relevant commits changing this:

- fa56fdb
- 27e1ec9

The normalization is not applied repeatedly, so CR LF pairs can still exist.

Further, given that the normalization happens before lexing, the part "other than as part of such a string continuation escape" is not useful. Either it was CR LF in the raw input and has already been transformed (so the lexical grammar does not see CR), or there is a surviving CR LF pair after the normalization, which is disallowed anyway.

Here are two test programs showing this behavior:

    printf 'fn main() { "a\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: bare CR not allowed in string, use `\r` instead
     --> <anon>:1:15
      |
    1 | fn main() { "a␍
      |               ^
      |
    help: escape the character
      |
    1 | fn main() { "a\r
      |               ++

And

    printf 'fn main() { "a\\\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: unknown character escape: `\r`
     --> <anon>:1:16
      |
    1 | fn main() { "a\␍
      |                ^ unknown character escape
      |
      = help: this is an isolated carriage return; consider checking your editor and version control settings
1 parent e0625a7 commit f4e3da2

2 files changed: +4 -5 lines changed


src/input-format.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ r[input.crlf]
 ## CRLF normalization

 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+This happens once, not repeatedly, so after the normalization, there can still exist `U+000D` (CR) immediately followed by `U+000A` (LF) in the input (e.g. if the raw input contained "CR CR LF LF").

 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
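
The "once, not repeatedly" behavior described in the added line can be illustrated with a minimal sketch. This is not rustc's implementation: `normalize_crlf` is a hypothetical helper written for this example, and it assumes the normalization is a single left-to-right pass over the input.

```rust
/// Replace each CR LF pair with a single LF in one left-to-right pass.
/// Hypothetical helper for illustration only; not rustc's actual code.
fn normalize_crlf(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            // Consume the LF of the pair and emit only that LF.
            chars.next();
            out.push('\n');
        } else {
            // Any other character, including a lone CR, is kept as-is.
            out.push(c);
        }
    }
    out
}
```

Under that assumption, the raw input `"a\r\r\n\nb"` (CR CR LF LF between `a` and `b`) becomes `"a\r\n\nb"`: only the inner CR LF pair is collapsed, so the remaining CR now sits directly before an LF, which is the surviving pair the commit message refers to.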

src/tokens.md

Lines changed: 3 additions & 5 deletions
@@ -60,8 +60,6 @@ Literals are tokens used in [literal expressions].

 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.

-> [!NOTE]
-> Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).

 #### ASCII escapes

@@ -198,9 +196,9 @@ which must be _escaped_ by a preceding `U+005C` character (`\`).

 r[lex.token.literal.str.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in string literals.
+The character `U+000D` (CR) may not appear in a string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.

 r[lex.token.literal.char-escape]
 #### Character escapes
@@ -323,9 +321,9 @@ below.

 r[lex.token.str-byte.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in byte string literals.
+The character `U+000D` (CR) may not appear in a byte string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.

 r[lex.token.str-byte.escape]
 Some additional _escapes_ are available in either byte or non-raw byte string
@@ -429,9 +427,9 @@ permitted within a C string.

 r[lex.token.str-c.linefeed]
 Line-breaks, represented by the character `U+000A` (LF), are allowed in C string literals.
+The character `U+000D` (CR) may not appear in a C string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.

 r[lex.token.str-c.escape]
 Some additional _escapes_ are available in non-raw C string literals. An escape
