Fix and clarify CR LF normalization and CR in string literals

LukasKalbertodt · LukasKalbertodt · commit f4e3da2b4afe · 2025-07-25T14:19:15.000+02:00
This was slightly incorrect before. Relevant commits changing this:
- fa56fdb
- 27e1ec9

The normalization is not applied repeatedly, so CR LF pairs can still exist after it. Further, given that the normalization happens before lexing, the part "other than as part of such a string continuation escape" is not useful. Either the CR LF was in the raw input and has already been transformed (so the lexical grammar does not see the CR), or a CR LF pair survives the normalization, which is disallowed anyway.

Here are two test programs showing this behavior:

    printf 'fn main() { "a\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: bare CR not allowed in string, use `\r` instead
     --> <anon>:1:15
      |
    1 | fn main() { "a␍
      |               ^
      |
    help: escape the character
      |
    1 | fn main() { "a\r
      |               ++

And

    printf 'fn main() { "a\\\r\r\n\nb"; }' > code.rs | rustc -

Results in:

    error: unknown character escape: `\r`
     --> <anon>:1:16
      |
    1 | fn main() { "a\␍
      |                ^ unknown character escape
      |
      = help: this is an isolated carriage return; consider checking your editor and version control settings
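
The single-pass behavior described above can be modeled in a few lines of Rust. This is only a sketch of the rule as worded in this commit, not rustc's actual implementation; the function name `normalize_crlf_once` is hypothetical. It uses the same string content as the first test program to show that a CR LF pair survives one pass.

```rust
/// Hypothetical model of the CRLF normalization rule (not rustc's code):
/// each CR immediately followed by LF becomes a single LF, in one pass.
fn normalize_crlf_once(raw: &str) -> String {
    // `str::replace` replaces non-overlapping matches left to right and
    // does not rescan the text it has already produced.
    raw.replace("\r\n", "\n")
}

fn main() {
    // Same string content as in the first test program: a CR CR LF LF b
    let raw = "a\r\r\n\nb";
    let normalized = normalize_crlf_once(raw);

    // The second CR and the first LF were merged into one LF ...
    assert_eq!(normalized, "a\r\n\nb");
    // ... which leaves the first CR directly in front of an LF.
    assert!(normalized.contains("\r\n"));

    // Only a second pass (which the rule does not perform) removes it.
    assert_eq!(normalize_crlf_once(&normalized), "a\n\nb");

    println!("{:?}", normalized);
}
```

The surviving CR is what the lexer then rejects inside the string literal, matching the "bare CR not allowed in string" error shown above.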
diff --git a/src/input-format.md b/src/input-format.md
@@ -24,6 +24,7 @@ r[input.crlf]
 ## CRLF normalization
 
 Each pair of characters `U+000D` (CR) immediately followed by `U+000A` (LF) is replaced by a single `U+000A` (LF).
+This happens once, not repeatedly, so after the normalization, there can still exist `U+000D` (CR) immediately followed by `U+000A` (LF) in the input (e.g. if the raw input contained "CR CR LF LF").
 
 Other occurrences of the character `U+000D` (CR) are left in place (they are treated as [whitespace]).
 
diff --git a/src/tokens.md b/src/tokens.md
@@ -60,8 +60,6 @@ Literals are tokens used in [literal expressions].
 
 [^nsets]: The number of `#`s on each side of the same literal must be equivalent.
 
-> [!NOTE]
-> Character and string literal tokens never include the sequence of `U+000D` (CR) immediately followed by `U+000A` (LF): this pair would have been previously transformed into a single `U+000A` (LF).
 
 #### ASCII escapes
 
@@ -198,9 +196,9 @@ which must be _escaped_ by a preceding `U+005C` character (`\`).
 
 r[lex.token.literal.str.linefeed]
 Line-breaks, represented by the  character `U+000A` (LF), are allowed in string literals.
+The character `U+000D` (CR) may not appear in a string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a string literal other than as part of such a string continuation escape.
 
 r[lex.token.literal.char-escape]
 #### Character escapes
@@ -323,9 +321,9 @@ below.
 
 r[lex.token.str-byte.linefeed]
 Line-breaks, represented by the  character `U+000A` (LF), are allowed in byte string literals.
+The character `U+000D` (CR) may not appear in a byte string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a byte string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-byte.escape]
 Some additional _escapes_ are available in either byte or non-raw byte string
@@ -429,9 +427,9 @@ permitted within a C string.
 
 r[lex.token.str-c.linefeed]
 Line-breaks, represented by the  character `U+000A` (LF), are allowed in C string literals.
+The character `U+000D` (CR) may not appear in a C string literal.
 When an unescaped `U+005C` character (`\`) occurs immediately before a line break, the line break does not appear in the string represented by the token.
 See [String continuation escapes] for details.
-The character `U+000D` (CR) may not appear in a C string literal other than as part of such a string continuation escape.
 
 r[lex.token.str-c.escape]
 Some additional _escapes_ are available in non-raw C string literals. An escape