Fix type soundness for lone-surrogate string literals #3912
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/408ccedd-6e7b-4112-befe-36f10f374c56
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
Use CESU-8 encoding for lone surrogates in string literal scanning so different surrogates remain distinct. Add CESU-8 surrogate detection in the printer's escapeStringWorker to properly render them as \uXXXX escape sequences in output.
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/408ccedd-6e7b-4112-befe-36f10f374c56
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
There are a few places that rely on text from strings that we may want adjusted as well. For example,
Pull request overview
This PR fixes handling of lone-surrogate string literal values so distinct surrogate code units no longer collapse to U+FFFD in type checking and declaration/type output.
Changes:
- Encodes lone `\uXXXX` surrogate escapes as CESU-8 sentinel bytes in the scanner.
- Updates printer escaping to recognize CESU-8 surrogate sentinels and emit `\uXXXX`.
- Adds a compiler regression test and updates affected baselines.
Show a summary per file
| File | Description |
|---|---|
| `internal/scanner/scanner.go` | Preserves lone surrogate escapes as distinct internal values. |
| `internal/printer/utilities.go` | Prints CESU-8 surrogate sentinels as UTF-16 escape sequences. |
| `testdata/tests/cases/compiler/loneSurrogateStringLiterals.ts` | Adds regression coverage for surrogate literal assignability. |
| `testdata/baselines/reference/compiler/loneSurrogateStringLiterals.errors.txt` | Adds expected assignability diagnostics. |
| `testdata/baselines/reference/compiler/loneSurrogateStringLiterals.js` | Adds JS emit baseline for the new test. |
| `testdata/baselines/reference/compiler/loneSurrogateStringLiterals.symbols` | Adds symbol baseline for the new test. |
| `testdata/baselines/reference/compiler/loneSurrogateStringLiterals.types` | Adds type baseline for the new test. |
| `testdata/baselines/reference/compiler/unicodeSurrogatesInStringLiterals.js` | Updates declaration emit baseline for lone surrogates. |
| `testdata/baselines/reference/compiler/unicodeSurrogatesInStringLiterals.types` | Updates type display baseline for lone surrogates. |
Copilot's findings
- Files reviewed: 9/9 changed files
- Comments generated: 2
    } else if codePointIsHighSurrogate(codePoint) || codePointIsLowSurrogate(codePoint) {
        // Lone surrogate: encode as CESU-8 so that distinct surrogates remain
        // distinguishable. Go's string(rune) would replace all surrogates with
        // U+FFFD, collapsing e.g. "\uD800" and "\uDC00" into the same value.
        return encodeSurrogate(codePoint)
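To make the collapse described in that comment concrete, here is a minimal sketch of the idea. `encodeSurrogateSketch` is a hypothetical stand-in written for illustration; the PR's actual `encodeSurrogate` is not shown in this thread and may differ. It assumes the encoding is simply the generic 3-byte UTF-8 bit layout applied to the surrogate code unit, which is how CESU-8 represents a single surrogate.

```go
package main

import "fmt"

// encodeSurrogateSketch is a hypothetical illustration, not the PR's code.
// It writes a surrogate code unit (U+D800..U+DFFF) using the generic 3-byte
// UTF-8 bit layout, so each surrogate gets its own byte sequence.
func encodeSurrogateSketch(cp rune) string {
	return string([]byte{
		0xE0 | byte(cp>>12),       // always 0xED for surrogates
		0x80 | byte((cp>>6)&0x3F), // 0xA0..0xBF for surrogates
		0x80 | byte(cp&0x3F),      // 0x80..0xBF
	})
}

func main() {
	hi, lo := rune(0xD800), rune(0xDC00)

	// Go's rune-to-string conversion maps both surrogates to U+FFFD,
	// so the two results compare equal.
	fmt.Println(string(hi) == string(lo)) // true

	// The 3-byte form keeps them distinct.
	fmt.Printf("% X\n", encodeSurrogateSketch(hi)) // ED A0 80
	fmt.Printf("% X\n", encodeSurrogateSketch(lo)) // ED B0 80
}
```

The resulting bytes are deliberately not valid UTF-8, which is why every consumer of the string (printer, comparisons) has to know about the convention.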
    const highSurrogate: "\uD800" = "\uD800"; // ok
    const lowSurrogate: "\uDC00" = "\uDC00"; // ok
@copilot apply changes based on the comments in this thread
…pareStringsCaseInsensitive
- Fix extended escape branch in scanEscapeSequence to use encodeSurrogate
for surrogate code points in \u{XXXX} syntax
- Add decodeCESU8OrUTF8 to CompareStringsCaseInsensitive so CESU-8
encoded surrogates are properly compared
- Add test cases for \u{D800} vs \u{DC00} and mixed syntax
Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/a677b991-fe22-4da6-a336-875eaefb4eff
Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
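For readers following along, the decoding side mentioned in that commit can be sketched roughly as below. `decodeSketch` is a hypothetical stand-in for `decodeCESU8OrUTF8`; its name, signature, and exact byte checks are assumptions, since the real function body is not quoted here. The point it illustrates is that the standard decoder rejects the 3-byte surrogate form one byte at a time, so a comparison loop needs a surrogate-aware decode step first.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeSketch is a hypothetical illustration, not the PR's code.
// It recognizes a 3-byte CESU-8 surrogate (ED A0..BF 80..BF) at the start
// of s and otherwise falls back to ordinary UTF-8 decoding.
func decodeSketch(s string) (rune, int) {
	if len(s) >= 3 && s[0] == 0xED &&
		s[1] >= 0xA0 && s[1] <= 0xBF &&
		s[2] >= 0x80 && s[2] <= 0xBF {
		return 0xD000 | rune(s[1]&0x3F)<<6 | rune(s[2]&0x3F), 3
	}
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "\xED\xA0\x80" // CESU-8 bytes for U+D800

	// The standard decoder treats the first byte as invalid on its own.
	r, n := utf8.DecodeRuneInString(s)
	fmt.Printf("%U %d\n", r, n) // U+FFFD 1

	// The surrogate-aware decoder recovers the original code unit.
	r, n = decodeSketch(s)
	fmt.Printf("%U %d\n", r, n) // U+D800 3
}
```

A case-insensitive comparison that steps through both strings with a decoder like this would see one code unit per surrogate rather than three unrelated invalid bytes.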
Good catch. Fixed in ef2a825 —
Done in ef2a825 — the extended escape branch ( |
Oh no, this is getting spooky. I feel like this all really does say that we need to be consistent about using WTF-8 or something everywhere.
Go's `string(rune)` replaces surrogate code points (U+D800–U+DFFF) with U+FFFD, so all lone surrogates like `"\uD800"` and `"\uDC00"` collapsed to the same internal string value, making the type checker treat them as identical types.
- `scanner.go`: Use `encodeSurrogate()` (CESU-8) for all lone surrogates in `scanEscapeSequence`, including both fixed-width `\uXXXX` and extended `\u{XXXX}` escape syntax. Different surrogates now produce distinct byte sequences.
- `utilities.go`: Add `decodeCESU8OrUTF8` to recognize CESU-8 surrogates during string escaping, and force surrogates to always emit as `\uXXXX` regardless of `NeverAsciiEscape`.
- `compare.go`: Add `decodeCESU8OrUTF8` to `CompareStringsCaseInsensitive` so CESU-8 encoded surrogates are properly decoded and compared instead of being seen as individual invalid UTF-8 bytes.
- `loneSurrogateStringLiterals.ts`: Covers cross-surrogate assignment errors, same-surrogate identity, extended Unicode escapes (`\u{D800}` vs `\u{DC00}`), and mixed syntax equivalence.
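As a rough illustration of the printer-side behavior described above, the sketch below walks a string and rewrites any 3-byte CESU-8 surrogate as a `\uXXXX` escape while passing everything else through unchanged. `escapeLoneSurrogatesSketch` is a hypothetical name; the real `escapeStringWorker` also handles quotes, control characters, and `NeverAsciiEscape`, none of which is modeled here, and the uppercase hex digits are an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// escapeLoneSurrogatesSketch is a hypothetical illustration, not the PR's code.
// It emits \uXXXX for each CESU-8 encoded surrogate and copies all other
// bytes through untouched.
func escapeLoneSurrogatesSketch(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		if i+2 < len(s) && s[i] == 0xED &&
			s[i+1] >= 0xA0 && s[i+1] <= 0xBF &&
			s[i+2] >= 0x80 && s[i+2] <= 0xBF {
			cp := 0xD000 | rune(s[i+1]&0x3F)<<6 | rune(s[i+2]&0x3F)
			fmt.Fprintf(&b, `\u%04X`, cp)
			i += 3
			continue
		}
		// Ordinary text: copy one UTF-8 sequence (or one invalid byte).
		_, size := utf8.DecodeRuneInString(s[i:])
		b.WriteString(s[i : i+size])
		i += size
	}
	return b.String()
}

func main() {
	fmt.Println(escapeLoneSurrogatesSketch("a\xED\xA0\x80b")) // a\uD800b
}
```

Always escaping surrogates, as the description says the printer now does, means the sentinel bytes stay internal and never reach emitted files.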