Skip to content

Fix type soundness for lone-surrogate string literals#3912

Open
Copilot wants to merge 4 commits into
mainfrom
copilot/fix-tsgo-lone-surrogate-issue
Open

Fix type soundness for lone-surrogate string literals#3912
Copilot wants to merge 4 commits into
mainfrom
copilot/fix-tsgo-lone-surrogate-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 16, 2026

Go's string(rune) replaces surrogate code points (U+D800–U+DFFF) with U+FFFD, so all lone surrogates like "\uD800" and "\uDC00" collapsed to the same internal string value, making the type checker treat them as identical types.

const x: "\uD800" = "\uDC00"; // should error, was accepted
  • Scanner (scanner.go): Use encodeSurrogate() (CESU-8) for all lone surrogates in scanEscapeSequence, including both fixed-width \uXXXX and extended \u{XXXX} escape syntax. Different surrogates now produce distinct byte sequences.
  • Printer (utilities.go): Add decodeCESU8OrUTF8 to recognize CESU-8 surrogates during string escaping, and force surrogates to always emit as \uXXXX regardless of NeverAsciiEscape.
  • String comparison (compare.go): Add decodeCESU8OrUTF8 to CompareStringsCaseInsensitive so CESU-8 encoded surrogates are properly decoded and compared instead of being seen as individual invalid UTF-8 bytes.
  • Test: New loneSurrogateStringLiterals.ts covering cross-surrogate assignment errors, same-surrogate identity, extended Unicode escapes (\u{D800} vs \u{DC00}), and mixed syntax equivalence.

Copilot AI and others added 2 commits May 16, 2026 16:40
Use CESU-8 encoding for lone surrogates in string literal scanning so
different surrogates remain distinct. Add CESU-8 surrogate detection in
the printer's escapeStringWorker to properly render them as \uXXXX
escape sequences in output.

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/408ccedd-6e7b-4112-befe-36f10f374c56

Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix tsgo breaking type soundness for lone-surrogate string literals Fix type soundness for lone-surrogate string literals May 16, 2026
Copilot AI requested a review from jakebailey May 16, 2026 17:02
@DanielRosenwasser
Copy link
Copy Markdown
Member

There's a few places that rely on text from strings that we may want adjusted as well. For example, CompareStringsCaseInsensitive calls utf8.DecodeRuneInString, but CompareStringsCaseInsensitive is called by compareDeclarationInfos.

@DanielRosenwasser DanielRosenwasser marked this pull request as ready for review May 18, 2026 19:07
Copilot AI review requested due to automatic review settings May 18, 2026 19:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes handling of lone-surrogate string literal values so distinct surrogate code units no longer collapse to U+FFFD in type checking and declaration/type output.

Changes:

  • Encodes lone \uXXXX surrogate escapes as CESU-8 sentinel bytes in the scanner.
  • Updates printer escaping to recognize CESU-8 surrogate sentinels and emit \uXXXX.
  • Adds a compiler regression test and updates affected baselines.
Show a summary per file
File Description
internal/scanner/scanner.go Preserves lone surrogate escapes as distinct internal values.
internal/printer/utilities.go Prints CESU-8 surrogate sentinels as UTF-16 escape sequences.
testdata/tests/cases/compiler/loneSurrogateStringLiterals.ts Adds regression coverage for surrogate literal assignability.
testdata/baselines/reference/compiler/loneSurrogateStringLiterals.errors.txt Adds expected assignability diagnostics.
testdata/baselines/reference/compiler/loneSurrogateStringLiterals.js Adds JS emit baseline for the new test.
testdata/baselines/reference/compiler/loneSurrogateStringLiterals.symbols Adds symbol baseline for the new test.
testdata/baselines/reference/compiler/loneSurrogateStringLiterals.types Adds type baseline for the new test.
testdata/baselines/reference/compiler/unicodeSurrogatesInStringLiterals.js Updates declaration emit baseline for lone surrogates.
testdata/baselines/reference/compiler/unicodeSurrogatesInStringLiterals.types Updates type display baseline for lone surrogates.

Copilot's findings

  • Files reviewed: 9/9 changed files
  • Comments generated: 2

Comment on lines +1768 to 1772
} else if codePointIsHighSurrogate(codePoint) || codePointIsLowSurrogate(codePoint) {
// Lone surrogate: encode as CESU-8 so that distinct surrogates remain
// distinguishable. Go's string(rune) would replace all surrogates with
// U+FFFD, collapsing e.g. "\uD800" and "\uDC00" into the same value.
return encodeSurrogate(codePoint)
Comment on lines +4 to +5
const highSurrogate: "\uD800" = "\uD800"; // ok
const lowSurrogate: "\uDC00" = "\uDC00"; // ok
@jakebailey
Copy link
Copy Markdown
Member

@copilot apply changes based on the comments in this thread

…pareStringsCaseInsensitive

- Fix extended escape branch in scanEscapeSequence to use encodeSurrogate
  for surrogate code points in \u{XXXX} syntax
- Add decodeCESU8OrUTF8 to CompareStringsCaseInsensitive so CESU-8
  encoded surrogates are properly compared
- Add test cases for \u{D800} vs \u{DC00} and mixed syntax

Agent-Logs-Url: https://github.com/microsoft/typescript-go/sessions/a677b991-fe22-4da6-a336-875eaefb4eff

Co-authored-by: jakebailey <5341706+jakebailey@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 18, 2026

There's a few places that rely on text from strings that we may want adjusted as well. For example, CompareStringsCaseInsensitive calls utf8.DecodeRuneInString, but CompareStringsCaseInsensitive...

Good catch. Fixed in ef2a825CompareStringsCaseInsensitive now uses a decodeCESU8OrUTF8 helper that recognizes the CESU-8 surrogate byte sequences, so different surrogates won't collapse to the same RuneError during comparison.

@copilot apply changes based on the comments in this thread

Done in ef2a825 — the extended escape branch (\u{D800}) now uses encodeSurrogate for surrogate code points, and the test covers \u{D800} vs \u{DC00} as well as mixed syntax (\uD800 vs \u{DC00}).

@jakebailey
Copy link
Copy Markdown
Member

oh no, this is getting spooky

I feel like this all really does say that we need to be consistent about using WTF-8 or something everywhere

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tsgo breaks type soundness for lone-surrogate string literals

4 participants