Fix accented character rendering (á, é, ñ, €, etc.) with Standard 14 fonts#9
Conversation
…ndard 14 fonts
- Add `getEncodingForStandard14()` to select the correct encoding per font
- Add `isWinAnsiStandard14()` to distinguish Symbol/ZapfDingbats
- Extend the `CHAR_TO_GLYPH` map with all WinAnsi non-ASCII characters (0x80–0x9F and 0xA0–0xFF ranges), fixing width measurement for accented text like é, ñ, ü, €, etc.
…d 14 fonts
Three compounding bugs caused accented characters (á, é, ñ, ö, €, etc.) to render as mojibake with Standard 14 fonts like Helvetica:
1. Wrong text encoding: used PDFDocEncoding instead of WinAnsiEncoding
2. UTF-8 round-trip corruption: `Operator.toString()` → `TextDecoder` (UTF-8) destroyed non-ASCII bytes when re-encoded via `TextEncoder`
3. Missing `/Encoding` in font dict: viewers fell back to StandardEncoding

Fix:
- `encodeTextForFont()` now uses WinAnsiEncoding (or SymbolEncoding/ZapfDingbatsEncoding) with hex-format PdfString output
- Unencodable characters substitute with `.notdef` (byte 0x00)
- `appendOperators()` uses `Operator.toBytes()` directly, bypassing the string intermediate that caused UTF-8 corruption
- `createContentStream`/`appendContent`/`prependContent` accept `string | Uint8Array` for the broad bytes-first pipeline refactor
- `addFontResource()` adds `/Encoding WinAnsiEncoding` for the Helvetica/Times/Courier families (omitted for Symbol/ZapfDingbats per spec)
- `ContentAppender` type updated to `string | Uint8Array`
…14 fonts
29 tests covering:
- Font encoding selection (WinAnsi vs Symbol vs ZapfDingbats)
- Glyph name mapping for accented/non-ASCII characters
- Width measurement correctness for accented text
- Font dict `/Encoding` verification
- Hex string encoding in content streams
- Unencodable character `.notdef` substitution
- Round-trip PDF generation with all font families
- Bytes pipeline backward compatibility (shapes, paths, images)
Pull request overview
This PR fixes corrupted rendering of accented and Latin-1 characters (á, é, ñ, €, curly quotes, em dashes, etc.) when using Standard 14 fonts like Helvetica, Times-Roman, and Courier. The fix addresses four compounding bugs: wrong text encoding (PDFDocEncoding instead of WinAnsiEncoding), UTF-8 round-trip corruption through the content stream pipeline, missing /Encoding entries in font dictionaries, and incorrect width measurements for accented characters.
Changes:
- Implements proper font encoding selection (WinAnsi for Helvetica/Times/Courier, Symbol/ZapfDingbats for those respective fonts)
- Refactors the content stream pipeline to work with `Uint8Array` throughout, eliminating UTF-8 corruption
- Adds `/Encoding /WinAnsiEncoding` to Standard 14 font dictionaries (except Symbol/ZapfDingbats)
- Extends the `CHAR_TO_GLYPH` map to cover all WinAnsi non-ASCII characters for correct width measurement
- Uses hex-format PdfStrings for Standard 14 text as defense-in-depth against encoding transformations
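The `string | Uint8Array` refactor can be illustrated with a minimal sketch. The function name `toContentBytes` is illustrative, not the PR's actual API; it shows the idea of normalizing content to bytes exactly once at the boundary, so raw operator bytes never pass through a lossy string intermediate:

```typescript
// Sketch of the bytes-first idea (illustrative names, not the PR's API):
// content may arrive as a string (legacy callers) or as raw bytes, and
// is normalized to Uint8Array once, at the boundary.
function toContentBytes(content: string | Uint8Array): Uint8Array {
  // Strings reaching this point are operator text that is pure ASCII by
  // construction (hex PdfStrings), so UTF-8 encoding them is lossless.
  return typeof content === "string" ? new TextEncoder().encode(content) : content;
}

const fromString = toContentBytes("BT /F1 12 Tf ET");
const fromBytes = toContentBytes(new Uint8Array([0x71])); // "q" operator

console.log(fromString.length); // 15
console.log(fromBytes[0]); // 113 (0x71)
```

Callers that already hold bytes pass them through untouched, which is what makes the pipeline backward compatible with existing string-based drawing code.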
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/fonts/standard-14.ts` | Adds encoding helper functions (`getEncodingForStandard14`, `isWinAnsiStandard14`) and extends the `CHAR_TO_GLYPH` map with ~95 WinAnsi non-ASCII entries |
| `src/api/pdf-page.ts` | Refactors `encodeTextForFont()` to use proper font encodings, updates `appendContent`/`prependContent`/`createContentStream` to accept bytes, implements a bytes-first `appendOperators()`, and adds `/Encoding` to font dicts |
| `src/api/drawing/path-builder.ts` | Updates the `ContentAppender` type to accept `string \| Uint8Array` for backward-compatible bytes support |
| `src/api/drawing/latin1-encoding.test.ts` | Adds comprehensive coverage with 29 tests covering encoding selection, glyph mapping, font dictionary structure, content stream encoding, and round-trip rendering |
Summary
Fixes corrupted rendering of accented/Latin-1 characters (á, é, ñ, ö, €, curly quotes, em dashes, etc.) when using Standard 14 fonts like Helvetica, Times-Roman, and Courier.
Problem
Three compounding bugs caused non-ASCII characters to render as mojibake:
1. Wrong text encoding: `encodeTextForFont()` used PDFDocEncoding (a metadata encoding) instead of WinAnsiEncoding, producing incorrect bytes in the 0x80–0x9F range (€, curly quotes, em dash, etc.)
2. UTF-8 round-trip corruption: `Operator.toString()` → `TextDecoder` (UTF-8) → `TextEncoder` (UTF-8) destroyed any non-ASCII byte (e.g., 0xE9 for `é` became U+FFFD)
3. Missing `/Encoding` in font dict: without an explicit `/Encoding WinAnsiEncoding` entry, PDF viewers fell back to the font's built-in StandardEncoding, mapping bytes to wrong glyphs

A related bug broke layout: `getGlyphName()` only mapped ASCII, returning `"space"` for all accented characters, breaking text measurement.
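The UTF-8 round-trip corruption can be reproduced directly with the web-standard `TextDecoder`/`TextEncoder` globals (available in Node). The byte 0xE9 is `é` in WinAnsiEncoding, but it is not valid standalone UTF-8, so decoding replaces it with U+FFFD and re-encoding emits three replacement bytes:

```typescript
// "café" as single-byte WinAnsi: 63 61 66 E9
const winAnsiBytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Decoding 0xE9 as UTF-8 fails: the default mode substitutes U+FFFD.
const asString = new TextDecoder("utf-8").decode(winAnsiBytes); // "caf\uFFFD"

// Re-encoding writes U+FFFD as the three bytes EF BF BD.
const roundTripped = new TextEncoder().encode(asString);

console.log(asString.charCodeAt(3).toString(16)); // "fffd"
console.log(roundTripped.length); // 6, not 4 — the original 0xE9 is gone
```

This is why any fix that keeps a string intermediate in the pipeline cannot work for raw WinAnsi bytes; the data is destroyed before it can be re-encoded.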
Solution
- `encodeTextForFont()` now uses WinAnsiEncoding for Helvetica/Times/Courier, SymbolEncoding for Symbol, and ZapfDingbatsEncoding for ZapfDingbats. Unencodable characters (CJK, emoji) substitute with `.notdef` (byte 0x00).
- Text is emitted as hex-format PdfStrings (e.g. `<636166E9>`): pure ASCII that's immune to any encoding transformation. Matches pdf-lib's approach.
- `appendOperators()` uses `Operator.toBytes()` directly. `createContentStream`/`appendContent`/`prependContent` accept `string | Uint8Array`, eliminating the UTF-8 round-trip.
- `/Encoding`: Standard 14 font dicts now include `/Encoding /WinAnsiEncoding` (omitted for Symbol/ZapfDingbats per PDF spec Table 5.15).
- `CHAR_TO_GLYPH` now covers all WinAnsi non-ASCII characters (~95 entries), fixing width measurement for accented text.
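A minimal sketch of the hex-PdfString approach, with an illustrative `encodeWinAnsiHex` helper and a deliberately tiny codepoint map (the real WinAnsiEncoding table covers all of 0x80–0xFF; this is not the PR's implementation):

```typescript
// Tiny illustrative subset of the WinAnsi codepoint table.
const WIN_ANSI: Record<string, number> = {
  "€": 0x80,
  "é": 0xe9,
  "ñ": 0xf1,
};

function encodeWinAnsiHex(text: string): string {
  const bytes: number[] = [];
  for (const ch of text) {
    const code = ch.codePointAt(0)!;
    if (code < 0x80) {
      bytes.push(code); // ASCII maps to itself in WinAnsi
    } else {
      bytes.push(WIN_ANSI[ch] ?? 0x00); // unencodable -> .notdef (0x00)
    }
  }
  // Hex PdfString: the output is pure ASCII, so any later
  // string/bytes transformation of the content stream is lossless.
  const hex = bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join("");
  return "<" + hex + ">";
}

console.log(encodeWinAnsiHex("café")); // <636166E9>
```

With `/Encoding /WinAnsiEncoding` declared in the font dict, a viewer maps the single byte `E9` back to the `eacute` glyph, so the hex form and the encoding entry work as a pair.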
Changes
- `src/fonts/standard-14.ts`
- `src/api/pdf-page.ts` (including the `/Encoding` font dict entries)
- `src/api/drawing/path-builder.ts`: `ContentAppender` type accepts `string | Uint8Array`
- `src/api/drawing/latin1-encoding.test.ts`

Test coverage
- `/Encoding` presence/absence verification
- Content streams contain single-byte WinAnsi codes (`<E9>`), not UTF-8 (`C3A1`)
- `.notdef` substitution for unencodable characters