| 1 | +# 044: Fix Latin-1 / Accented Character Rendering with Standard 14 Fonts |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +Drawing text with accented characters (á, é, ñ, ö, etc.) using Standard 14 fonts like Helvetica produces corrupted output. Characters render as mojibake because the content stream pipeline round-trips bytes through UTF-8, which destroys single-byte Latin-1 values. |
| 6 | + |
| 7 | +**Root cause**: Three compounding issues in the content generation pipeline, plus a related width measurement bug: |
| 8 | + |
| 9 | +1. **Wrong text encoding**: `encodeTextForFont()` (`pdf-page.ts:2433`) uses `PdfString.fromString()` which encodes via PDFDocEncoding (a metadata encoding), not WinAnsiEncoding (the font encoding Standard 14 fonts actually use). While the byte values happen to match for U+00A0–U+00FF, they diverge in the 0x80–0x9F range (€ is 0x80 in WinAnsi but 0xA0 in PDFDocEncoding; curly quotes, em dash, etc. all differ). |
| 10 | + |
| 11 | +2. **UTF-8 round-trip corruption**: The pipeline converts `Operator` → `toString()` (UTF-8 decode via `TextDecoder`) → `appendContent(string)` → `TextEncoder.encode()` (UTF-8 encode). When a `PdfString` literal contains raw byte 0xE9 (WinAnsi `é`), the UTF-8 decode treats it as an invalid sequence and produces `U+FFFD`, destroying the original byte. |
| 12 | + |
| 13 | +3. **Missing `/Encoding` in font dict**: The Standard 14 font dictionary (`pdf-page.ts:2392-2397`) is emitted without an `/Encoding` entry, so viewers fall back to the font's built-in encoding (typically StandardEncoding for Type1), not WinAnsiEncoding. Even if bytes were correct, the wrong encoding means wrong glyphs. |
| 14 | + |
| 15 | +4. **Wrong width measurement**: `getGlyphName()` (`standard-14.ts:262`) only maps ASCII code points to glyph names. Any non-ASCII character (é, ñ, ü, etc.) falls through to return `"space"`, meaning `widthOfTextAtSize()` returns incorrect widths for accented text. This breaks text layout, line wrapping, and centering. |
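
The round-trip corruption in issue 2 can be reproduced in isolation. The following is a minimal sketch using only the standard `TextDecoder` and `TextEncoder` APIs; the byte sequence is an illustrative content-stream fragment, not output captured from the library.

```typescript
// Illustrative content-stream fragment "(caf\xE9) Tj", where 0xE9 is the
// single WinAnsi byte for "é". This is hand-written sample data.
const original = new Uint8Array([0x28, 0x63, 0x61, 0x66, 0xe9, 0x29, 0x20, 0x54, 0x6a]);

// Step 1: bytes -> string via UTF-8 decode. A lone 0xE9 is not valid UTF-8,
// so the decoder substitutes U+FFFD (the replacement character).
const asString = new TextDecoder("utf-8").decode(original);

// Step 2: string -> bytes via UTF-8 encode. U+FFFD becomes EF BF BD.
const roundTripped = new TextEncoder().encode(asString);

// Helper for printing bytes as space-separated lowercase hex.
const hex = (bytes: Uint8Array): string =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(original)); // 28 63 61 66 e9 29 20 54 6a
console.log(hex(roundTripped)); // 28 63 61 66 ef bf bd 29 20 54 6a
```

The original byte is unrecoverable after step 1, which is why the fix has to avoid the string intermediate rather than patch the output afterwards.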
| 16 | + |
| 17 | +## Goals |
| 18 | + |
| 19 | +- Accented Latin characters (á, é, ñ, ü, ß, €, etc.) render correctly with all Standard 14 fonts |
| 20 | +- Symbol and ZapfDingbats fonts work correctly with their built-in encodings |
| 21 | +- Text width measurement is correct for all WinAnsi characters |
| 22 | +- Embedded fonts (Identity-H with GIDs) continue to work unchanged |
| 23 | +- The content stream pipeline works with `Uint8Array` throughout, eliminating the UTF-8 round-trip |
| 24 | +- Unencodable characters (CJK, emoji) produce `.notdef` by default with an option to throw |
| 25 | + |
| 26 | +## Scope |
| 27 | + |
| 28 | +### In scope |
| 29 | + |
| 30 | +- Fix all four issues above |
| 31 | +- Broad bytes-first refactor of the content stream pipeline (operator serialization moves to bytes; ASCII-only string callers keep working and can migrate gradually) |
| 32 | +- Wire up WinAnsiEncoding for Standard 14 fonts (except Symbol/ZapfDingbats) |
| 33 | +- Wire up SymbolEncoding and ZapfDingbatsEncoding for those two fonts |
| 34 | +- Fix `getGlyphName()` to cover all WinAnsi non-ASCII glyph names |
| 35 | +- Add tests for accented character rendering, width measurement, and all encoding paths |
| 36 | + |
| 37 | +### Out of scope |
| 38 | + |
| 39 | +- Custom encoding differences arrays |
| 40 | +- Text extraction / parsing (already works correctly) |
| 41 | + |
| 42 | +## Design |
| 43 | + |
| 44 | +### The core insight |
| 45 | + |
| 46 | +The content stream pipeline currently uses strings as an intermediate representation between operators and bytes. This is the fundamental problem — PDF content streams are binary, and shuttling them through JavaScript strings (which are UTF-16 internally) and then through UTF-8 TextEncoder/TextDecoder corrupts any non-ASCII bytes. |
| 47 | + |
| 48 | +The fix makes the pipeline work with `Uint8Array` throughout, avoiding the string round-trip entirely. At the same time, we use `WinAnsiEncoding` (which already exists in the codebase but is only used for parsing) to properly encode text for Standard 14 fonts. |
| 49 | + |
| 50 | +### Approach: bytes-first pipeline |
| 51 | + |
| 52 | +The reporter's fix (converting string char-by-char via `charCodeAt & 0xFF`) works but is a band-aid that relies on JavaScript strings preserving Latin-1 byte values. Our approach is cleaner: |
| 53 | + |
| 54 | +**1. `encodeTextForFont()` — use proper font encoding for all Standard 14 fonts** |
| 55 | + |
| 56 | +Instead of `PdfString.fromString(text)` (PDFDocEncoding), select the correct encoding based on font name: |
| 57 | + |
| 58 | +- **Helvetica, Times, Courier families** → `WinAnsiEncoding.instance` |
| 59 | +- **Symbol** → `SymbolEncoding.instance` |
| 60 | +- **ZapfDingbats** → `ZapfDingbatsEncoding.instance` |
| 61 | + |
| 62 | +Call `encoding.encode(text)` to produce the correct byte values, then wrap in a hex-format `PdfString`. This properly handles the 0x80–0x9F range where PDFDocEncoding and WinAnsiEncoding differ. |
| 63 | + |
| 64 | +**Unencodable characters**: By default, each unencodable character is replaced with the `.notdef` glyph (byte 0x00). This matches PDF convention: the font's `.notdef` glyph typically renders as an empty box or blank space. Users who prefer a hard failure can pass an option to throw instead. Leniency by default matches the project's design principle of being tolerant, while the option gives strict users control. |
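
As a rough illustration of the encoding step, here is a self-contained sketch. It does not use the codebase's `WinAnsiEncoding` class. `encodeWinAnsi`, `toHexString`, the `strict` flag, and the partial `WIN_ANSI_SPECIALS` table are all hypothetical names invented for this example, and only a few of the 0x80–0x9F entries are listed.

```typescript
// Hypothetical, partial Unicode -> WinAnsi table for the 0x80–0x9F range.
// A real implementation would cover the whole range.
const WIN_ANSI_SPECIALS = new Map<number, number>([
  [0x20ac, 0x80], // Euro sign
  [0x2026, 0x85], // horizontal ellipsis
  [0x2018, 0x91], // left single quotation mark
  [0x2019, 0x92], // right single quotation mark
  [0x201c, 0x93], // left double quotation mark
  [0x201d, 0x94], // right double quotation mark
  [0x2013, 0x96], // en dash
  [0x2014, 0x97], // em dash
]);

// Encode text to WinAnsi bytes. Unencodable characters become 0x00 (.notdef)
// unless `strict` is set, in which case an error is thrown.
function encodeWinAnsi(text: string, strict = false): Uint8Array {
  const out: number[] = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    // Printable ASCII and U+00A0–U+00FF keep their code values in WinAnsi.
    const direct = (cp >= 0x20 && cp <= 0x7e) || (cp >= 0xa0 && cp <= 0xff);
    const byte = direct ? cp : WIN_ANSI_SPECIALS.get(cp);
    if (byte === undefined) {
      if (strict) {
        throw new Error(`Cannot encode U+${cp.toString(16).toUpperCase()} in WinAnsi`);
      }
      out.push(0x00); // lenient default: .notdef
    } else {
      out.push(byte);
    }
  }
  return Uint8Array.from(out);
}

// Format bytes as a PDF hex string, e.g. <636166E9>.
function toHexString(bytes: Uint8Array): string {
  const digits = Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0"));
  return `<${digits.join("")}>`;
}
```

With this sketch, `toHexString(encodeWinAnsi("café"))` yields `<636166E9>`, and `toHexString(encodeWinAnsi("€"))` yields `<80>`.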
| 65 | + |
| 66 | +**2. `appendContent()` / `appendOperators()` — broad bytes-first refactor** |
| 67 | + |
| 68 | +Refactor the entire content pipeline to work with `Uint8Array`: |
| 69 | + |
| 70 | +- `appendContent()` accepts `string | Uint8Array`. String inputs get `TextEncoder`'d (safe for ASCII-only callers like `drawPage` and `drawImage`). `Uint8Array` inputs pass through directly. |
| 71 | +- `appendOperators()` uses `Operator.toBytes()` directly, concatenates into a `Uint8Array`, and passes bytes to `appendContent()`. |
| 72 | +- `createContentStream()` gains a `Uint8Array` overload that skips the `TextEncoder` step. |
| 73 | +- `prependContent()` gets the same `string | Uint8Array` treatment for consistency. |
| 74 | +- `ContentAppender` type in `path-builder.ts` changes to `(content: string | Uint8Array) => void`. |
| 75 | +- `PathBuilder.emitOps()` can migrate to bytes at its own pace — it only produces ASCII content (path operators, numbers), so the string path remains safe for it. |
| 76 | + |
| 77 | +This is the principled fix: content streams are binary data, and the pipeline treats them as such. The `toString()` method on `Operator` remains for debugging/logging, but the serialization path uses `toBytes()`. |
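
The dual-type handling described above could look roughly like the following. This is a sketch under assumptions: `toContentBytes` and `concatBytes` are hypothetical helper names, and joining operators with a newline byte is an illustrative choice rather than something the plan specifies.

```typescript
type Content = string | Uint8Array;

// Normalize content to bytes. Strings are UTF-8 encoded, which is only safe
// for ASCII-only callers; Uint8Array inputs pass through untouched.
function toContentBytes(content: Content): Uint8Array {
  return typeof content === "string" ? new TextEncoder().encode(content) : content;
}

// Concatenate serialized operators into one buffer, separated by a single
// byte (newline by default), without ever going through a string.
function concatBytes(chunks: Uint8Array[], separator = 0x0a): Uint8Array {
  const payload = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Uint8Array(payload + Math.max(chunks.length - 1, 0));
  let offset = 0;
  chunks.forEach((chunk, index) => {
    if (index > 0) {
      out[offset++] = separator;
    }
    out.set(chunk, offset);
    offset += chunk.length;
  });
  return out;
}
```

In this shape, a byte such as 0xE9 inside a serialized operator reaches the stream unchanged, because no decode or encode step ever touches it.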
| 78 | + |
| 79 | +**3. `addFontResource()` — add `/Encoding` where appropriate** |
| 80 | + |
| 81 | +| Font | `/Encoding` value | Reason | |
| 82 | +| ---------------- | ----------------- | -------------------------------------------------------------------------------- | |
| 83 | +| Helvetica family | `WinAnsiEncoding` | Explicit encoding ensures correct glyph mapping | |
| 84 | +| Times family | `WinAnsiEncoding` | Same | |
| 85 | +| Courier family | `WinAnsiEncoding` | Same | |
| 86 | +| Symbol | _(omitted)_ | Uses built-in encoding; no valid `/Encoding` name exists per PDF spec Table 5.15 | |
| 87 | +| ZapfDingbats | _(omitted)_ | Same as Symbol | |
| 88 | + |
| 89 | +For Symbol and ZapfDingbats, `SymbolEncoding` / `ZapfDingbatsEncoding` are used only for Unicode → byte mapping in `encodeTextForFont()`. The font dict has no `/Encoding` entry because the PDF spec doesn't define named encodings for these fonts — their built-in encoding is implicit. |
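
The table above reduces to a small decision. The sketch below shows one possible shape, assuming the caller passes a valid Standard 14 font name. `encodingNameFor` and `buildFontDict` are hypothetical names, and the plain-object dictionary stands in for whatever dictionary type the codebase actually uses.

```typescript
// Return the /Encoding name to emit, or undefined when the entry is omitted.
// Assumes `fontName` is one of the Standard 14 font names.
function encodingNameFor(fontName: string): "WinAnsiEncoding" | undefined {
  const usesBuiltInEncoding = fontName === "Symbol" || fontName === "ZapfDingbats";
  return usesBuiltInEncoding ? undefined : "WinAnsiEncoding";
}

// Build a simplified font dictionary as name -> PDF name-token pairs.
function buildFontDict(fontName: string): Record<string, string> {
  const dict: Record<string, string> = {
    Type: "/Font",
    Subtype: "/Type1",
    BaseFont: `/${fontName}`,
  };
  const encoding = encodingNameFor(fontName);
  if (encoding !== undefined) {
    dict.Encoding = `/${encoding}`; // only for Helvetica, Times, Courier families
  }
  return dict;
}
```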
| 90 | + |
| 91 | +**4. Fix `getGlyphName()` for non-ASCII characters** |
| 92 | + |
| 93 | +Extend the `CHAR_TO_GLYPH` map in `standard-14.ts` to cover all WinAnsi non-ASCII code points. The WinAnsiEncoding table maps Unicode code points to byte values, and the glyph width tables already have entries for all these glyphs (e.g., `eacute`, `ntilde`, `Euro`, `endash`, `Adieresis`). We need to bridge the gap: given a Unicode character, look up its glyph name so we can look up its width. |
| 94 | + |
| 95 | +There are two ways to do this: (a) use `WinAnsiEncoding` to map Unicode → byte code, then use the Adobe Glyph List (already in `glyph-list.ts`) or a direct Unicode → glyph name mapping to find the glyph name; or (b) extend `CHAR_TO_GLYPH` directly with all the Latin-1 Supplement and WinAnsi 0x80–0x9F entries. This plan takes route (b) (see "Decisions made"). |
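
To make the lookup chain concrete (Unicode code point → glyph name → width), here is a minimal sketch. The maps are deliberately tiny and hypothetical: `EXTRA_CHAR_TO_GLYPH` lists only a few of the entries that would be added, and `HELVETICA_WIDTHS` holds only the two Helvetica widths quoted in the test plan of this document (`eacute` = 556, `space` = 278). The fallback to `"space"` mirrors the existing behavior described in the Problem section.

```typescript
// Hypothetical subset of the new Unicode -> glyph name entries.
const EXTRA_CHAR_TO_GLYPH = new Map<number, string>([
  [0x00e9, "eacute"],
  [0x00f1, "ntilde"],
  [0x00c4, "Adieresis"],
  [0x20ac, "Euro"],
  [0x2013, "endash"],
]);

// Hypothetical width table in 1/1000 em units; only two entries for brevity.
const HELVETICA_WIDTHS = new Map<string, number>([
  ["space", 278],
  ["eacute", 556],
]);

// Resolve a glyph name, falling back to "space" for unmapped code points.
function glyphNameFor(codePoint: number): string {
  return EXTRA_CHAR_TO_GLYPH.get(codePoint) ?? "space";
}

// Sum glyph widths and scale from 1/1000 em units to the requested size.
function widthOfText(text: string, size: number): number {
  let units = 0;
  for (const ch of text) {
    const name = glyphNameFor(ch.codePointAt(0)!);
    units += HELVETICA_WIDTHS.get(name) ?? 0;
  }
  return (units * size) / 1000;
}
```

In this sketch, `widthOfText("é", 1000)` resolves through `eacute` and returns 556 rather than the 278 that the `space` fallback would give.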
| 96 | + |
| 97 | +### Why hex format for Standard 14 text? |
| 98 | + |
| 99 | +Using hex format (`<E9>` instead of `(é)`) for Standard 14 font text strings is the most robust approach: |
| 100 | + |
| 101 | +- **Defense-in-depth**: Even though the bytes pipeline is correct, hex format is immune to any future string-based manipulation of content streams |
| 102 | +- **Precedent**: pdf-lib uses hex strings for all Standard 14 font text (see `StandardFontEmbedder.encodeText()`) |
| 103 | +- **Simpler code**: No need to worry about escaping parentheses, backslashes, or non-ASCII bytes in literal strings |
| 104 | +- **Trade-off**: Slightly larger output (2 hex chars per byte vs 1 byte in literal), but content streams are typically compressed anyway |
| 105 | + |
| 106 | +### Changes summary |
| 107 | + |
| 108 | +| File | Method/Area | Change | |
| 109 | +| --------------------------------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------- | |
| 110 | +| `src/api/pdf-page.ts` | `encodeTextForFont()` | Use WinAnsi/Symbol/ZapfDingbats encoding + hex `PdfString`; `.notdef` substitution for unencodable chars | |
| 111 | +| `src/api/pdf-page.ts` | `appendContent()` | Accept `string \| Uint8Array`; bytes pass through, strings get TextEncoder'd | |
| 112 | +| `src/api/pdf-page.ts` | `prependContent()` | Same dual-type support | |
| 113 | +| `src/api/pdf-page.ts` | `createContentStream()` | Accept `string \| Uint8Array`; skip TextEncoder for bytes | |
| 114 | +| `src/api/pdf-page.ts` | `appendOperators()` | Use `Operator.toBytes()` directly, pass `Uint8Array` | |
| 115 | +| `src/api/pdf-page.ts` | `addFontResource()` | Add `/Encoding WinAnsiEncoding` for non-Symbol/ZapfDingbats Standard 14 fonts | |
| 116 | +| `src/api/drawing/path-builder.ts` | `ContentAppender` type | Change to `(content: string \| Uint8Array) => void` | |
| 117 | +| `src/fonts/standard-14.ts` | `getGlyphName()` / `CHAR_TO_GLYPH` | Extend to cover all WinAnsi non-ASCII characters | |
| 118 | + |
| 119 | +### Desired usage |
| 120 | + |
| 121 | +From the user's perspective, nothing changes — the existing API just works: |
| 122 | + |
| 123 | +```typescript |
| 124 | +const page = pdf.addPage(); |
| 125 | + |
| 126 | +// Latin-1 accented characters work with Standard 14 fonts |
| 127 | +page.drawText("Héllo café naïve résumé", { |
| 128 | + font: "Helvetica", |
| 129 | + x: 50, |
| 130 | + y: 700, |
| 131 | + size: 14, |
| 132 | +}); |
| 133 | + |
| 134 | +// Characters in the 0x80-0x9F WinAnsi range also work |
| 135 | +page.drawText("Price: €42 — “special” edition", { |
| 136 | + font: "Times-Roman", |
| 137 | + x: 50, |
| 138 | + y: 650, |
| 139 | + size: 14, |
| 140 | +}); |
| 141 | + |
| 142 | +// Symbol and ZapfDingbats work with their own encodings |
| 143 | +page.drawText("αβγδ", { font: "Symbol", x: 50, y: 600, size: 14 }); |
| 144 | + |
| 145 | +// Unencodable characters silently become .notdef (empty box) by default |
| 146 | +page.drawText("Hello 世界", { font: "Helvetica", x: 50, y: 550, size: 14 }); |
| 147 | +// Renders: "Hello " followed by two empty boxes |
| 148 | + |
| 149 | +// Width measurement is correct for accented text |
| 150 | +const width = page.widthOfTextAtSize("café", "Helvetica", 12); |
| 151 | +// Returns correct width using eacute glyph width, not space |
| 152 | +``` |
| 153 | + |
| 154 | +## Test plan |
| 155 | + |
| 156 | +### Rendering correctness |
| 157 | + |
| 158 | +- Round-trip test: draw accented text ("café résumé naïve") with Helvetica, save, re-parse, extract text, verify it matches input |
| 159 | +- Verify hex string encoding in content stream: `é` → byte `0xE9`, not UTF-8 `0xC3 0xA9` |
| 160 | +- Test the full WinAnsi range including 0x80–0x9F characters (€, †, ‡, curly quotes, em dash, ellipsis) |
| 161 | +- Test all Standard 14 font families (Helvetica, Times, Courier) with accented text |
| 162 | + |
| 163 | +### Font dictionary |
| 164 | + |
| 165 | +- Verify Helvetica/Times/Courier font dicts contain `/Encoding /WinAnsiEncoding` |
| 166 | +- Verify Symbol font dict does **not** contain `/Encoding` |
| 167 | +- Verify ZapfDingbats font dict does **not** contain `/Encoding` |
| 168 | + |
| 169 | +### Symbol and ZapfDingbats |
| 170 | + |
| 171 | +- Verify Symbol font correctly encodes Greek letters (α → correct Symbol byte) |
| 172 | +- Verify ZapfDingbats correctly encodes decorative symbols |
| 173 | + |
| 174 | +### Encoding edge cases |
| 175 | + |
| 176 | +- Unencodable characters (CJK, emoji) produce `.notdef` byte (0x00) by default |
| 177 | +- Embedded fonts continue to work unchanged (Identity-H path with GIDs) |
| 178 | + |
| 179 | +### Width measurement |
| 180 | + |
| 181 | +- `widthOfTextAtSize("é", "Helvetica", 1000)` returns `eacute` width (556), not `space` width (278) |
| 182 | +- Width of "café" equals width of "caf" + width of "eacute" glyph |
| 183 | +- Width correct for 0x80-0x9F characters (€ = Euro glyph width) |
| 184 | + |
| 185 | +### Bytes pipeline |
| 186 | + |
| 187 | +- Verify `appendContent(Uint8Array)` passes bytes through without TextEncoder transformation |
| 188 | +- Verify `appendContent(string)` still works for ASCII content (drawImage, drawPage) |
| 189 | +- PathBuilder operations still produce correct output |
| 190 | + |
| 191 | +## Decisions made |
| 192 | + |
| 193 | +1. **Bytes pipeline scope**: Broad — all content-producing paths move to `Uint8Array`, not just `appendOperators()`. The `ContentAppender` type becomes `string | Uint8Array` to allow gradual migration of callers. |
| 194 | + |
| 195 | +2. **Hex vs literal format**: Always hex for Standard 14 text, as defense-in-depth. Even with the bytes pipeline fix, hex format provides immunity against any future string-based manipulation. |
| 196 | + |
| 197 | +3. **Unencodable characters**: Default to `.notdef` glyph substitution (byte 0x00). The font's `.notdef` glyph typically renders as an empty box or blank. This is lenient-by-default per the project's design principles. |
| 198 | + |
| 199 | +4. **Symbol and ZapfDingbats**: Wire up their proper encodings now. Use `SymbolEncoding.instance` and `ZapfDingbatsEncoding.instance` for Unicode → byte mapping. Omit `/Encoding` from the font dict (no valid named encoding exists per PDF spec Table 5.15 — the fonts use their built-in encoding implicitly). |
| 200 | + |
| 201 | +5. **Width measurement**: Fix in this plan. Extend `CHAR_TO_GLYPH` in `standard-14.ts` to cover all WinAnsi non-ASCII characters. Without this, text layout (line wrapping, centering) would be broken for accented text even if rendering is fixed. |