Commit 13bd3f4

Fix accented character rendering (á, é, ñ, €, etc.) with Standard 14 fonts (#9)
* feat(fonts): add WinAnsi/Symbol/ZapfDingbats encoding helpers for Standard 14 fonts
  - Add getEncodingForStandard14() to select the correct encoding per font
  - Add isWinAnsiStandard14() to distinguish Symbol/ZapfDingbats
  - Extend CHAR_TO_GLYPH map with all WinAnsi non-ASCII characters (0x80-0x9F and 0xA0-0xFF ranges), fixing width measurement for accented text like é, ñ, ü, €, etc.

* fix(encoding): fix Latin-1/accented character corruption with Standard 14 fonts

  Three compounding bugs caused accented characters (á, é, ñ, ö, €, etc.) to render as mojibake with Standard 14 fonts like Helvetica:

  1. Wrong text encoding: used PDFDocEncoding instead of WinAnsiEncoding
  2. UTF-8 round-trip corruption: Operator.toString() → TextDecoder (UTF-8) destroyed non-ASCII bytes when re-encoded via TextEncoder
  3. Missing /Encoding in font dict: viewers fell back to StandardEncoding

  Fix:
  - encodeTextForFont() now uses WinAnsiEncoding (or SymbolEncoding/ZapfDingbatsEncoding) with hex-format PdfString output
  - Unencodable characters substitute with .notdef (byte 0x00)
  - appendOperators() uses Operator.toBytes() directly, bypassing the string intermediate that caused UTF-8 corruption
  - createContentStream/appendContent/prependContent accept string | Uint8Array for the broad bytes-first pipeline refactor
  - addFontResource() adds /Encoding WinAnsiEncoding for Helvetica/Times/Courier families (omitted for Symbol/ZapfDingbats per spec)
  - ContentAppender type updated to string | Uint8Array

* test(encoding): add tests for Latin-1/WinAnsi encoding with Standard 14 fonts

  29 tests covering:
  - Font encoding selection (WinAnsi vs Symbol vs ZapfDingbats)
  - Glyph name mapping for accented/non-ASCII characters
  - Width measurement correctness for accented text
  - Font dict /Encoding verification
  - Hex string encoding in content streams
  - Unencodable character .notdef substitution
  - Round-trip PDF generation with all font families
  - Bytes pipeline backward compatibility (shapes, paths, images)

* fix: update test
1 parent 1295b95 commit 13bd3f4

File tree: 5 files changed, +914 −18 lines changed
# 044: Fix Latin-1 / Accented Character Rendering with Standard 14 Fonts

## Problem

Drawing text with accented characters (á, é, ñ, ö, etc.) using Standard 14 fonts like Helvetica produces corrupted output. Characters render as mojibake because the content stream pipeline round-trips bytes through UTF-8, which destroys single-byte Latin-1 values.
**Root cause**: Three compounding issues in the content generation pipeline, plus a related width measurement bug:

1. **Wrong text encoding**: `encodeTextForFont()` (`pdf-page.ts:2433`) uses `PdfString.fromString()` which encodes via PDFDocEncoding (a metadata encoding), not WinAnsiEncoding (the font encoding Standard 14 fonts actually use). While the byte values happen to match for U+00A0–U+00FF, they diverge in the 0x80–0x9F range (€ is 0x80 in WinAnsi but 0xA0 in PDFDocEncoding; curly quotes, em dash, etc. all differ).

2. **UTF-8 round-trip corruption**: The pipeline converts `Operator` → `toString()` (UTF-8 decode via `TextDecoder`) → `appendContent(string)` → `TextEncoder.encode()` (UTF-8 encode). When a `PdfString` literal contains raw byte 0xE9 (WinAnsi `é`), the UTF-8 decode treats it as an invalid sequence and produces `U+FFFD`, destroying the original byte.

3. **Missing `/Encoding` in font dict**: The Standard 14 font dictionary (`pdf-page.ts:2392-2397`) is emitted without an `/Encoding` entry, so viewers fall back to the font's built-in encoding (typically StandardEncoding for Type1), not WinAnsiEncoding. Even if bytes were correct, the wrong encoding means wrong glyphs.

4. **Wrong width measurement**: `getGlyphName()` (`standard-14.ts:262`) only maps ASCII code points to glyph names. Any non-ASCII character (é, ñ, ü, etc.) falls through to return `"space"`, meaning `widthOfTextAtSize()` returns incorrect widths for accented text. This breaks text layout, line wrapping, and centering.
## Goals

- Accented Latin characters (á, é, ñ, ü, ß, €, etc.) render correctly with all Standard 14 fonts
- Symbol and ZapfDingbats fonts work correctly with their built-in encodings
- Text width measurement is correct for all WinAnsi characters
- Embedded fonts (Identity-H with GIDs) continue to work unchanged
- The content stream pipeline works with `Uint8Array` throughout, eliminating the UTF-8 round-trip
- Unencodable characters (CJK, emoji) produce `.notdef` by default with an option to throw
## Scope

### In scope

- Fix all four issues above
- Broad bytes-first refactor of the content stream pipeline (all callers move to bytes)
- Wire up WinAnsiEncoding for Standard 14 fonts (except Symbol/ZapfDingbats)
- Wire up SymbolEncoding and ZapfDingbatsEncoding for those two fonts
- Fix `getGlyphName()` to cover all WinAnsi non-ASCII glyph names
- Add tests for accented character rendering, width measurement, and all encoding paths

### Out of scope

- Custom encoding differences arrays
- Text extraction / parsing (already works correctly)
## Design

### The core insight

The content stream pipeline currently uses strings as an intermediate representation between operators and bytes. This is the fundamental problem — PDF content streams are binary, and shuttling them through JavaScript strings (which are UTF-16 internally) and then through UTF-8 TextEncoder/TextDecoder corrupts any non-ASCII bytes.

The fix makes the pipeline work with `Uint8Array` throughout, avoiding the string round-trip entirely. At the same time, we use `WinAnsiEncoding` (which already exists in the codebase but is only used for parsing) to properly encode text for Standard 14 fonts.
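The corruption is easy to reproduce outside the library with nothing but the platform's `TextDecoder`/`TextEncoder` (a standalone sketch, no library code involved):

```typescript
// A raw WinAnsi byte 0xE9 ("é") does not survive a UTF-8 decode/encode cycle:
// the decoder maps the invalid byte to U+FFFD, and re-encoding emits the
// three-byte UTF-8 form of U+FFFD instead of the original byte.
const original = new Uint8Array([0x28, 0xe9, 0x29]); // "(é)" as a PDF literal string in WinAnsi

const asString = new TextDecoder("utf-8").decode(original);
const roundTripped = new TextEncoder().encode(asString);

console.log(asString.charCodeAt(1).toString(16)); // "fffd": the é byte was replaced
console.log(roundTripped); // Uint8Array [0x28, 0xEF, 0xBF, 0xBD, 0x29], not [0x28, 0xE9, 0x29]
```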
### Approach: bytes-first pipeline

The reporter's fix (converting string char-by-char via `charCodeAt & 0xFF`) works but is a band-aid that relies on JavaScript strings preserving Latin-1 byte values. Our approach is cleaner:

**1. `encodeTextForFont()` — use proper font encoding for all Standard 14 fonts**

Instead of `PdfString.fromString(text)` (PDFDocEncoding), select the correct encoding based on font name:

- **Helvetica, Times, Courier families** → `WinAnsiEncoding.instance`
- **Symbol** → `SymbolEncoding.instance`
- **ZapfDingbats** → `ZapfDingbatsEncoding.instance`

Call `encoding.encode(text)` to produce the correct byte values, then wrap in a hex-format `PdfString`. This properly handles the 0x80–0x9F range where PDFDocEncoding and WinAnsiEncoding differ.

**Unencodable characters**: By default, substitute with the `.notdef` glyph (byte 0x00). This matches PDF convention — the font's `.notdef` glyph typically renders as an empty box or blank space. Users who prefer a hard failure can pass an option to throw instead (see API below). The rationale: leniency by default matches the project's design principle of being tolerant, while the option gives strict users control.
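A minimal sketch of the selection-and-substitution logic. The override table below is a small excerpt of WinAnsi, and the helper names are illustrative, not the library's actual identifiers:

```typescript
// Excerpt of the WinAnsi 0x80-0x9F extras; the full table has ~27 entries.
const WIN_ANSI_OVERRIDES: Record<number, number> = {
  0x20ac: 0x80, // €
  0x201c: 0x93, // left curly quote
  0x201d: 0x94, // right curly quote
  0x2014: 0x97, // em dash
};

function winAnsiByte(codePoint: number): number | undefined {
  if (codePoint <= 0x7f) return codePoint; // ASCII maps 1:1
  if (codePoint in WIN_ANSI_OVERRIDES) return WIN_ANSI_OVERRIDES[codePoint];
  if (codePoint >= 0xa0 && codePoint <= 0xff) return codePoint; // Latin-1 range matches WinAnsi
  return undefined; // unencodable in this excerpt
}

function encodeWinAnsi(text: string): Uint8Array {
  const bytes: number[] = [];
  for (const ch of text) {
    // Unencodable characters substitute .notdef (byte 0x00), the default policy
    bytes.push(winAnsiByte(ch.codePointAt(0)!) ?? 0x00);
  }
  return new Uint8Array(bytes);
}

console.log(encodeWinAnsi("café €")); // [0x63, 0x61, 0x66, 0xE9, 0x20, 0x80]
```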
**2. `appendContent()` / `appendOperators()` — broad bytes-first refactor**

Refactor the entire content pipeline to work with `Uint8Array`:

- `appendContent()` accepts `string | Uint8Array`. String inputs get `TextEncoder`'d (safe for ASCII-only callers like `drawPage` and `drawImage`). `Uint8Array` inputs pass through directly.
- `appendOperators()` uses `Operator.toBytes()` directly, concatenates into a `Uint8Array`, and passes bytes to `appendContent()`.
- `createContentStream()` gains a `Uint8Array` overload that skips the `TextEncoder` step.
- `prependContent()` gets the same `string | Uint8Array` treatment for consistency.
- `ContentAppender` type in `path-builder.ts` changes to `(content: string | Uint8Array) => void`.
- `PathBuilder.emitOps()` can migrate to bytes at its own pace — it only produces ASCII content (path operators, numbers), so the string path remains safe for it.

This is the principled fix: content streams are binary data, and the pipeline treats them as such. The `toString()` method on `Operator` remains for debugging/logging, but the serialization path uses `toBytes()`.
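The dual-type contract can be sketched in a few lines (the names mirror the plan, but the implementation here is illustrative):

```typescript
// ASCII-only callers (drawPage, drawImage, PathBuilder) may keep passing
// strings; byte producers (appendOperators) pass Uint8Array straight through.
type ContentAppender = (content: string | Uint8Array) => void;

const chunks: Uint8Array[] = [];

const appendContent: ContentAppender = (content) => {
  chunks.push(
    typeof content === "string" ? new TextEncoder().encode(content) : content,
  );
};

appendContent("0 0 100 100 re f\n"); // ASCII path operators: the string path is safe
appendContent(new Uint8Array([0x28, 0xe9, 0x29])); // raw WinAnsi bytes survive untouched

console.log(chunks[1][1].toString(16)); // "e9"
```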
**3. `addFontResource()` — add `/Encoding` where appropriate**

| Font             | `/Encoding` value | Reason                                                                            |
| ---------------- | ----------------- | --------------------------------------------------------------------------------- |
| Helvetica family | `WinAnsiEncoding` | Explicit encoding ensures correct glyph mapping                                   |
| Times family     | `WinAnsiEncoding` | Same                                                                              |
| Courier family   | `WinAnsiEncoding` | Same                                                                              |
| Symbol           | _(omitted)_       | Uses built-in encoding; no valid `/Encoding` name exists per PDF spec Table 5.15  |
| ZapfDingbats     | _(omitted)_       | Same as Symbol                                                                    |

For Symbol and ZapfDingbats, `SymbolEncoding` / `ZapfDingbatsEncoding` are used only for Unicode → byte mapping in `encodeTextForFont()`. The font dict has no `/Encoding` entry because the PDF spec doesn't define named encodings for these fonts — their built-in encoding is implicit.
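The conditional entry reduces to a small predicate (the function name here is hypothetical, not the library's API):

```typescript
// Symbol and ZapfDingbats get no /Encoding entry, so viewers use their
// built-in encodings; the WinAnsi families get an explicit name.
function encodingEntryFor(standard14Name: string): string | undefined {
  const family = standard14Name.split("-")[0]; // e.g. "Helvetica-Bold" -> "Helvetica"
  if (family === "Symbol" || family === "ZapfDingbats") return undefined;
  return "WinAnsiEncoding"; // Helvetica, Times, Courier families
}

console.log(encodingEntryFor("Helvetica-Bold")); // "WinAnsiEncoding"
console.log(encodingEntryFor("Symbol")); // undefined
```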
**4. Fix `getGlyphName()` for non-ASCII characters**

Extend the `CHAR_TO_GLYPH` map in `standard-14.ts` to cover all WinAnsi non-ASCII code points. The WinAnsiEncoding table maps Unicode code points to byte values, and the glyph width tables already have entries for all these glyphs (e.g., `eacute`, `ntilde`, `Euro`, `endash`, `Adieresis`). We need to bridge the gap: given a Unicode character, look up its glyph name so we can look up its width.

The approach: use `WinAnsiEncoding` to map Unicode → byte code, then use the Adobe Glyph List (already in `glyph-list.ts`) or a direct Unicode → glyph name mapping to find the glyph name. Alternatively, extend `CHAR_TO_GLYPH` with all the Latin-1 supplement and WinAnsi 0x80-0x9F entries directly.
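An illustrative excerpt of that bridge (glyph names follow the Adobe Glyph List; the map and helper here are small stand-ins for the real tables):

```typescript
// Excerpt of a Unicode -> AFM glyph-name map; the real CHAR_TO_GLYPH covers
// every WinAnsi code point (Latin-1 supplement plus the 0x80-0x9F extras).
const CHAR_TO_GLYPH_EXT: Record<string, string> = {
  "é": "eacute",
  "ñ": "ntilde",
  "ü": "udieresis",
  "Ä": "Adieresis",
  "€": "Euro",
  "–": "endash",
};

function glyphNameFor(ch: string): string {
  // ASCII handling elided to a stub; letters and digits use their own name
  if (ch.charCodeAt(0) < 0x80) return ch === " " ? "space" : ch;
  // With the extended map, accented characters no longer fall through to "space"
  return CHAR_TO_GLYPH_EXT[ch] ?? "space";
}

console.log(glyphNameFor("é")); // "eacute"
```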
### Why hex format for Standard 14 text?

Using hex format (`<E9>` instead of `(é)`) for Standard 14 font text strings is the most robust approach:

- **Defense-in-depth**: Even though the bytes pipeline is correct, hex format is immune to any future string-based manipulation of content streams
- **Precedent**: pdf-lib uses hex strings for all Standard 14 font text (see `StandardFontEmbedder.encodeText()`)
- **Simpler code**: No need to worry about escaping parentheses, backslashes, or non-ASCII bytes in literal strings
- **Trade-off**: Slightly larger output (2 hex chars per byte vs 1 byte in literal), but content streams are typically compressed anyway
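Hex-string serialization itself is a few lines (a sketch; the real `PdfString` class handles this internally):

```typescript
// Wrap encoded bytes as a hex-format PDF string object: <E9> instead of (é).
function toHexPdfString(bytes: Uint8Array): string {
  const hex = Array.from(bytes, (b) =>
    b.toString(16).padStart(2, "0").toUpperCase(),
  ).join("");
  return `<${hex}>`;
}

console.log(toHexPdfString(new Uint8Array([0xe9]))); // "<E9>"
console.log(toHexPdfString(new Uint8Array([0x28, 0x5c, 0x29]))); // "<285C29>": no escaping needed
```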
### Changes summary

| File                              | Method/Area                        | Change                                                                                                   |
| --------------------------------- | ---------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `src/api/pdf-page.ts`             | `encodeTextForFont()`              | Use WinAnsi/Symbol/ZapfDingbats encoding + hex `PdfString`; `.notdef` substitution for unencodable chars |
| `src/api/pdf-page.ts`             | `appendContent()`                  | Accept `string \| Uint8Array`; bytes pass through, strings get TextEncoder'd                             |
| `src/api/pdf-page.ts`             | `prependContent()`                 | Same dual-type support                                                                                   |
| `src/api/pdf-page.ts`             | `createContentStream()`            | Accept `string \| Uint8Array`; skip TextEncoder for bytes                                                |
| `src/api/pdf-page.ts`             | `appendOperators()`                | Use `Operator.toBytes()` directly, pass `Uint8Array`                                                     |
| `src/api/pdf-page.ts`             | `addFontResource()`                | Add `/Encoding WinAnsiEncoding` for non-Symbol/ZapfDingbats Standard 14 fonts                            |
| `src/api/drawing/path-builder.ts` | `ContentAppender` type             | Change to `(content: string \| Uint8Array) => void`                                                      |
| `src/fonts/standard-14.ts`        | `getGlyphName()` / `CHAR_TO_GLYPH` | Extend to cover all WinAnsi non-ASCII characters                                                         |
### Desired usage

From the user's perspective, nothing changes — the existing API just works:

```typescript
const page = pdf.addPage();

// Latin-1 accented characters work with Standard 14 fonts
page.drawText("Héllo café naïve résumé", {
  font: "Helvetica",
  x: 50,
  y: 700,
  size: 14,
});

// Characters in the 0x80-0x9F WinAnsi range also work
page.drawText("Price: €42 — “special” edition", {
  font: "Times-Roman",
  x: 50,
  y: 650,
  size: 14,
});

// Symbol and ZapfDingbats work with their own encodings
page.drawText("αβγδ", { font: "Symbol", x: 50, y: 600, size: 14 });

// Unencodable characters silently become .notdef (empty box) by default
page.drawText("Hello 世界", { font: "Helvetica", x: 50, y: 550, size: 14 });
// Renders: "Hello " followed by two empty boxes

// Width measurement is correct for accented text
const width = page.widthOfTextAtSize("café", "Helvetica", 12);
// Returns correct width using eacute glyph width, not space
```
## Test plan

### Rendering correctness

- Round-trip test: draw accented text ("café résumé naïve") with Helvetica, save, re-parse, extract text, verify it matches input
- Verify hex string encoding in content stream: `é` → byte `0xE9`, not UTF-8 `0xC3 0xA9`
- Test the full WinAnsi range including 0x80–0x9F characters (€, †, ‡, curly quotes, em dash, ellipsis)
- Test all Standard 14 font families (Helvetica, Times, Courier) with accented text
### Font dictionary

- Verify Helvetica/Times/Courier font dicts contain `/Encoding /WinAnsiEncoding`
- Verify Symbol font dict does **not** contain `/Encoding`
- Verify ZapfDingbats font dict does **not** contain `/Encoding`

### Symbol and ZapfDingbats

- Verify Symbol font correctly encodes Greek letters (α → correct Symbol byte)
- Verify ZapfDingbats correctly encodes decorative symbols

### Encoding edge cases

- Unencodable characters (CJK, emoji) produce `.notdef` byte (0x00) by default
- Embedded fonts continue to work unchanged (Identity-H path with GIDs)
### Width measurement

- `widthOfTextAtSize("é", "Helvetica", 1000)` returns `eacute` width (556), not `space` width (278)
- Width of "café" equals width of "caf" + width of "eacute" glyph
- Width correct for 0x80-0x9F characters (€ = Euro glyph width)
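As a concrete check of the arithmetic, assuming the standard Helvetica AFM advance widths (per 1000 units: c = 500, a = 556, f = 278, eacute = 556, the same as `e`):

```typescript
// widthOfTextAtSize sums per-glyph AFM widths and scales by size / 1000.
const helveticaWidths: Record<string, number> = {
  c: 500,
  a: 556,
  f: 278,
  eacute: 556, // same advance as "e"
};

const glyphs = ["c", "a", "f", "eacute"]; // glyph names for "café"
const size = 12;
const width =
  (glyphs.reduce((sum, g) => sum + helveticaWidths[g], 0) * size) / 1000;

console.log(width); // 22.68
```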
### Bytes pipeline

- Verify `appendContent(Uint8Array)` passes bytes through without TextEncoder transformation
- Verify `appendContent(string)` still works for ASCII content (drawImage, drawPage)
- PathBuilder operations still produce correct output
## Decisions made

1. **Bytes pipeline scope**: Broad — all content-producing paths move to `Uint8Array`, not just `appendOperators()`. The `ContentAppender` type becomes `string | Uint8Array` to allow gradual migration of callers.

2. **Hex vs literal format**: Always hex for Standard 14 text, as defense-in-depth. Even with the bytes pipeline fix, hex format provides immunity against any future string-based manipulation.

3. **Unencodable characters**: Default to `.notdef` glyph substitution (byte 0x00). The font's `.notdef` glyph typically renders as an empty box or blank. This is lenient-by-default per the project's design principles.

4. **Symbol and ZapfDingbats**: Wire up their proper encodings now. Use `SymbolEncoding.instance` and `ZapfDingbatsEncoding.instance` for Unicode → byte mapping. Omit `/Encoding` from the font dict (no valid named encoding exists per PDF spec Table 5.15 — the fonts use their built-in encoding implicitly).

5. **Width measurement**: Fix in this plan. Extend `CHAR_TO_GLYPH` in `standard-14.ts` to cover all WinAnsi non-ASCII characters. Without this, text layout (line wrapping, centering) would be broken for accented text even if rendering is fixed.
