[Tracking Issue] Integrate ICU4X to improve Unicode text handling in satori #743
Background
satori's current text processing stack relies on several independent libraries and runtime APIs, each with known limitations:
- `linebreak@1.1`: Unicode support stalled at Unicode 13 (2020); the package has had no npm releases in over four years and does not correctly handle emoji ZWJ sequences or complex-script line breaking (see [Proposal] Replacing `linebreak` with `icu4x` and a Possible Implementation #687, Update Emojis to include Unicode 15.0+ #621)
- `Intl.Segmenter`: relies on the JavaScript runtime implementation, which behaves inconsistently across V8, JSC, SpiderMonkey, and various Edge Runtimes, and is entirely unsupported in some environments
- `emoji-regex-xs`: requires manual updates to track new Unicode versions; ICU4X's Emoji property query can cover the same functionality while eliminating this external dependency (happy to discuss with maintainers whether a replacement is worthwhile)
- Script detection in `language.ts`: hardcoded regexes per language, which does not scale well
- Some capabilities are entirely absent (e.g., BiDi / right-to-left text layout)
ICU4X is the Unicode Consortium's next-generation internationalization library, already adopted by Firefox, Chrome, and Android. It provides a WebAssembly build that can address all of the above issues in a deterministic, runtime-agnostic way.
This issue tracks the incremental integration of ICU4X into satori via a companion icu4satori package.
Proposed Approach
The core design principle is: no breaking changes, fully backward compatible.
An optional textEngine?: TextEngine field is added to SatoriOptions, allowing ICU4X to be injected as a plugin. Users who do not pass textEngine see no behavior change whatsoever.
```tsx
import { init, createTextEngine } from 'icu4satori'
import satori from 'satori'

await init(wasmInput)
const textEngine = createTextEngine(new Uint8Array(dataBlob))

const svg = await satori(<div>สวัสดีชาวโลก</div>, {
  width: 600,
  height: 400,
  fonts: [...],
  textEngine, // opt in to ICU4X; omit to fall back to existing behavior
})
```

The ICU4X WASM binary (~96 KB) and Unicode data blob (~348 KB) are distributed as separate subpath exports within the `icu4satori` package and loaded on demand at runtime.
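As a reference point, here is a minimal sketch of what the injected `TextEngine` could look like. The interface shape and the naive stand-in implementation below are illustrative assumptions, not the actual `icu4satori` API:

```typescript
// Sketch of the TextEngine plugin interface (assumed shape).
interface TextEngine {
  // Indices at which a line break opportunity exists (UAX #14).
  getLineBreaks(text: string): number[]
  // Optional Phase 2 additions (see roadmap below).
  segmentWords?(text: string, locale?: string): string[]
  segmentGraphemes?(text: string): string[]
}

// A naive stand-in for illustration only: it reports a break
// opportunity after every ASCII space. A real ICU4X-backed engine
// would delegate to LineSegmenter instead.
const naiveEngine: TextEngine = {
  getLineBreaks(text: string): number[] {
    const breaks: number[] = []
    for (let i = 0; i < text.length; i++) {
      if (text[i] === ' ') breaks.push(i + 1)
    }
    breaks.push(text.length) // end of text is always a break
    return breaks
  },
}
```

Keeping the contract this small is what makes the opt-in non-breaking: satori only calls the engine when one is provided.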
Work Breakdown
🚧 Phase 1: Line Breaking Replacement (Work In Progress)
A draft implementation is ready — see #744
Problem: linebreak@1.1 is frozen at Unicode 13 and does not support:
- Emoji ZWJ sequences (e.g. 👨‍👩‍👧 cannot break correctly at line end)
- Word-level line breaking for Thai, Burmese, and Khmer (SA-class characters degrade to character-level breaking)
- Correct behavior of CSS `line-break: strict` / `line-break: loose` for CJK text
Proposed solution: Introduce the `icu4satori` package wrapping ICU4X `LineSegmenter` (UAX #14, Unicode 15.1+). Add an optional `textEngine?: TextEngine` field to `SatoriOptions` and route `splitByBreakOpportunities()` through `textEngine.getLineBreaks()` when provided.
Planned deliverables:
- `icu4satori` package: `init()` + `createTextEngine()` + `TextEngine` interface definition
- satori: end-to-end threading of the `textEngine?` option (`SatoriOptions` → `splitByBreakOpportunities()`)
- ICU4X `CodePointMapData8<LineBreak>` for mandatory break detection (UAX #14 LB4/LB5)
- Support for CSS `line-break` (Loose/Normal/Strict/Anywhere) and `word-break` (Normal/BreakAll/KeepAll)
- New tests covering Thai LSTM line breaking, CJK keep-all, and forced line breaks
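As a rough sketch of the routing described above (satori's real internals differ; the function shape and the fallback branch below are simplified assumptions):

```typescript
type GetLineBreaks = (text: string) => number[]

// Hypothetical routing: when an engine is injected, slice the text at
// ICU4X-reported break opportunities; otherwise keep existing behavior
// (simplified here to a keep-the-separator whitespace split).
function splitByBreakOpportunities(
  text: string,
  getLineBreaks?: GetLineBreaks
): string[] {
  if (getLineBreaks) {
    const words: string[] = []
    let start = 0
    for (const end of getLineBreaks(text)) {
      words.push(text.slice(start, end))
      start = end
    }
    return words
  }
  return text.split(/(?<=\s)/) // stand-in for the linebreak@1.1 path
}
```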
Roadmap (Blueprint)
The following phases are directional proposals. Implementation details, sequencing, and scope are open for discussion — feedback and suggestions from anyone in the community are welcome.
Phase 2: Word & Grapheme Segmentation Replacement
Problem: The `segment()` function relies on `Intl.Segmenter`, which has the following issues:
- Incomplete or inconsistent `Intl.Segmenter` support in Cloudflare Workers, Deno Deploy, Bun, and other Edge Runtimes
- Thai word boundary results differ across engines, affecting `text-overflow: ellipsis` truncation positions
- `text-transform: capitalize` depends on the accuracy of grapheme segmentation
Proposed solution:
- `textEngine.segmentWords?(text, locale)` → ICU4X `WordSegmenter` (UAX #29, LSTM model)
- `textEngine.segmentGraphemes?(text)` → ICU4X `GraphemeClusterSegmenter` (UAX #29)
Design note: `segmentWords?` and `segmentGraphemes?` are designed as optional methods on the `TextEngine` interface, sharing the same injection point as `getLineBreaks` — no new API surface is required.
Affected areas:
- `src/utils.ts`: `segment()` function
- `src/text/index.ts`: grapheme enumeration during missing-font detection
- `src/text/processor.ts`: word/grapheme splitting under `capitalize` mode
- WASM build: `WordSegmenter` and `GraphemeClusterSegmenter` symbols must be retained
- Data blob: Word/Grapheme markers need to be included (size impact to be measured)
Phase 3: Unicode Properties Replacement
Problems:
- `wordSeparators` hardcodes 8 code points (`[0x0020, 0x00a0, 0x1361, ...]`) — the set is incomplete and is missing several Unicode whitespace characters
- Emoji detection relies on `emoji-regex-xs` (requires manual updates per Unicode release) — ICU4X `CodePointSetData.loadEmoji()` can cover the same functionality and eliminate this external dependency, though whether to replace it is open for discussion
- Symbol/Math detection uses the JS regexes `\p{Symbol}` / `\p{Math}`, which depend on the engine implementation
Proposed solution (all via ICU4X Property API, zero additional WASM exports):
- `wordSeparators` → `CodePointSetData.loadWhiteSpace()`, or superseded by `WordSegmenter` boundary detection from Phase 2
- Emoji → `CodePointSetData.loadEmoji()` property query
- Symbol/Math → `GeneralCategory` enum + `CodePointSetData.loadMath()`
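For comparison, the engine-dependent JS property escapes being replaced can be written as below; ICU4X's `CodePointSetData` queries return the same classifications from pinned Unicode data, identically on every runtime (the helper names here are illustrative):

```typescript
// Current-style property checks that Phase 3 would replace with
// CodePointSetData.loadWhiteSpace() / loadEmoji() / loadMath().
// These depend on the host engine's Unicode tables.
const isWhiteSpace = (ch: string) => /^\p{White_Space}$/u.test(ch)
const isEmoji = (ch: string) => /^\p{Emoji}$/u.test(ch) // caveat: Emoji=Yes also covers '#', '*', '0'-'9'
const isMath = (ch: string) => /^\p{Math}$/u.test(ch)
```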
Phase 4: Script Detection Refactor
Problem: `detectLanguageCode()` manually maintains `\p{scx=...}` regexes for each language, currently covering 14 languages/scripts:
- Adding a new language requires manually adding a regex — this does not scale
- No priority handling for multi-script characters (e.g. a Hiragana character matched against both Japanese and Han)
- The Unicode `ScriptExtensions` property (a character can belong to multiple scripts) is not accounted for at all
Proposed solution:
- `CodePointMapData16.loadScript()` to retrieve a character's primary script
- `ScriptExtensionsSet` to obtain the full script membership for multi-script characters
- A `Script → Locale` mapping table (Han → zh/ja ambiguity resolved with `ScriptExtensions` assistance)
Benefits: Support for 200+ scripts with no manual regex maintenance; improved locale code accuracy for `loadAdditionalAsset`.
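A sketch of the `Script → Locale` table idea — the entries are illustrative assumptions, and the regex-based `detectScript` helper below is only a stand-in for `CodePointMapData16.loadScript()`:

```typescript
// Illustrative Script → Locale table (values are assumptions).
const scriptToLocale: Record<string, string> = {
  Thai: 'th',
  Hiragana: 'ja',
  Katakana: 'ja',
  Hangul: 'ko',
  Han: 'zh', // zh/ja ambiguity would be resolved via ScriptExtensions
  Arabic: 'ar',
  Devanagari: 'hi',
}

// Stand-in for CodePointMapData16.loadScript(): engine-dependent and
// limited to the scripts listed above.
function detectScript(ch: string): string | undefined {
  for (const script of Object.keys(scriptToLocale)) {
    if (new RegExp(`^\\p{Script=${script}}$`, 'u').test(ch)) return script
  }
  return undefined
}
```

The table-driven lookup is the key difference from today's per-language regexes: adding a script is one entry, not a new hand-written pattern.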
Phase 5: Case Mapping Replacement
Problem: `processTextTransform()` uses `toLocaleUpperCase(locale)` / `toLocaleLowerCase(locale)`:
- Special casing rules such as Turkish `ı ↔ I` and Greek `σ/ς` depend on the runtime Intl implementation
- `capitalize` mode currently requires word segmentation followed by per-grapheme uppercasing (`segment(word, 'grapheme')` → `grapheme[0].toLocaleUpperCase()`), which could be simplified
Proposed solution:
- `CaseMapper.lowercase(locale, text)` / `CaseMapper.uppercase(locale, text)`
- `TitlecaseMapper.titlecaseSegment(locale, text)` to implement `capitalize` directly
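The divergence-prone cases can be demonstrated directly with the runtime APIs being replaced; the `CaseMapper` calls would produce these results from pinned Unicode data instead of the host's Intl tables:

```typescript
// Behavior Phase 5 moves from the runtime to ICU4X CaseMapper:
// the Turkish dotted/dotless i round-trip...
const upperTr = 'i'.toLocaleUpperCase('tr') // 'İ' (U+0130)
const lowerTr = 'I'.toLocaleLowerCase('tr') // 'ı' (U+0131)
// ...and Greek sigma, where the last Σ needs the context-sensitive
// final-sigma mapping to 'ς' rather than 'σ'.
const lowerGreek = 'ΟΔΟΣ'.toLocaleLowerCase('el')
```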
Phase 6 (Long-term / Optional): New Capabilities
The following are capabilities that ICU4X can provide but satori currently lacks entirely — these are net-new features rather than replacements:
6.1 BiDi (Bidirectional Text) Support
- Current satori code: `Yoga.calculateLayout(..., Yoga.DIRECTION_LTR)` is hardcoded to LTR; `text/index.ts` contains a `@TODO: Support RTL languages` comment
- ICU4X provides `BidiClass` properties + the `Bidi` API (UAX #9)
- This is high-complexity work: it requires cooperation at the layout engine (Yoga) level, not just ICU4X integration
- Suggest tracking in a dedicated issue
6.2 Text Normalization
- satori currently performs no Unicode normalization; decomposed character sequences (e.g. `e + ́ = é`) may cause font glyph misses
- ICU4X `ComposingNormalizer` (NFC) can normalize text before font lookup
- Suggest introducing this only in response to actual bug reports rather than proactively
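The transformation itself can be illustrated with JavaScript's built-in `String.prototype.normalize`; the point of `ComposingNormalizer` is to get the same NFC result from pinned ICU4X data rather than from the runtime:

```typescript
// Decomposed input: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = 'e\u0301'

// NFC composes the pair into the single code point U+00E9 ('é'),
// so font glyph lookup sees one precomposed character.
const composed = decomposed.normalize('NFC')
```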
6.3 Locale Enhancement
- `normalizeLocale()` is currently implemented as a simple prefix match (e.g. `"zh"` → `"zh-CN"`)
- ICU4X `LocaleCanonicalizer` + `LocaleFallbacker` can provide standards-compliant BCP 47 handling
- Would improve locale code accuracy for `loadAdditionalAsset`
- Suggest completing this alongside Phase 4 as a complementary improvement
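The built-in `Intl.getCanonicalLocales` shows the canonicalization step involved; `LocaleCanonicalizer` / `LocaleFallbacker` would add deterministic, runtime-independent behavior and fallback chains (e.g. `zh-Hans-CN` → `zh-Hans` → `zh`) on top:

```typescript
// BCP 47 canonicalization: fix subtag casing ('ZH-hans-CN' is sloppy
// but well-formed). A prefix match like the current normalizeLocale()
// would run after this step.
const [canonical] = Intl.getCanonicalLocales('ZH-hans-CN')
```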
Priority Ordering and Rationale
| Priority | Phase | Rationale |
|---|---|---|
| P1 | Phase 1 — Line Breaking | Known bugs (#621, #687) affecting all users of Thai, Emoji, and CJK text |
| P2 | Phase 2 — Word/Grapheme | Affects Edge Runtime compatibility; optional methods already reserved in TextEngine interface, making this a natural continuation of Phase 1 |
| P3 | Phase 3 — Unicode Properties | Correctness improvements, but no known bugs in current implementation; qualifies as technical debt cleanup |
| P3 | Phase 5 — Case Mapping | Same as above |
| P4 | Phase 4 — Script Detection | Broader impact (language classification for loadAdditionalAsset), but high refactor complexity |
| Long-term | Phase 6 — BiDi / Normalization | New capabilities requiring broader architectural discussion |
| Long-term | Phase 6.3 — Locale Enhancement | Complementary to Phase 4; LocaleCanonicalizer improves locale code accuracy |
Bundle Size Impact
| Phase | WASM | Data Blob | Notes |
|---|---|---|---|
| Phase 1 | ~96 KB | ~348 KB (auto) / ~29 KB (simple) | 3 datagen markers |
| After Phase 2 | incremental, TBD | TBD (Word + Grapheme markers) | rough estimate: +50–100 KB blob; to be measured |
| Full implementation | TBD | TBD | Depends on final set of enabled features |
All sizes are fully controllable via `ld.py` export-symbol pruning plus `icu4x-datagen --markers-for-bin` minimization.