
[Tracking Issue] Integrate ICU4X to improve Unicode text handling in satori #743

@Vizards


Background

satori's current text processing stack relies on several independent libraries and runtime APIs, each with known limitations:

  • linebreak@1.1: Unicode support stalled at Unicode 13 (2020); the package has had no npm releases in over 4 years and does not correctly handle Emoji ZWJ sequences or complex-script line breaking (see #687, #621)
  • Intl.Segmenter: Relies on the JavaScript runtime implementation, which behaves inconsistently across V8, JSC, SpiderMonkey, and various Edge Runtimes — and is unsupported in some environments entirely
  • emoji-regex-xs: Requires manual updates to track new Unicode versions; ICU4X's Emoji property query can cover the same functionality while eliminating this external dependency — happy to discuss with maintainers whether a replacement is worthwhile
  • Script detection in language.ts: Hardcoded regexes per language, which does not scale well
  • Some capabilities are entirely absent (e.g., BiDi / right-to-left text layout)

ICU4X is the Unicode Consortium's next-generation internationalization library, already adopted by Firefox, Chrome, and Android. It provides a WebAssembly build that can address all of the above issues in a deterministic, runtime-agnostic way.

This issue tracks the incremental integration of ICU4X into satori via a companion icu4satori package.

Proposed Approach

The core design principle is: no breaking changes, fully backward compatible.

An optional textEngine?: TextEngine field is added to SatoriOptions, allowing ICU4X to be injected as a plugin. Users who do not pass textEngine see no behavior change whatsoever.

```tsx
import { init, createTextEngine } from 'icu4satori'
import satori from 'satori'

await init(wasmInput)
const textEngine = createTextEngine(new Uint8Array(dataBlob))

const svg = await satori(<div>สวัสดีชาวโลก</div>, {
  width: 600,
  height: 400,
  fonts: [...],
  textEngine, // opt in to ICU4X; omit to fall back to existing behavior
})
```

The ICU4X WASM binary (~96 KB) and Unicode data blob (~348 KB) are distributed as separate subpath exports within the icu4satori package and loaded on demand at runtime.

Work Breakdown

🚧 Phase 1: Line Breaking Replacement (Work In Progress)

A draft implementation is ready — see #744

Problem: linebreak@1.1 is frozen at Unicode 13 and does not support:

  • Emoji ZWJ sequences (e.g. 👨‍👩‍👧 cannot break correctly at line end)
  • Word-level line breaking for Thai, Burmese, and Khmer (SA-class characters degrade to character-level breaking)
  • Correct behavior of CSS line-break: strict/loose for CJK text

Proposed solution: Introduce the icu4satori package wrapping ICU4X LineSegmenter (UAX#14 v15.1+). Add an optional textEngine?: TextEngine field to SatoriOptions and route splitByBreakOpportunities() through textEngine.getLineBreaks() when provided.
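To make the injection point concrete, here is a sketch of what the `TextEngine` contract could look like. The method shape and the naive stand-in engine below are assumptions for illustration only, not the final `icu4satori` API; the real engine would wrap ICU4X's `LineSegmenter`.

```typescript
// Sketch of the proposed TextEngine plugin interface (names and signatures
// are assumptions based on this issue, not the final icu4satori API).
interface TextEngine {
  // Returns positions in `text` where a line break opportunity exists,
  // and whether each break is mandatory (UAX#14 LB4/LB5).
  getLineBreaks(text: string): { index: number; mandatory: boolean }[]
}

// Naive stand-in used only to illustrate the contract: break after spaces,
// mandatory break after '\n'. A real engine would delegate to ICU4X.
const naiveEngine: TextEngine = {
  getLineBreaks(text) {
    const breaks: { index: number; mandatory: boolean }[] = []
    for (let i = 0; i < text.length; i++) {
      if (text[i] === ' ') breaks.push({ index: i + 1, mandatory: false })
      if (text[i] === '\n') breaks.push({ index: i + 1, mandatory: true })
    }
    return breaks
  },
}
```

With this shape, `splitByBreakOpportunities()` only needs to consult `textEngine.getLineBreaks()` when the option is present and keep its current path otherwise.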

Planned deliverables:

  • icu4satori package: init() + createTextEngine() + TextEngine interface definition
  • satori: end-to-end threading of the textEngine? option (SatoriOptions → splitByBreakOpportunities())
  • ICU4X CodePointMapData8<LineBreak> for mandatory break detection (UAX#14 LB4/LB5)
  • Support for CSS line-break (Loose/Normal/Strict/Anywhere) and word-break (Normal/BreakAll/KeepAll)
  • New tests covering Thai LSTM line breaking, CJK keep-all, and forced line breaks

Roadmap (Blueprint)

The following phases are directional proposals. Implementation details, sequencing, and scope are open for discussion — feedback and suggestions from anyone in the community are welcome.

Phase 2: Word & Grapheme Segmentation Replacement

Problem: The segment() function relies on Intl.Segmenter, which has the following issues:

  • Incomplete or inconsistent Intl.Segmenter support in Cloudflare Workers, Deno Deploy, Bun, and other Edge Runtimes
  • Thai word boundary results differ across engines, affecting text-overflow: ellipsis truncation positions
  • capitalize text-transform depends on the accuracy of grapheme segmentation

Proposed solution:

  • textEngine.segmentWords?(text, locale) → ICU4X WordSegmenter (UAX#29, LSTM model)
  • textEngine.segmentGraphemes?(text) → ICU4X GraphemeClusterSegmenter (UAX#29)

Design note: segmentWords? and segmentGraphemes? are designed as optional methods on the TextEngine interface, sharing the same injection point as getLineBreaks — no new API surface is required.

Affected areas:

  • src/utils.ts: segment() function
  • src/text/index.ts: grapheme enumeration during missing-font detection
  • src/text/processor.ts: word/grapheme splitting under capitalize mode
  • WASM build: WordSegmenter and GraphemeClusterSegmenter symbols must be retained
  • Data blob: Word/Grapheme markers need to be included (size impact to be measured)

Phase 3: Unicode Properties Replacement

Problems:

  1. wordSeparators hardcodes 8 code points ([0x0020, 0x00a0, 0x1361, ...]) — the set is incomplete and missing several Unicode whitespace characters
  2. Emoji detection relies on emoji-regex-xs (requires manual updates per Unicode release) — ICU4X CodePointSetData.loadEmoji() can cover the same functionality and eliminate this external dependency, though whether to replace it is open for discussion
  3. Symbol/Math detection uses JS regex \p{Symbol} / \p{Math}, which depends on engine implementation

Proposed solution (all via ICU4X Property API, zero additional WASM exports):

  • wordSeparators → CodePointSetData.loadWhiteSpace(), or superseded by WordSegmenter boundary detection from Phase 2
  • Emoji → CodePointSetData.loadEmoji() property query
  • Symbol/Math → GeneralCategory enum + CodePointSetData.loadMath()
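The gap in the hardcoded separator list is easy to demonstrate. The snippet below uses the JS regex engine's `\p{White_Space}` query purely as a stand-in for ICU4X's `CodePointSetData.loadWhiteSpace()`; the excerpted array mirrors the current hardcoded list:

```typescript
// Excerpt of the current hardcoded wordSeparators list (see problem 1 above).
const hardcoded = new Set([0x0020, 0x00a0, 0x1361])

// Property-based check; \p{White_Space} stands in for ICU4X's
// CodePointSetData.loadWhiteSpace() in this illustration.
const isWhiteSpace = (cp: number): boolean =>
  /\p{White_Space}/u.test(String.fromCodePoint(cp))

// U+2009 THIN SPACE carries the White_Space property but is not in the
// hardcoded set — exactly the class of omission the property query fixes.
```

A property query tracks new Unicode releases through the data blob rather than through code changes.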

Phase 4: Script Detection Refactor

Problem: detectLanguageCode() manually maintains \p{scx=...} regexes for each language, currently covering 14 languages/scripts:

  • Adding a new language requires manually adding a regex — this does not scale
  • No priority handling for multi-script characters (e.g. a Hiragana character matched against both Japanese and Han)
  • The Unicode ScriptExtensions property (a character can belong to multiple scripts) is not accounted for at all

Proposed solution:

  • CodePointMapData16.loadScript() to retrieve a character's primary script
  • ScriptExtensionsSet to obtain the full script membership for multi-script characters
  • A Script → Locale mapping table (Han → zh/ja ambiguity resolved with ScriptExtensions assistance)
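A hypothetical sketch of that mapping table. Script names are Unicode Script property values; the regex-based lookup below stands in for `CodePointMapData16.loadScript()` for illustration only, and the table contents are assumptions, not a proposed final mapping:

```typescript
// Hypothetical Script → locale table of the kind Phase 4 proposes. Han is
// deliberately absent here: it is ambiguous (zh vs ja) and would need
// ScriptExtensions-assisted resolution, e.g. co-occurring Hiragana → ja.
const scriptToLocale: Record<string, string> = {
  Thai: 'th',
  Hangul: 'ko',
  Hiragana: 'ja',
  Katakana: 'ja',
  Hebrew: 'he',
  Arabic: 'ar',
}

// Stand-in for CodePointMapData16.loadScript(): query the Script property
// via the JS regex engine for this illustration.
function primaryScript(ch: string): string | undefined {
  for (const script of Object.keys(scriptToLocale)) {
    if (new RegExp(`\\p{Script=${script}}`, 'u').test(ch)) return script
  }
  return undefined
}
```

Adding a language then becomes a one-line table entry rather than a new hand-maintained regex in language.ts.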

Benefits: Support for 200+ scripts with no manual regex maintenance; improved locale code accuracy for loadAdditionalAsset.

Phase 5: Case Mapping Replacement

Problem: processTextTransform() uses toLocaleUpperCase(locale) / toLocaleLowerCase(locale):

  • Special casing rules such as Turkish ı ↔ I and Greek σ/ς depend on the runtime Intl implementation
  • capitalize mode currently requires word segmentation followed by per-grapheme uppercasing (segment(word, 'grapheme') → grapheme[0].toLocaleUpperCase()), which could be simplified

Proposed solution:

  • CaseMapper.lowercase(locale, text) / CaseMapper.uppercase(locale, text)
  • TitlecaseMapper.titlecaseSegment(locale, text) to implement capitalize directly
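For reference, the shape `capitalize` would take with per-word titlecasing. This sketch uses a plain whitespace split and the runtime's locale-aware case mapping as stand-ins; the real implementation would use Phase 2's `segmentWords()` boundaries and ICU4X's `TitlecaseMapper`:

```typescript
// Sketch of capitalize as per-word titlecasing. Whitespace splitting and
// toLocaleUpperCase() are stand-ins for UAX#29 word boundaries and ICU4X's
// TitlecaseMapper.titlecaseSegment() respectively.
function capitalize(text: string, locale?: string): string {
  return text.replace(/\S+/gu, (word) =>
    // With the /u flag, `.` matches the first code point, not a lone
    // surrogate half; real grapheme handling would come from ICU4X.
    word.replace(/^./u, (c) => c.toLocaleUpperCase(locale)),
  )
}
```

Routing this through ICU4X makes the Turkish ı ↔ I and Greek σ/ς special cases deterministic instead of runtime-dependent.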

Phase 6 (Long-term / Optional): New Capabilities

The following are capabilities that ICU4X can provide but satori currently lacks entirely — these are net-new features rather than replacements:

6.1 BiDi (Bidirectional Text) Support

  • Current satori code: Yoga.calculateLayout(..., Yoga.DIRECTION_LTR) is hardcoded to LTR; text/index.ts contains a @TODO: Support RTL languages comment
  • ICU4X provides BidiClass properties + Bidi API (UAX#9)
  • This is high-complexity work: it requires cooperation at the layout engine (Yoga) level, not just ICU4X integration
  • Suggest tracking in a dedicated issue

6.2 Text Normalization

  • satori currently performs no Unicode normalization; decomposed character sequences (e.g. e + combining acute accent → é) may cause font glyph misses
  • ICU4X ComposingNormalizer (NFC) can normalize text before font lookup
  • Suggest introducing this only in response to actual bug reports rather than proactively
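The decomposed-vs-composed distinction this phase targets can be shown with the built-in `String.prototype.normalize`, which performs the same NFC step that ICU4X's `ComposingNormalizer` would apply before font lookup:

```typescript
// 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points that a
// font lookup keyed on single code points may miss.
const decomposed = 'e\u0301'

// NFC composes the pair into the single code point U+00E9 'é'.
const composed = decomposed.normalize('NFC')
```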

6.3 Locale Enhancement

  • normalizeLocale() is currently implemented as a simple prefix match (e.g. "zh" → "zh-CN")
  • ICU4X LocaleCanonicalizer + LocaleFallbacker can provide standards-compliant BCP 47 handling
  • Would improve locale code accuracy for loadAdditionalAsset
  • Suggest completing this alongside Phase 4 as a complementary improvement
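As a rough analogue of what LocaleCanonicalizer + LocaleFallbacker would add, the built-in `Intl.Locale` already illustrates the difference from a prefix match: canonicalization is spec-mandated, and `maximize()` adds likely subtags on runtimes with full ICU data:

```typescript
// Spec-mandated BCP 47 canonicalization: case is normalized regardless of
// the runtime's ICU data.
const canonical = new Intl.Locale('ZH-cn').toString() // 'zh-CN'

// Likely-subtag expansion ("zh" → "zh-Hans-CN" on runtimes with full ICU
// data) — the standards-compliant version of satori's current prefix match.
const maximized = new Intl.Locale('zh').maximize()
```

ICU4X would make this behavior deterministic rather than dependent on how much ICU data the runtime ships.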

Priority Ordering and Rationale

| Priority | Phase | Rationale |
| --- | --- | --- |
| P1 | Phase 1 — Line Breaking | Known bugs (#621, #687) affecting all users of Thai, Emoji, and CJK text |
| P2 | Phase 2 — Word/Grapheme | Affects Edge Runtime compatibility; optional methods already reserved in TextEngine interface, making this a natural continuation of Phase 1 |
| P3 | Phase 3 — Unicode Properties | Correctness improvements, but no known bugs in current implementation; qualifies as technical debt cleanup |
| P3 | Phase 5 — Case Mapping | Same as above |
| P4 | Phase 4 — Script Detection | Broader impact (language classification for loadAdditionalAsset), but high refactor complexity |
| Long-term | Phase 6 — BiDi / Normalization | New capabilities requiring broader architectural discussion |
| Long-term | Phase 6.3 — Locale Enhancement | Complementary to Phase 4; LocaleCanonicalizer improves locale code accuracy |

Bundle Size Impact

| Phase | WASM | Data Blob | Notes |
| --- | --- | --- | --- |
| Phase 1 | ~96 KB | ~348 KB (auto) / ~29 KB (simple) | 3 datagen markers |
| After Phase 2 | incremental, TBD | TBD (Word + Grapheme markers) | rough estimate: +50–100 KB blob; to be measured |
| Full implementation | TBD | TBD | Depends on final set of enabled features |

All sizes are fully controllable via ld.py export symbol pruning + icu4x-datagen --markers-for-bin minimization.
