Skip to content

feat(icu): add optional TextEngine plugin interface backed by ICU4X (Phase 1 — line breaking)#744

Open
Vizards wants to merge 5 commits intovercel:mainfrom
Vizards:feat/icu4x-text-engine
Open

feat(icu): add optional TextEngine plugin interface backed by ICU4X (Phase 1 — line breaking)#744
Vizards wants to merge 5 commits intovercel:mainfrom
Vizards:feat/icu4x-text-engine

Conversation

@Vizards
Copy link
Copy Markdown

@Vizards Vizards commented Mar 30, 2026

Summary

This PR introduces an optional TextEngine plugin interface to satori's text processing pipeline, along with a companion icu4satori package that implements it using ICU4X.

The motivation and full roadmap are described in the tracking issue: #743

Key design principle: zero breaking changes. The textEngine option is entirely opt-in. Any user who does not pass it sees no behavior change — the existing linebreak code path is untouched.

Changes

satori core (7 files)

File Change
src/satori.ts Add TextEngine interface definition; add optional textEngine? field to SatoriOptions
src/layout.ts Thread textEngine? through LayoutContext
src/text/index.ts Extract textEngine from context and pass it to preprocess()
src/text/processor.ts Add lineBreak and textEngine parameters to preprocess() / processWordBreak()
src/utils.ts Add ICU4X code path to splitByBreakOpportunities(); original linebreak path is unchanged when textEngine is not provided
src/handler/expand.ts Add lineBreak field to MainStyle
src/handler/inheritable.ts Add lineBreak to the set of inheritable CSS properties (fixing missing CSS inheritance behavior)

icu4satori package (new, packages/icu4satori/)

  • src/index.ts: exports init(), createTextEngine(), and the TextEngine interface
  • src/line-break.ts: implements getLineBreaks() using ICU4X LineSegmenter + CodePointMapData8<LineBreak>, with support for wordBreak, lineBreak, and locale options
  • assets/icu_capi.wasm (~96 KB): custom-built ICU4X WASM with link-time symbol pruning, retaining only what line breaking requires
  • assets/data.postcard (~348 KB): Unicode data blob with LSTM model, enabling word-level line breaking for SA-class scripts (Thai, Burmese, Khmer)
  • assets/data_simple.postcard (~29 KB): simplified data blob without LSTM, suitable for server-side Node.js environments

See packages/icu4satori/README.md for build details.

playground (6 files)

File Change
playground/package.json Add icu4satori: workspace:* dependency
playground/utils/twemoji.ts Add notoColor emoji provider (pointing to the Google Noto Emoji GitHub main branch); the existing providers (twemoji, fluent, fluentFlat, etc.) no longer cover Unicode 17, causing U16/U17 emoji in the §10 Emoji tab to fail to render
playground/pages/index.tsx Add ICU4X toggle checkbox, loadTextEngine() loading logic, and Noto font support
playground/public/icu_capi.wasm ICU4X WASM binary for the browser-side playground
playground/public/data.postcard Unicode data blob for the browser-side playground
playground/cards/playground-data.ts Add 15 §-prefixed tabs covering all affected CSS properties and script types

Usage example

import { init, createTextEngine } from 'icu4satori'

// Load WASM and data blob once at startup
await init(wasmBuffer)
const textEngine = createTextEngine(new Uint8Array(dataBlob))

const svg = await satori(<div>สวัสดีชาวโลก</div>, {
  width: 600,
  height: 400,
  fonts: [...],
  textEngine, // omit to fall back to the existing behavior
})

Testing

15 interactive playground tabs (§1.1 through §12) are included, covering:

  • Thai/SA-class script LSTM word breaking (§1.1)
  • word-break: break-all / keep-all / break-word (§1.2–§1.4)
  • line-break: loose / normal / strict / anywhere (§4)
  • Emoji ZWJ sequences, Unicode 15–17 skin tone modifiers (§10)
  • NBSP, empty string, pure whitespace edge cases (§11, §12)

Each tab has an ICU4X TextEngine toggle checkbox so you can flip between the two engines and compare output side by side.

There are no unit tests yet. I wasn't sure what the right approach would be — see Open Questions below.

Open Questions

1. Testing strategy

The existing tests in test/ are screenshot-based and require fonts and a full render pipeline. Loading ICU4X WASM asynchronously in that environment needs some setup I wasn't sure how to wire up correctly.

I can go a few directions — add a dedicated screenshot test suite for the textEngine path in CI, or unit-test splitByBreakOpportunities() with a mocked TextEngine to keep things simpler. Happy to follow whatever approach you prefer, or to add tests however you'd like me to structure them.

2. Should the § playground tabs be kept?

These tabs serve as living documentation for the differences between ICU4X and the default engine. That said, they're noise for anyone not using ICU4X, and I'm not sure they belong in the main playground long-term. I'm happy to move them to a separate section, put them behind a flag, or remove them entirely — just let me know what feels right.

3. Should the WASM and data files be committed to the repo?

Background: the intended deployment pattern for icu4satori is to load the WASM and data blob from a CDN URL at runtime (e.g. init('https://cdn.example.com/icu_capi.wasm')), rather than bundling them. The playground puts these files in public/ and loads them via window.location.origin to faithfully reproduce that CDN loading path in the browser — so what reviewers see in the playground matches real Edge Function behavior.

Currently both playground/public/icu_capi.wasm (~96 KB) and playground/public/data.postcard (~348 KB) are committed directly to the repo. I'm open to downloading them via a postinstall script instead, or referencing them from icu4satori's subpath exports to avoid duplication — whatever you think fits better.

4. lineBreak inheritance fix

I added lineBreak to the inheritable CSS properties in src/handler/inheritable.ts. This is the correct CSS behavior (line-break is an inherited property per spec), but it's technically a behavior change — edge cases where a parent has line-break set and a child was previously not inheriting it could be affected. Should this be called out separately in the changelog, or is it fine to include as-is?

5. satori/full entry point

You mentioned satori/full in #687 as a possible home for heavier dependencies. I went with a separate icu4satori package instead, mainly because the data blob needs to be loaded on demand at runtime (not bundled) and users need to be able to choose between the LSTM and simple variants. That said, if a satori/full entry point better fits the long-term plan I'm happy to revisit the packaging — let me know what you'd prefer.

6. Where should icu4satori live?

Right now it's a subpackage inside the satori monorepo (packages/icu4satori/). I noticed that @shuding/opentype.js — a similar satori-adjacent package — is maintained as a standalone repo rather than a monorepo subpackage. I wasn't sure which approach you'd prefer here: keeping it in-tree makes atomic commits and CI easier, but a standalone repo gives it an independent release cadence. Happy to move it out if that's a better fit.

Not included in this PR

  • README update: deferred pending resolution of Open Questions 1, 5, and 6 — the final API surface, entry point, and package name may change. Happy to add documentation once those are settled.
  • Phase 2–6: word/grapheme segmentation, Unicode property APIs, script detection, case mapping, BiDi — to be addressed in follow-up PRs as described in the tracking issue.

@Vizards Vizards requested a review from shuding as a code owner March 30, 2026 09:16
Copilot AI review requested due to automatic review settings March 30, 2026 09:16
@vercel
Copy link
Copy Markdown
Contributor

vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
satori-playground Ready Ready Preview, Comment, Open in v0 Mar 30, 2026 10:18am

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in TextEngine plugin interface to satori’s text pipeline and introduces a new icu4satori package implementing ICU4X-backed line breaking, plus a playground toggle/demo to compare engines.

Changes:

  • Introduce TextEngine interface and thread textEngine? through layout/text processing to splitByBreakOpportunities().
  • Add ICU4X-backed breaking path when textEngine is provided (fallback behavior unchanged when omitted).
  • Add new packages/icu4satori/ package and update playground to load/enable it.

Reviewed changes

Copilot reviewed 23 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/utils.ts Adds textEngine/lineBreak/locale plumbing and ICU-driven break splitting path.
src/text/processor.ts Threads textEngine + lineBreak through preprocessing/word-break processing.
src/text/index.ts Pulls textEngine from LayoutContext and passes to preprocess().
src/satori.ts Defines/exports TextEngine and adds textEngine? to SatoriOptions; threads into context.
src/layout.ts Adds textEngine? to LayoutContext and threads through recursive layout calls.
src/handler/inheritable.ts Makes lineBreak inherit like in CSS.
src/handler/expand.ts Adds lineBreak to MainStyle typing surface.
pnpm-workspace.yaml Expands workspace to include packages/*.
pnpm-lock.yaml Updates lockfile for new workspace package and dependency graph changes.
playground/utils/twemoji.ts Adds notoColor emoji provider URL builder.
playground/pages/index.tsx Adds ICU4X loading + toggle; adds Noto font imports; updates emoji handling.
playground/package.json Adds icu4satori workspace dependency.
playground/decs.d.ts Removes declare module 'satori' shim.
playground/cards/playground-data.ts Adds extensive §-prefixed comparison/demo tabs.
packages/icu4satori/tsdown.config.ts Adds tsdown build configuration for the new package.
packages/icu4satori/tsconfig.json Adds strict TS config for the new package build/types.
packages/icu4satori/src/index.ts Implements ICU4X-backed TextEngine (init, createTextEngine, getLineBreaks).
packages/icu4satori/src/diplomat-wasm.ts Deferred WASM loader/shim to avoid top-level await/env detection issues.
packages/icu4satori/package.json Defines new package exports, build scripts, and published files.
packages/icu4satori/build/lib.rs Rust stub crate for resolving icu_capi via cargo metadata.
packages/icu4satori/build/ld.py Custom wasm-ld export filtering script to shrink the WASM binary.
packages/icu4satori/build/Makefile Build system to generate vendor bindings and compile pruned WASM.
packages/icu4satori/build/Cargo.toml Rust dependency config for icu_capi build.
packages/icu4satori/build/Cargo.lock Rust lockfile for reproducible WASM builds.
packages/icu4satori/README.md Documents package rationale, usage, exports, and maintainer build steps.
packages/icu4satori/.gitignore Ignores generated build outputs (vendor/target/dist/docs).
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +24 to +27
await init(new URL('icu4satori/wasm', import.meta.url))

// 2. Load the Unicode data blob and create the text engine
const data = await fetch(new URL('icu4satori/data', import.meta.url))
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usage example resolves icu4satori/wasm and icu4satori/data via new URL('icu4satori/wasm', import.meta.url) / new URL('icu4satori/data', import.meta.url), but new URL() only resolves relative to the current file URL and won’t resolve a package export specifier. This will produce an incorrect URL for consumers. Prefer showing a pattern that actually resolves package exports (e.g. import.meta.resolve('icu4satori/wasm') where available, or a bundler-specific ?url import) or document that users should host these assets and pass a URL/string to init() + fetch().

Suggested change
await init(new URL('icu4satori/wasm', import.meta.url))
// 2. Load the Unicode data blob and create the text engine
const data = await fetch(new URL('icu4satori/data', import.meta.url))
//
// You must pass a URL/path that your runtime can actually fetch.
// For example, if your bundler or app serves the WASM file at
// `/static/icu4satori.wasm`:
await init('/static/icu4satori.wasm')
// 2. Load the Unicode data blob and create the text engine
// Likewise, point this at where you serve the `.postcard` data:
const data = await fetch('/static/icu4satori-data.postcard')

Copilot uses AI. Check for mistakes.
Comment on lines +383 to +411
// TextEngine path: use ICU4X for all word-break modes
if (textEngine) {
const breaks = textEngine.getLineBreaks(content, {
wordBreak: (wordBreak === 'break-word' ? 'normal' : wordBreak) as
| 'normal'
| 'break-all'
| 'keep-all',
lineBreak: lineBreak as
| 'normal'
| 'loose'
| 'strict'
| 'anywhere'
| undefined,
locale,
})

const words: string[] = []
const requiredBreaks: boolean[] = [false]
let last = 0

for (const brk of breaks) {
if (brk.position === 0) continue // skip ICU4X start-of-text sentinel
words.push(content.slice(last, brk.position))
requiredBreaks.push(brk.required)
last = brk.position
}

return { words, requiredBreaks }
}
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new textEngine branch in splitByBreakOpportunities() is a distinct code path (ICU-driven segmentation + requiredBreaks semantics) but there are no tests asserting its behavior. Since the repo already has coverage around text wrapping/word-break, please add at least one focused test (e.g. unit-test this function with a mocked TextEngine, or a small screenshot test) to lock in that (1) the engine is invoked when provided and (2) mandatory breaks (e.g. \n) propagate through requiredBreaks for the ICU path.

Copilot uses AI. Check for mistakes.
Comment on lines 119 to +124
function processWordBreak(
content,
wordBreak: string
wordBreak: string,
lineBreak: string,
locale?: Locale,
textEngine?: TextEngine
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

processWordBreak declares lineBreak as a required string, but SerializedStyle is Partial<...> so style.lineBreak can be undefined. This is passed through from preprocess() and will either fail type-checking under strict or lead to undefined at runtime. Make lineBreak optional (or provide a default like 'normal') consistently across preprocess()processWordBreak()splitByBreakOpportunities().

Copilot uses AI. Check for mistakes.
Comment on lines +124 to +141
return {
getLineBreaks(text, options = {}) {
const locale = options.locale
? Locale.fromString(options.locale)
: Locale.fromString('en')

const lbOptions = new LineBreakOptions({
strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null,
wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null,
})

const segmenter = LineSegmenter.autoWithOptionsAndProvider(
provider,
locale,
lbOptions
)

const iter = segmenter.segment(text)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getLineBreaks() constructs a new LineBreakOptions, Locale, and LineSegmenter on every call. In satori this function may be invoked many times per render, so repeated segmenter construction can become a hotspot. Consider caching/memoizing the segmenter by {locale, wordBreak, lineBreak} (and reusing a single Locale/LineBreakOptions instance per key) so segmenter.segment(text) is the only per-call work.

Suggested change
return {
getLineBreaks(text, options = {}) {
const locale = options.locale
? Locale.fromString(options.locale)
: Locale.fromString('en')
const lbOptions = new LineBreakOptions({
strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null,
wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null,
})
const segmenter = LineSegmenter.autoWithOptionsAndProvider(
provider,
locale,
lbOptions
)
const iter = segmenter.segment(text)
// Cache LineSegmenter instances by effective { locale, lineBreak, wordBreak }
const segmenterCache = new Map<
string,
{
segmenter: LineSegmenter
}
>()
return {
getLineBreaks(text, options = {}) {
// Derive effective option values (including defaults) for caching
const localeStr = options.locale ? options.locale : 'en'
const lineBreakOpt = options.lineBreak ?? ''
const wordBreakOpt = options.wordBreak ?? ''
const cacheKey = `${localeStr}::${lineBreakOpt}::${wordBreakOpt}`
let cached = segmenterCache.get(cacheKey)
if (!cached) {
const locale = Locale.fromString(localeStr)
const lbOptions = new LineBreakOptions({
strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null,
wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null,
})
const segmenter = LineSegmenter.autoWithOptionsAndProvider(
provider,
locale,
lbOptions
)
cached = { segmenter }
segmenterCache.set(cacheKey, cached)
}
const iter = cached.segmenter.segment(text)

Copilot uses AI. Check for mistakes.
@socket-security
Copy link
Copy Markdown

socket-security bot commented Mar 30, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Addedcargo/​icu_capi@​2.1.272100100100100
Addednpm/​tsdown@​0.21.6981008896100
Addednpm/​typescript@​5.9.31001009010090

View full report

- Commit dist/ bundle (vendor/ is gitignored; Rust toolchain not
  available on Vercel or standard CI runners, so bundle artifact
  must be checked in, same rationale as wasm/)
- Add dist exception to root .gitignore
- Add prebuild script to playground so pnpm build triggers
  icu4satori build before next build
@Vizards
Copy link
Copy Markdown
Author

Vizards commented Mar 30, 2026

dist is committed because the build requires the diplomat-generated vendor/ bindings, which in turn require a Rust toolchain that hosted CI environments (Vercel, standard GitHub Actions) don't provide.

A proper CI workflow (installing Rust to regenerate both) can be added as a follow-up. But it depends on the question: Where should icu4satori live

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants