feat(icu): add optional TextEngine plugin interface backed by ICU4X (Phase 1 — line breaking)#744
feat(icu): add optional TextEngine plugin interface backed by ICU4X (Phase 1 — line breaking)#744Vizards wants to merge 5 commits intovercel:mainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
Adds an opt-in TextEngine plugin interface to satori’s text pipeline and introduces a new icu4satori package implementing ICU4X-backed line breaking, plus a playground toggle/demo to compare engines.
Changes:
- Introduce
TextEngineinterface and threadtextEngine?through layout/text processing tosplitByBreakOpportunities(). - Add ICU4X-backed breaking path when
textEngineis provided (fallback behavior unchanged when omitted). - Add new
packages/icu4satori/package and update playground to load/enable it.
Reviewed changes
Copilot reviewed 23 out of 30 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/utils.ts | Adds textEngine/lineBreak/locale plumbing and ICU-driven break splitting path. |
| src/text/processor.ts | Threads textEngine + lineBreak through preprocessing/word-break processing. |
| src/text/index.ts | Pulls textEngine from LayoutContext and passes to preprocess(). |
| src/satori.ts | Defines/exports TextEngine and adds textEngine? to SatoriOptions; threads into context. |
| src/layout.ts | Adds textEngine? to LayoutContext and threads through recursive layout calls. |
| src/handler/inheritable.ts | Makes lineBreak inherit like in CSS. |
| src/handler/expand.ts | Adds lineBreak to MainStyle typing surface. |
| pnpm-workspace.yaml | Expands workspace to include packages/*. |
| pnpm-lock.yaml | Updates lockfile for new workspace package and dependency graph changes. |
| playground/utils/twemoji.ts | Adds notoColor emoji provider URL builder. |
| playground/pages/index.tsx | Adds ICU4X loading + toggle; adds Noto font imports; updates emoji handling. |
| playground/package.json | Adds icu4satori workspace dependency. |
| playground/decs.d.ts | Removes declare module 'satori' shim. |
| playground/cards/playground-data.ts | Adds extensive §-prefixed comparison/demo tabs. |
| packages/icu4satori/tsdown.config.ts | Adds tsdown build configuration for the new package. |
| packages/icu4satori/tsconfig.json | Adds strict TS config for the new package build/types. |
| packages/icu4satori/src/index.ts | Implements ICU4X-backed TextEngine (init, createTextEngine, getLineBreaks). |
| packages/icu4satori/src/diplomat-wasm.ts | Deferred WASM loader/shim to avoid top-level await/env detection issues. |
| packages/icu4satori/package.json | Defines new package exports, build scripts, and published files. |
| packages/icu4satori/build/lib.rs | Rust stub crate for resolving icu_capi via cargo metadata. |
| packages/icu4satori/build/ld.py | Custom wasm-ld export filtering script to shrink the WASM binary. |
| packages/icu4satori/build/Makefile | Build system to generate vendor bindings and compile pruned WASM. |
| packages/icu4satori/build/Cargo.toml | Rust dependency config for icu_capi build. |
| packages/icu4satori/build/Cargo.lock | Rust lockfile for reproducible WASM builds. |
| packages/icu4satori/README.md | Documents package rationale, usage, exports, and maintainer build steps. |
| packages/icu4satori/.gitignore | Ignores generated build outputs (vendor/target/dist/docs). |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| await init(new URL('icu4satori/wasm', import.meta.url)) | ||
|
|
||
| // 2. Load the Unicode data blob and create the text engine | ||
| const data = await fetch(new URL('icu4satori/data', import.meta.url)) |
There was a problem hiding this comment.
The usage example resolves icu4satori/wasm and icu4satori/data via new URL('icu4satori/wasm', import.meta.url) / new URL('icu4satori/data', import.meta.url), but new URL() only resolves relative to the current file URL and won’t resolve a package export specifier. This will produce an incorrect URL for consumers. Prefer showing a pattern that actually resolves package exports (e.g. import.meta.resolve('icu4satori/wasm') where available, or a bundler-specific ?url import) or document that users should host these assets and pass a URL/string to init() + fetch().
| await init(new URL('icu4satori/wasm', import.meta.url)) | |
| // 2. Load the Unicode data blob and create the text engine | |
| const data = await fetch(new URL('icu4satori/data', import.meta.url)) | |
| // | |
| // You must pass a URL/path that your runtime can actually fetch. | |
| // For example, if your bundler or app serves the WASM file at | |
| // `/static/icu4satori.wasm`: | |
| await init('/static/icu4satori.wasm') | |
| // 2. Load the Unicode data blob and create the text engine | |
| // Likewise, point this at where you serve the `.postcard` data: | |
| const data = await fetch('/static/icu4satori-data.postcard') |
| // TextEngine path: use ICU4X for all word-break modes | ||
| if (textEngine) { | ||
| const breaks = textEngine.getLineBreaks(content, { | ||
| wordBreak: (wordBreak === 'break-word' ? 'normal' : wordBreak) as | ||
| | 'normal' | ||
| | 'break-all' | ||
| | 'keep-all', | ||
| lineBreak: lineBreak as | ||
| | 'normal' | ||
| | 'loose' | ||
| | 'strict' | ||
| | 'anywhere' | ||
| | undefined, | ||
| locale, | ||
| }) | ||
|
|
||
| const words: string[] = [] | ||
| const requiredBreaks: boolean[] = [false] | ||
| let last = 0 | ||
|
|
||
| for (const brk of breaks) { | ||
| if (brk.position === 0) continue // skip ICU4X start-of-text sentinel | ||
| words.push(content.slice(last, brk.position)) | ||
| requiredBreaks.push(brk.required) | ||
| last = brk.position | ||
| } | ||
|
|
||
| return { words, requiredBreaks } | ||
| } |
There was a problem hiding this comment.
The new textEngine branch in splitByBreakOpportunities() is a distinct code path (ICU-driven segmentation + requiredBreaks semantics) but there are no tests asserting its behavior. Since the repo already has coverage around text wrapping/word-break, please add at least one focused test (e.g. unit-test this function with a mocked TextEngine, or a small screenshot test) to lock in that (1) the engine is invoked when provided and (2) mandatory breaks (e.g. \n) propagate through requiredBreaks for the ICU path.
| function processWordBreak( | ||
| content, | ||
| wordBreak: string | ||
| wordBreak: string, | ||
| lineBreak: string, | ||
| locale?: Locale, | ||
| textEngine?: TextEngine |
There was a problem hiding this comment.
processWordBreak declares lineBreak as a required string, but SerializedStyle is Partial<...> so style.lineBreak can be undefined. This is passed through from preprocess() and will either fail type-checking under strict or lead to undefined at runtime. Make lineBreak optional (or provide a default like 'normal') consistently across preprocess() → processWordBreak() → splitByBreakOpportunities().
| return { | ||
| getLineBreaks(text, options = {}) { | ||
| const locale = options.locale | ||
| ? Locale.fromString(options.locale) | ||
| : Locale.fromString('en') | ||
|
|
||
| const lbOptions = new LineBreakOptions({ | ||
| strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null, | ||
| wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null, | ||
| }) | ||
|
|
||
| const segmenter = LineSegmenter.autoWithOptionsAndProvider( | ||
| provider, | ||
| locale, | ||
| lbOptions | ||
| ) | ||
|
|
||
| const iter = segmenter.segment(text) |
There was a problem hiding this comment.
getLineBreaks() constructs a new LineBreakOptions, Locale, and LineSegmenter on every call. In satori this function may be invoked many times per render, so repeated segmenter construction can become a hotspot. Consider caching/memoizing the segmenter by {locale, wordBreak, lineBreak} (and reusing a single Locale/LineBreakOptions instance per key) so segmenter.segment(text) is the only per-call work.
| return { | |
| getLineBreaks(text, options = {}) { | |
| const locale = options.locale | |
| ? Locale.fromString(options.locale) | |
| : Locale.fromString('en') | |
| const lbOptions = new LineBreakOptions({ | |
| strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null, | |
| wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null, | |
| }) | |
| const segmenter = LineSegmenter.autoWithOptionsAndProvider( | |
| provider, | |
| locale, | |
| lbOptions | |
| ) | |
| const iter = segmenter.segment(text) | |
| // Cache LineSegmenter instances by effective { locale, lineBreak, wordBreak } | |
| const segmenterCache = new Map< | |
| string, | |
| { | |
| segmenter: LineSegmenter | |
| } | |
| >() | |
| return { | |
| getLineBreaks(text, options = {}) { | |
| // Derive effective option values (including defaults) for caching | |
| const localeStr = options.locale ? options.locale : 'en' | |
| const lineBreakOpt = options.lineBreak ?? '' | |
| const wordBreakOpt = options.wordBreak ?? '' | |
| const cacheKey = `${localeStr}::${lineBreakOpt}::${wordBreakOpt}` | |
| let cached = segmenterCache.get(cacheKey) | |
| if (!cached) { | |
| const locale = Locale.fromString(localeStr) | |
| const lbOptions = new LineBreakOptions({ | |
| strictness: options.lineBreak ? mapLineBreak(options.lineBreak) : null, | |
| wordOption: options.wordBreak ? mapWordBreak(options.wordBreak) : null, | |
| }) | |
| const segmenter = LineSegmenter.autoWithOptionsAndProvider( | |
| provider, | |
| locale, | |
| lbOptions | |
| ) | |
| cached = { segmenter } | |
| segmenterCache.set(cacheKey, cached) | |
| } | |
| const iter = cached.segmenter.segment(text) |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
- Commit dist/ bundle (vendor/ is gitignored; Rust toolchain not available on Vercel or standard CI runners, so bundle artifact must be checked in, same rationale as wasm/) - Add dist exception to root .gitignore - Add prebuild script to playground so pnpm build triggers icu4satori build before next build
|
A proper CI workflow (installing Rust to regenerate both) can be added as a follow-up. But it depends on the question: Where should icu4satori live |
Summary
This PR introduces an optional
TextEngineplugin interface to satori's text processing pipeline, along with a companionicu4satoripackage that implements it using ICU4X.The motivation and full roadmap are described in the tracking issue: #743
Key design principle: zero breaking changes. The
textEngineoption is entirely opt-in. Any user who does not pass it sees no behavior change — the existinglinebreakcode path is untouched.Changes
satori core (7 files)
src/satori.tsTextEngineinterface definition; add optionaltextEngine?field toSatoriOptionssrc/layout.tstextEngine?throughLayoutContextsrc/text/index.tstextEnginefromcontextand pass it topreprocess()src/text/processor.tslineBreakandtextEngineparameters topreprocess()/processWordBreak()src/utils.tssplitByBreakOpportunities(); originallinebreakpath is unchanged whentextEngineis not providedsrc/handler/expand.tslineBreakfield toMainStylesrc/handler/inheritable.tslineBreakto the set of inheritable CSS properties (fixing missing CSS inheritance behavior)icu4satori package (new,
packages/icu4satori/)src/index.ts: exportsinit(),createTextEngine(), and theTextEngineinterfacesrc/line-break.ts: implementsgetLineBreaks()using ICU4XLineSegmenter+CodePointMapData8<LineBreak>, with support forwordBreak,lineBreak, andlocaleoptionsassets/icu_capi.wasm(~96 KB): custom-built ICU4X WASM with link-time symbol pruning, retaining only what line breaking requiresassets/data.postcard(~348 KB): Unicode data blob with LSTM model, enabling word-level line breaking for SA-class scripts (Thai, Burmese, Khmer)assets/data_simple.postcard(~29 KB): simplified data blob without LSTM, suitable for server-side Node.js environmentsSee
packages/icu4satori/README.mdfor build details.playground (6 files)
playground/package.jsonicu4satori: workspace:*dependencyplayground/utils/twemoji.tsnotoColoremoji provider (pointing to the Google Noto Emoji GitHub main branch); the existing providers (twemoji, fluent, fluentFlat, etc.) no longer cover Unicode 17, causing U16/U17 emoji in the§10 Emojitab to fail to renderplayground/pages/index.tsxloadTextEngine()loading logic, and Noto font supportplayground/public/icu_capi.wasmplayground/public/data.postcardplayground/cards/playground-data.ts§-prefixed tabs covering all affected CSS properties and script typesUsage example
Testing
15 interactive playground tabs (
§1.1through§12) are included, covering:§1.1)word-break: break-all / keep-all / break-word(§1.2–§1.4)line-break: loose / normal / strict / anywhere(§4)§10)§11,§12)Each tab has an ICU4X TextEngine toggle checkbox so you can flip between the two engines and compare output side by side.
There are no unit tests yet. I wasn't sure what the right approach would be — see Open Questions below.
Open Questions
1. Testing strategy
The existing tests in
test/are screenshot-based and require fonts and a full render pipeline. Loading ICU4X WASM asynchronously in that environment needs some setup I wasn't sure how to wire up correctly.I can go a few directions — add a dedicated screenshot test suite for the
textEnginepath in CI, or unit-testsplitByBreakOpportunities()with a mockedTextEngineto keep things simpler. Happy to follow whatever approach you prefer, or to add tests however you'd like me to structure them.2. Should the
§playground tabs be kept?These tabs serve as living documentation for the differences between ICU4X and the default engine. That said, they're noise for anyone not using ICU4X, and I'm not sure they belong in the main playground long-term. I'm happy to move them to a separate section, put them behind a flag, or remove them entirely — just let me know what feels right.
3. Should the WASM and data files be committed to the repo?
Background: the intended deployment pattern for
icu4satoriis to load the WASM and data blob from a CDN URL at runtime (e.g.init('https://cdn.example.com/icu_capi.wasm')), rather than bundling them. The playground puts these files inpublic/and loads them viawindow.location.originto faithfully reproduce that CDN loading path in the browser — so what reviewers see in the playground matches real Edge Function behavior.Currently both
playground/public/icu_capi.wasm(~96 KB) andplayground/public/data.postcard(~348 KB) are committed directly to the repo. I'm open to downloading them via a postinstall script instead, or referencing them fromicu4satori's subpath exports to avoid duplication — whatever you think fits better.4.
lineBreakinheritance fixI added
lineBreakto the inheritable CSS properties insrc/handler/inheritable.ts. This is the correct CSS behavior (line-breakis an inherited property per spec), but it's technically a behavior change — edge cases where a parent hasline-breakset and a child was previously not inheriting it could be affected. Should this be called out separately in the changelog, or is it fine to include as-is?5.
satori/fullentry pointYou mentioned
satori/fullin #687 as a possible home for heavier dependencies. I went with a separateicu4satoripackage instead, mainly because the data blob needs to be loaded on demand at runtime (not bundled) and users need to be able to choose between the LSTM and simple variants. That said, if asatori/fullentry point better fits the long-term plan I'm happy to revisit the packaging — let me know what you'd prefer.6. Where should
icu4satorilive?Right now it's a subpackage inside the satori monorepo (
packages/icu4satori/). I noticed that@shuding/opentype.js— a similar satori-adjacent package — is maintained as a standalone repo rather than a monorepo subpackage. I wasn't sure which approach you'd prefer here: keeping it in-tree makes atomic commits and CI easier, but a standalone repo gives it an independent release cadence. Happy to move it out if that's a better fit.Not included in this PR