Releases: cometkim/unicode-segmenter
[email protected]
[email protected]
Patch Changes
- 41a7920: Inlining more Hangul ranges (Hangul Jamo Extended-B) to reduce index memory usage (8.5KB -> 7.4KB)
Slightly improved the bundle size as well.
[email protected]
Patch Changes
- 65c38ce: Move the GB9c rule check to after the main boundary check, to avoid unnecessary work as much as possible. No noticeable behavior changes, but perf seems to improve by ~2% in most cases.
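The reordering idea above can be illustrated with a minimal, runnable sketch. All names and rule shapes here are hypothetical, not unicode-segmenter's actual internals; the point is only the ordering: let the cheap table-driven rules decide first, and consult the rarer GB9c (Indic conjunct) logic only when they don't.

```javascript
// Stub main rules: return true/false when decisive, undefined otherwise.
// (Only two of the real GB rules are stubbed here, for illustration.)
function mainBoundaryRule(prevCat, nextCat) {
  if (prevCat === 'CR' && nextCat === 'LF') return false; // GB3: keep CR LF together
  if (prevCat === 'Control') return true;                 // GB4: break after controls
  return undefined; // not decided by the main table
}

// Stub GB9c check: a consonant-linker-consonant sequence suppresses the boundary.
function gb9cSuppressesBoundary(sawConsonantLinker, nextCat) {
  return sawConsonantLinker && nextCat === 'InCB_Consonant';
}

function isBoundary(prevCat, nextCat, sawConsonantLinker) {
  const main = mainBoundaryRule(prevCat, nextCat);
  if (main !== undefined) return main; // common path: GB9c never runs
  return !gb9cSuppressesBoundary(sawConsonantLinker, nextCat);
}
```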
- 8b23df9: Two further optimizations:
  1. Remove already-inlined ranges from the data file.
  2. Add an inlined range: 0xAC00-0xD7A3 (Hangul syllables) can easily be inlined.

  Number 1 is something I forgot in the #104 task, but it was only a slight change.

  Number 2, by the way, is a huge finding: it is a very large range to be newly inlined.

  Applying both optimizations significantly reduced the bundle size and memory footprint:
  - Size (min): 12,549 bytes -> 6,846 bytes (-45.5%)
  - Size (min+gz): 5,314 bytes -> 3,449 bytes (-35.1%)
  - Index memory usage: 14,272 bytes -> 8,686 bytes (-39.2%)

  Of course, with no perf regression.
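A rough sketch of the new inlined range, with hypothetical names (this is not the library's actual code): because Hangul syllables occupy one huge contiguous block (U+AC00..U+D7A3), a single comparison can answer the lookup before the data table is ever touched. The LV/LVT split inside that block is a standard Unicode arithmetic property, so no table entry is needed at all.

```javascript
// Hypothetical category lookup with the Hangul syllable block inlined.
function graphemeCategory(cp, searchTable) {
  if (cp >= 0xac00 && cp <= 0xd7a3) {
    // Hangul syllables are category LV or LVT; syllables without a
    // trailing jamo are exactly those where (cp - 0xAC00) % 28 === 0.
    return (cp - 0xac00) % 28 === 0 ? 'LV' : 'LVT';
  }
  return searchTable(cp); // fall back to the compressed index
}
```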
[email protected]
Patch Changes
- b7a6e12: Optimizing grapheme break category lookup for better runtime trade-offs. See the issue for the explanation.

  With this change, the library's constant memory footprint is reduced from 64 KB to 14 KB without performance regressions. However, the code size increases slightly due to inlining; it's still relatively small.
[email protected]
Patch Changes
- ac96013: Removed inefficient optimization code from the grapheme segmenter. The single-range cache is barely hit once the entire BMP table hits, so it was removed to reduce code size and comparison count.

  It's worth occupying 64 KB of linear memory for the BMP table. That should definitely be acceptable, since it still uses less heap memory than executing graphemer's uncompressed code.
[email protected]
Minor Changes
-
cbd1a07: Deprecated
unicode-segmenter/utilsentry.Never used internally anymore. It's too simple, better to inline if needed.
Patch Changes
- dbca35f: Improve runtime perf of the Unicode text processing.

  By using a precomputed lookup table for the grapheme categories of BMP characters, it improves perf by more than 10% for common cases, and up to ~30% for some extreme cases. The lookup table consumes an additional 64 KB of memory, which is acceptable for most JavaScript runtime environments.

  This optimization was introduced by OpenCode w/ OpenAI's GPT-OSS-120B. It is the second successful attempt at meaningful optimization in this library. (The first one was Claude Code w/ Claude Opus 4.0.)

- 782290b: Several minor perf improvements and internal cleanup. Even with the new optimization paths, the bundle size has barely increased.
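The dbca35f lookup-table idea above can be sketched as follows. Function and variable names are illustrative, not unicode-segmenter's API: pay 64 KB once for a `Uint8Array` indexed directly by BMP code points, so the hot path becomes a single array read instead of a search.

```javascript
// Build the BMP table once at startup from [start, end, category] ranges.
function buildBmpTable(ranges) {
  const table = new Uint8Array(0x10000); // 65,536 entries = 64 KB
  for (const [start, end, cat] of ranges) {
    for (let cp = start; cp <= Math.min(end, 0xffff); cp++) table[cp] = cat;
  }
  return table;
}

// Hot path: direct indexing for BMP code points, fallback search otherwise.
function categoryOf(cp, table, searchSupplementary) {
  return cp <= 0xffff ? table[cp] : searchSupplementary(cp);
}
```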
[email protected]
Patch Changes
- f2018ed: Optimize the grapheme segmenter. By eliminating unnecessary string concatenation, it significantly improves performance when creating large segments (e.g. Demonic, Hindi, Flags, Skin tones). Also reduced the memory footprint of the internal segment buffer.

- fa9d58e: Optimize grapheme cluster boundary checking.
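The f2018ed concatenation fix can be illustrated with a simplified, hypothetical segmenter (not the library's actual code): instead of growing a segment string with `+=` per code unit, remember the segment's start index and take a single `slice` when the boundary is found.

```javascript
// Before: repeated concatenation builds many intermediate strings,
// which is costly for long segments.
function segmentByConcat(text, isBoundaryAt) {
  const out = [];
  let seg = '';
  for (let i = 0; i < text.length; i++) {
    if (i > 0 && isBoundaryAt(i)) { out.push(seg); seg = ''; }
    seg += text[i];
  }
  if (seg) out.push(seg);
  return out;
}

// After: one slice per segment, no intermediate strings.
function segmentBySlice(text, isBoundaryAt) {
  const out = [];
  let start = 0;
  for (let i = 1; i < text.length; i++) {
    if (isBoundaryAt(i)) { out.push(text.slice(start, i)); start = i; }
  }
  if (start < text.length) out.push(text.slice(start));
  return out;
}
```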
[email protected]
Patch Changes
- 88a22e2: grapheme: improve runtime perf by ~9% for most common use cases
[email protected]
Minor Changes
- 75492dc: Expose an internal state, `_hd`: the first code point of a segment, whose bounds often need to be checked.

  For example:

  ```ts
  for (const { segment } of graphemeSegments(text)) {
    const cp = segment.codePointAt(0)!; // Also needs a `!` assertion in TypeScript.
    if (isBMP(cp)) {
      // ...
    }
  }
  ```

  The `segment.codePointAt(0)` call can be replaced by the `_hd` state, with no additional overhead.
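A runnable sketch of the `_hd` usage pattern. The `graphemeSegments` and `isBMP` below are stubs standing in for unicode-segmenter's real exports (the stub naively yields one segment per code point); only the way `_hd` replaces `segment.codePointAt(0)` matters here.

```javascript
// Stub iterator exposing `_hd` like the real one does in this release.
function* graphemeSegments(text) {
  for (const ch of text) {
    yield { segment: ch, _hd: ch.codePointAt(0) };
  }
}
const isBMP = (cp) => cp <= 0xffff;

function countBmpSegments(text) {
  let n = 0;
  for (const { _hd } of graphemeSegments(text)) {
    // `_hd` replaces `segment.codePointAt(0)` (and the TS `!` assertion).
    if (isBMP(_hd)) n++;
  }
  return n;
}
```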
Patch Changes
- cd63858: Export bundled entries (`/bundle/*.js`)
[email protected]
Minor Changes
- 21cd789: Removed deprecated APIs:
  - `searchGrapheme` in `unicode-segmenter/grapheme`
  - `takeChar` and `takeCodePoint` in `unicode-segmenter/utils`

  These were used internally before, but never from outside.
- 483d258: Reduced the bundle size, while keeping the best perf.

  Some details:
  - Refactored to share the same code path internally as much as possible.
  - Removed the pre-computed jump table; its optimization was compensated for by other perf improvements.
  - The previous array layout meant to avoid accidental de-opts turned out to be overkill. A regular tuple array is well optimized, so I fell back to using good old plain binary search.
  - Experimented with a new encoding and an Eytzinger layout for more aggressive improvements, but without success.
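The "good old plain binary search" over a regular tuple array can be sketched like this (names and the `[start, end, category]` tuple shape are illustrative assumptions, not the library's actual layout):

```javascript
// Binary search over sorted, non-overlapping [start, end, category] ranges.
function searchCategory(ranges, cp, notFound = 0) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, cat] = ranges[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return cat; // cp falls inside ranges[mid]
  }
  return notFound; // cp is not covered by any range
}
```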