Releases: cometkim/unicode-segmenter
[email protected]
[email protected]
Patch Changes
- 41a7920: Inlining more Hangul ranges (Hangul Jamo Extended-B) to reduce index memory usage (8.5KB -> 7.4KB)
Slightly improved the bundle size as well.
[email protected]
Patch Changes
- 65c38ce: Move the GB9c rule check to after the main boundary check, to avoid unnecessary work as much as possible. No noticeable behavior changes, but perf seems to improve by ~2% in most cases.
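The reordering idea above can be illustrated with a minimal, runnable sketch. All names and rule shapes here are hypothetical, not unicode-segmenter's actual internals; the point is only the ordering: let the cheap table-driven rules decide first, and consult the rarer GB9c (Indic conjunct) logic only when they don't.

```javascript
// Stub main rules: return true/false when decisive, undefined otherwise.
// (Only two of the real GB rules are stubbed here, for illustration.)
function mainBoundaryRule(prevCat, nextCat) {
  if (prevCat === 'CR' && nextCat === 'LF') return false; // GB3: keep CR LF together
  if (prevCat === 'Control') return true;                 // GB4: break after controls
  return undefined; // not decided by the main table
}

// Stub GB9c check: a consonant-linker-consonant sequence suppresses the boundary.
function gb9cSuppressesBoundary(sawConsonantLinker, nextCat) {
  return sawConsonantLinker && nextCat === 'InCB_Consonant';
}

function isBoundary(prevCat, nextCat, sawConsonantLinker) {
  const main = mainBoundaryRule(prevCat, nextCat);
  if (main !== undefined) return main; // common path: GB9c never runs
  return !gb9cSuppressesBoundary(sawConsonantLinker, nextCat);
}
```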
- 8b23df9: Two further optimizations:
  1. Remove already-inlined ranges from the data file.
  2. Add an inlined range: 0xAC00-0xD7A3 (Hangul syllables) can easily be inlined.

  Number 1 is something I forgot in the #104 task, but it was only a slight change.

  Number 2, by the way, is a huge finding: it is a very large range to be newly inlined.

  Applying both optimizations significantly reduced the bundle size and memory footprint:
  - Size (min): 12,549 bytes -> 6,846 bytes (-45.5%)
  - Size (min+gz): 5,314 bytes -> 3,449 bytes (-35.1%)
  - Index memory usage: 14,272 bytes -> 8,686 bytes (-39.2%)

  Of course, with no perf regression.
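A rough sketch of the new inlined range, with hypothetical names (this is not the library's actual code): because Hangul syllables occupy one huge contiguous block (U+AC00..U+D7A3), a single comparison can answer the lookup before the data table is ever touched. The LV/LVT split inside that block is a standard Unicode arithmetic property, so no table entry is needed at all.

```javascript
// Hypothetical category lookup with the Hangul syllable block inlined.
function graphemeCategory(cp, searchTable) {
  if (cp >= 0xac00 && cp <= 0xd7a3) {
    // Hangul syllables are category LV or LVT; syllables without a
    // trailing jamo are exactly those where (cp - 0xAC00) % 28 === 0.
    return (cp - 0xac00) % 28 === 0 ? 'LV' : 'LVT';
  }
  return searchTable(cp); // fall back to the compressed index
}
```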
[email protected]
Patch Changes
- b7a6e12: Optimizing grapheme break category lookup for better runtime trade-offs. See the issue for the explanation.

  With this change, the library's constant memory footprint is reduced from 64 KB to 14 KB without performance regressions. However, the code size increases slightly due to inlining; it's still relatively small.
[email protected]
Patch Changes
- ac96013: Removed inefficient optimization code from the grapheme segmenter. The single-range cache is barely hit once the entire BMP table hits, so it was removed to reduce code size and comparison count.

  It's worth occupying 64 KB of linear memory for the BMP table. That should definitely be acceptable, since it still uses less heap memory than executing graphemer's uncompressed code.
[email protected]
Minor Changes
-
cbd1a07: Deprecated
unicode-segmenter/utilsentry.Never used internally anymore. It's too simple, better to inline if needed.
Patch Changes
- dbca35f: Improve runtime perf of the Unicode text processing.

  By using a precomputed lookup table for the grapheme categories of BMP characters, it improves perf by more than 10% for common cases, and up to ~30% for some extreme cases. The lookup table consumes an additional 64 KB of memory, which is acceptable for most JavaScript runtime environments.

  This optimization was introduced by OpenCode w/ OpenAI's GPT-OSS-120B. It is the second successful attempt at meaningful optimization in this library. (The first one was Claude Code w/ Claude Opus 4.0.)

- 782290b: Several minor perf improvements and internal cleanup. Even with the new optimization paths, the bundle size has barely increased.
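The dbca35f lookup-table idea above can be sketched as follows. Function and variable names are illustrative, not unicode-segmenter's API: pay 64 KB once for a `Uint8Array` indexed directly by BMP code points, so the hot path becomes a single array read instead of a search.

```javascript
// Build the BMP table once at startup from [start, end, category] ranges.
function buildBmpTable(ranges) {
  const table = new Uint8Array(0x10000); // 65,536 entries = 64 KB
  for (const [start, end, cat] of ranges) {
    for (let cp = start; cp <= Math.min(end, 0xffff); cp++) table[cp] = cat;
  }
  return table;
}

// Hot path: direct indexing for BMP code points, fallback search otherwise.
function categoryOf(cp, table, searchSupplementary) {
  return cp <= 0xffff ? table[cp] : searchSupplementary(cp);
}
```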
[email protected]
Patch Changes
- f2018ed: Optimize the grapheme segmenter. By eliminating unnecessary string concatenation, it significantly improves performance when creating large segments (e.g. Demonic, Hindi, Flags, Skin tones). Also reduced the memory footprint of the internal segment buffer.

- fa9d58e: Optimize grapheme cluster boundary checking.
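The f2018ed concatenation fix can be illustrated with a simplified, hypothetical segmenter (not the library's actual code): instead of growing a segment string with `+=` per code unit, remember the segment's start index and take a single `slice` when the boundary is found.

```javascript
// Before: repeated concatenation builds many intermediate strings,
// which is costly for long segments.
function segmentByConcat(text, isBoundaryAt) {
  const out = [];
  let seg = '';
  for (let i = 0; i < text.length; i++) {
    if (i > 0 && isBoundaryAt(i)) { out.push(seg); seg = ''; }
    seg += text[i];
  }
  if (seg) out.push(seg);
  return out;
}

// After: one slice per segment, no intermediate strings.
function segmentBySlice(text, isBoundaryAt) {
  const out = [];
  let start = 0;
  for (let i = 1; i < text.length; i++) {
    if (isBoundaryAt(i)) { out.push(text.slice(start, i)); start = i; }
  }
  if (start < text.length) out.push(text.slice(start));
  return out;
}
```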
[email protected]
Patch Changes
- 88a22e2: grapheme: improve runtime perf by ~9% for most common use cases
[email protected]
Minor Changes
- 75492dc: Expose an internal state, `_hd`: the first code point of a segment, whose bounds often need to be checked.

  For example:

  ```ts
  for (const { segment } of graphemeSegments(text)) {
    const cp = segment.codePointAt(0)!; // Also needs a `!` assertion in TypeScript.
    if (isBMP(cp)) {
      // ...
    }
  }
  ```

  The `segment.codePointAt(0)` call can be replaced by the `_hd` state, with no additional overhead.
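A runnable sketch of the `_hd` usage pattern. The `graphemeSegments` and `isBMP` below are stubs standing in for unicode-segmenter's real exports (the stub naively yields one segment per code point); only the way `_hd` replaces `segment.codePointAt(0)` matters here.

```javascript
// Stub iterator exposing `_hd` like the real one does in this release.
function* graphemeSegments(text) {
  for (const ch of text) {
    yield { segment: ch, _hd: ch.codePointAt(0) };
  }
}
const isBMP = (cp) => cp <= 0xffff;

function countBmpSegments(text) {
  let n = 0;
  for (const { _hd } of graphemeSegments(text)) {
    // `_hd` replaces `segment.codePointAt(0)` (and the TS `!` assertion).
    if (isBMP(_hd)) n++;
  }
  return n;
}
```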
Patch Changes
- cd63858: Export bundled entries (`/bundle/*.js`)
[email protected]
Minor Changes
- 21cd789: Removed deprecated APIs:
  - `searchGrapheme` in `unicode-segmenter/grapheme`
  - `takeChar` and `takeCodePoint` in `unicode-segmenter/utils`

  These were used internally before, but never from outside.
- 483d258: Reduced the bundle size, while keeping the best perf.

  Some details:
  - Refactored to share the same code path internally as much as possible.
  - Removed the pre-computed jump table; its optimization was compensated for by other perf improvements.
  - The previous array layout meant to avoid accidental de-opts turned out to be overkill. A regular tuple array is well optimized, so I fell back to using good old plain binary search.
  - Experimented with a new encoding and an Eytzinger layout for more aggressive improvements, but without success.
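The "good old plain binary search" over a regular tuple array can be sketched like this (names and the `[start, end, category]` tuple shape are illustrative assumptions, not the library's actual layout):

```javascript
// Binary search over sorted, non-overlapping [start, end, category] ranges.
function searchCategory(ranges, cp, notFound = 0) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, cat] = ranges[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return cat; // cp falls inside ranges[mid]
  }
  return notFound; // cp is not covered by any range
}
```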