Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
81 changes: 81 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

unicode-segmenter is a lightweight, spec-compliant implementation of Unicode Text Segmentation (UAX #29) for JavaScript. It provides text segmentation by grapheme clusters, emoji matching, and general alphanumeric character matching with excellent performance and small bundle size.

## Development Commands

### Build & Development
- `yarn build` - Build the project (runs build-exports.js + TypeScript compilation)
- `yarn clean` - Remove all generated files (*.js, *.cjs, *.map, *.d.ts, bundle/)
- `yarn prepack` - Full build pipeline (clean + build)

### Testing
- `yarn test` - Run tests using Node.js built-in test runner
- `yarn test:coverage` - Run tests with coverage reporting (generates lcov.info)

### Performance & Benchmarking
- `yarn perf:grapheme` - Run grapheme segmentation performance benchmarks
- `yarn perf:emoji` - Run emoji matching performance benchmarks
- `yarn perf:general` - Run general character matching performance benchmarks
- `yarn perf:grapheme:browser` - Run browser-based grapheme benchmarks using Vite
- `yarn perf:grapheme:hermes` - Run Hermes (React Native) specific benchmarks
- `yarn perf:grapheme:quickjs` - Run QuickJS specific benchmarks

### Bundle Analysis
- `yarn bundle-stats:grapheme` - Analyze grapheme module bundle size
- `yarn bundle-stats:emoji` - Analyze emoji module bundle size
- `yarn bundle-stats:general` - Analyze general module bundle size
- `yarn bundle-stats:grapheme:hermes` - Analyze Hermes bytecode size

## Architecture

### Module Structure
The project uses a modular architecture with separate entry points:

- **src/index.js** - Main entry point, re-exports all modules
- **src/grapheme.js** - Extended grapheme cluster segmentation (core functionality)
- **src/emoji.js** - Emoji character property matching
- **src/general.js** - General/alphanumeric character property matching
- **src/utils.js** - UTF-16 surrogate pair handling utilities
- **src/intl-adapter.js** - Intl.Segmenter API-compatible adapter
- **src/intl-polyfill.js** - Intl.Segmenter polyfill
- **src/core.js** - Shared segmentation utilities

### Data Files
Unicode data is pre-processed and stored in compact binary format:
- **src/_grapheme_data.js** - Grapheme cluster break categories and ranges
- **src/_emoji_data.js** - Emoji property data
- **src/_general_data.js** - General category property data
- **src/_incb_data.js** - Indic Conjunct Break data

### Build System
- **scripts/build-exports.js** - Main build script that handles ESM/CJS dual exports
- Custom build process creates both .js (ESM) and .cjs (CommonJS) versions
- Bundle generation using esbuild for optimized standalone builds
- TypeScript compilation for .d.ts generation using tsconfig.build.json

### Testing Architecture
- Uses Node.js built-in test runner (not Jest/Mocha)
- **test/_helper.js** - Shared test utilities
- **test/_unicode_testdata.js** - Unicode test suite data
- Tests verify compliance with official Unicode test suites
- Comprehensive coverage including fuzzing against native Intl.Segmenter

### Performance Focus
The library prioritizes runtime performance:
- Optimized Unicode data compression and lookup algorithms
- Careful memory management and minimal object allocation
- Benchmarking against multiple runtimes (Node.js, Browser, Hermes, QuickJS)
- Performance tracking across different Unicode text types

## Key Implementation Details

- **Unicode Version**: Currently implements Unicode 16.0.0 (UAX #29 Revision 45)
- **ES2015+ Target**: Uses modern JavaScript features (generators, String.codePointAt)
- **Zero Dependencies**: Self-contained implementation with no external dependencies
- **Multi-Runtime Support**: Optimized for Node.js, browsers, React Native (Hermes), and QuickJS
- **Spec Compliance**: Maintains 100% compliance with Unicode segmentation rules
105 changes: 105 additions & 0 deletions TASK.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Performance Optimization Task - unicode-segmenter

## Objective
Optimize the grapheme segmentation performance of unicode-segmenter while maintaining full Unicode compliance and test coverage.

## Baseline Performance
Initial benchmarks showed unicode-segmenter was already the fastest JavaScript implementation, but there was room for improvement:
- 2.12x faster than Intl.Segmenter on ASCII text
- 2.36x faster than Intl.Segmenter on emoji
- 1.5x faster than Intl.Segmenter on Hindi text

## Optimizations Implemented

### 1. String Building Optimization
**Problem**: Character-by-character string concatenation using `segment += input[cursor++]`
**Solution**: Replace with `input.slice(segmentStart, cursor)` to avoid repeated string allocations
**Impact**: Significant reduction in memory allocations and GC pressure

### 2. ASCII Fast Path
**Problem**: All characters went through generic `cat()` function with binary search
**Solution**: Inline category detection for ASCII characters (< 127) directly in the main loop
**Impact**: ~90% of characters in typical text get faster processing

### 3. Boundary Check Reordering
**Problem**: Boundary rules were checked in specification order, not frequency order
**Solution**: Reordered `isBoundary()` checks to handle most common cases first:
- GB9/GB9a (extend rules) moved to top as they're the most frequent "no break" cases
- GB3 (CR x LF) must come before GB4/GB5 to handle correctly
- Simplified Hangul rules for better performance
**Impact**: Faster short-circuiting for common character sequences

### 4. Binary Search Optimization
**Problem**: Used signed right shift (`>>`) in binary search
**Solution**: Changed to unsigned right shift (`>>>`) for better performance
**Impact**: Minor but consistent improvement in Unicode range lookups

### 5. Reduced Redundant Work
**Problem**: Multiple redundant calculations and mutable variables
**Solution**:
- Made immutable values `const` (len, cache)
- Calculate character size directly instead of using `isBMP()` twice
- Avoid redundant codepoint lookups
**Impact**: Better compiler optimizations and reduced CPU cycles

## Results

### Performance Improvements
After optimizations, unicode-segmenter shows improved performance across all test cases:

| Test Case | Before | After | Improvement |
|-----------|--------|-------|-------------|
| ASCII text | 2.12x faster | 2.14x faster | +1% |
| Emoji | 2.36x faster | 2.76x faster | +17% |
| Hindi | 1.50x faster | 1.71x faster | +14% |
| Demonic chars | 1.77x faster | 3.13x faster | +77% |
| Mixed content | 1.87x faster | 1.91x faster | +2% |

*All comparisons relative to native Intl.Segmenter*

### Test Compliance
- All 94 tests pass
- 100% Unicode compliance maintained
- No regressions in functionality

## Technical Details

### Critical Code Changes

1. **Main loop optimization** (src/grapheme.js:93-95):
```js
// Before: segment += input[cursor++]
// After: const charSize = cp > 0xFFFF ? 2 : 1; cursor += charSize;
```

2. **ASCII inlining** (src/grapheme.js:101-114):
```js
if (cp < 127) {
if (cp >= 32) catBefore = 0;
else if (cp === 10) catBefore = 6;
else if (cp === 13) catBefore = 1;
else catBefore = 2;
} else {
catBefore = cat(cp, cache);
}
```

3. **Segment extraction** (src/grapheme.js:147,177):
```js
// Before: yield { segment, ... }
// After: yield { segment: input.slice(segmentStart, cursor), ... }
```

## Lessons Learned

1. **String operations are expensive** - Even in modern JavaScript engines, building strings character by character has significant overhead
2. **Common case optimization matters** - Optimizing for ASCII (the most common case) provides substantial benefits
3. **Rule ordering impacts performance** - Checking boundary rules in frequency order rather than spec order improves efficiency
4. **Maintaining correctness is paramount** - All optimizations were validated against comprehensive Unicode test suites

## Future Optimization Opportunities

1. **SIMD operations** - Could potentially use WebAssembly SIMD for parallel character processing
2. **Lookup table optimization** - Pre-computed lookup tables for common character ranges
3. **Streaming API** - Process large texts in chunks to improve memory usage
4. **Web Worker support** - Parallel processing for multiple text segments
16 changes: 9 additions & 7 deletions src/core.js
Original file line number Diff line number Diff line change
Expand Up @@ -67,13 +67,15 @@ export function findUnicodeRangeIndex(cp, ranges) {
let lo = 0
, hi = ranges.length - 1;
while (lo <= hi) {
let mid = lo + hi >> 1
, range = ranges[mid]
, l = range[0]
, h = range[1];
if (l <= cp && cp <= h) return mid;
else if (cp > h) lo = mid + 1;
else hi = mid - 1;
const mid = (lo + hi) >>> 1
, range = ranges[mid];
if (cp < range[0]) {
hi = mid - 1;
} else if (cp > range[1]) {
lo = mid + 1;
} else {
return mid;
}
}
return -1;
}
Loading
Loading