Commit d3577e3

Add v3.0.0: Hierarchical break detection

Performance improvements:
- Words checked only at grapheme boundaries
- Sentences checked only at word boundaries
- Single UTF-8 decode and classification pass
- Pre-classified data reused across all break types

Benchmark results (Apple M4 Pro):
- Short text: 1.57x faster (3,457ns → 2,197ns)
- Medium text: 1.68x faster (16,191ns → 9,636ns)
- Long text: 2.24x faster (423,491ns → 188,982ns)

Speedup increases with text length due to hierarchical pruning.

Maintains 100% Unicode conformance:
- Grapheme: 766/766 tests
- Word: 1,944/1,944 tests
- Sentence: 512/512 tests
1 parent 539a623 commit d3577e3

3 files changed: +1313 -0 lines changed

README.md

Lines changed: 36 additions & 0 deletions
@@ -256,6 +256,42 @@ breaks := uax29.FindAllBreaks(text)

This provides a convenient API for applications that need multiple break types, with a framework in place for future hierarchical optimization.

## Version 3.0.0 Performance Improvements

Version 3.0.0 focuses on hierarchical optimization of the single-pass API introduced in v2.0.0.

### Hierarchical Break Detection

The `FindAllBreaks()` API now implements true hierarchical checking, leveraging the natural subset relationships between break types:

- **Word boundaries ⊆ grapheme boundaries**: Word breaks are only checked at grapheme cluster boundaries
- **Sentence boundaries ⊆ word boundaries**: Sentence breaks are only checked at word boundaries

This eliminates redundant checks and improves performance for applications that need multiple break types (1.57x to 2.24x in the benchmarks below).

### Performance Improvements

Benchmark results on Apple M4 Pro comparing v3.0.0 single-pass vs three separate function calls:

| Text Length | v2.0.0 Three Passes | v3.0.0 Single Pass | Speedup |
|-------------|--------------------|--------------------|---------|
| Short (33 chars) | 3,457 ns/op | 2,197 ns/op | **1.57x** |
| Medium (86 chars) | 16,191 ns/op | 9,636 ns/op | **1.68x** |
| Long (467 chars) | 423,491 ns/op | 188,982 ns/op | **2.24x** |

**Key benefits:**
- Speedup increases with text length (hierarchical pruning is more effective on longer text)
- Single UTF-8 decode and classification pass
- Pre-classified data reused across all three break types
- No additional allocations compared to v2.0.0

### Maintained Conformance

100% conformance maintained on all official Unicode test suites:
- Grapheme: 766/766 tests passing
- Word: 1,944/1,944 tests passing
- Sentence: 512/512 tests passing

## References

### Metastandards
