Releases: SCKelemen/unicode
v6.0.0: Memory Optimization and ASCII Fast Paths
Version 6.0.0 focuses on memory optimization and ASCII fast paths to dramatically improve performance for common cases while maintaining 100% Unicode conformance.
🚀 Performance Improvements
ASCII Fast Paths (100x+ Speedups)
UTS #15 (Normalization):
- ASCII NFC normalization: 129x faster (7.68 ns/op vs 995 ns/op)
- ASCII NFKC normalization: 144x faster (7.72 ns/op vs 1,115 ns/op)
- 🎯 ASCII text normalization is essentially free (a single `isASCII()` check)
UTS #39 (Security):
- ASCII mixed-script check: 34x faster (4.18 ns/op vs 142 ns/op)
- ASCII safe identifier check: 3.7x faster (74.7 ns/op vs 277 ns/op)
Unicode Text Improvements
UTS #39 (Security):
- Skeleton algorithm: 2.5x faster (174 ns/op vs 430 ns/op)
- Confusable detection: 1.7x faster (502 ns/op vs 874 ns/op)
UTS #15 (Normalization):
- NFKC: 8% faster (5,390 ns/op vs 5,877 ns/op)
- NFKD: 6% faster (3,135 ns/op vs 3,337 ns/op)
💾 Memory Improvements
Type Size Reductions
| Component | Before | After | Savings |
|---|---|---|---|
| combiningClassMap (UTS #15) | ~15.5 KB | ~7.75 KB | 50% (7.75 KB) |
| Script type (UAX #24) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |
| BreakClass type (UAX #14) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |
🎯 All runtime structures using these types are 50-87.5% smaller with better CPU cache density.
🔧 Technical Changes
Type Reductions
- UTS #15: `combiningClassMap` changed from `map[rune]int` to `map[rune]uint8`
  - Unicode combining classes range 0-240, fit perfectly in uint8 (0-255)
- UAX #24: `Script` type changed from `int` to `uint8`
  - 176 Unicode scripts fit comfortably in uint8 (0-255)
- UAX #14: `BreakClass` type changed from `int` to `uint8`
  - 66 break classes fit in uint8 (0-255)
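The per-value savings can be checked with `unsafe.Sizeof`. This is a minimal sketch: `wideScript` and `compactScript` are hypothetical stand-ins for a property type before and after the change, not the library's own definitions, and the 8-byte figure assumes a 64-bit platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// wideScript and compactScript are hypothetical stand-ins for a property
// type before and after the change; they are not the library's definitions.
type wideScript int
type compactScript uint8

// sizes reports the per-value size in bytes of each representation.
func sizes() (wide, compact uintptr) {
	return unsafe.Sizeof(wideScript(0)), unsafe.Sizeof(compactScript(0))
}

func main() {
	wide, compact := sizes()
	// int is 8 bytes on 64-bit platforms; uint8 is always 1 byte.
	fmt.Printf("int-backed: %d bytes, uint8-backed: %d byte\n", wide, compact)

	// A slice of 1,000 classified values shrinks by the same ratio.
	fmt.Println(1000*int(wide), "bytes vs", 1000*int(compact), "bytes")
}
```

Smaller element types matter most in slices of pre-classified characters, where more values fit in each cache line.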
ASCII Fast Paths
UTS #15 (Normalization):
- Added `isASCII()` check to NFC, NFD, NFKC, NFKD functions
- ASCII text is already normalized in all forms
- Avoids expensive decomposition/composition operations
UTS #39 (Security):
- ASCII fast paths in `IsMixedScript()` - ASCII is single-script (Latin)
- ASCII fast paths in `IsSafeIdentifier()` - ASCII identifiers only need validation
- Skips expensive script analysis for common identifiers
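The fast-path idea can be sketched as follows. This is an illustration under assumptions, not the library's code: `normalizeWithFastPath` and its `slow` parameter are hypothetical names, and the real `isASCII()` may be implemented differently.

```go
package main

import "fmt"

// isASCII reports whether s contains only bytes below 0x80.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

// normalizeWithFastPath is a hypothetical wrapper: ASCII input is returned
// unchanged because ASCII is already in NFC, NFD, NFKC, and NFKD. The slow
// parameter stands in for the full decomposition/composition path.
func normalizeWithFastPath(s string, slow func(string) string) string {
	if isASCII(s) {
		return s
	}
	return slow(s)
}

func main() {
	calls := 0
	slow := func(s string) string { calls++; return s }

	normalizeWithFastPath("hello", slow) // ASCII: slow path skipped
	normalizeWithFastPath("héllo", slow) // non-ASCII: slow path taken

	fmt.Println("slow path calls:", calls) // prints "slow path calls: 1"
}
```

The check is a single linear scan with no allocation, which is why the ASCII case costs only a few nanoseconds for short inputs.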
🌍 Real-World Impact
Typical web application (mostly ASCII identifiers):
- Variable name validation: 34x faster
- URL normalization: 129x faster
- Username security checks: 3.7x faster
International text (mixed Unicode):
- Skeleton computation: 2.5x faster
- Text normalization: 6-8% faster (NFKD, NFKC)
- Security validation: 1.7x faster
✅ Conformance Maintained
100% conformance maintained on all official Unicode test suites:
- UTS #15: 20,034/20,034 normalization tests passing
- UAX #24: 159,866/159,866 script property tests passing
- UTS #39: 6,565/6,565 confusable mappings verified
- Total: All 207,333 tests passing
🎯 Key Benefits
✅ ASCII normalization: 129-144x faster (essentially free)
✅ ASCII security checks: 34x faster
✅ Skeleton algorithm: 2.5x faster for all text
✅ Confusable detection: 1.7x faster for all text
✅ Memory footprint: ~15-20 KB saved, 50-87.5% reduction in type sizes
✅ Conformance: 100% maintained
📝 Design Philosophy
The optimizations excel at what matters most:
- Common case (ASCII) is blazingly fast (100x+ speedups)
- Full Unicode support still provides solid improvements (1.7-2.5x)
- 100% correctness maintained everywhere
🔨 Breaking Changes
None. All changes are backwards compatible.
📦 Installation
go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]
🙏 Benchmarks
All benchmarks run on Apple M4 Pro. See the README for detailed benchmark results and methodology.
v5.0.0 - Rule-Based Line Breaking Architecture 🐻
🎯 Major Achievement: 100% UAX #14 Conformance
This release extends the rule-based state machine architecture from UAX #29 (v4.0.0) to UAX #14 (Line Breaking Algorithm), achieving 100% conformance on all 19,338 official Unicode tests.
✨ What's New
Rule-Based Line Breaking Implementation
UAX #14 now uses a clean, rule-based architecture that directly maps to the Unicode Standard specification:
- LineBreakContext abstraction: Clean navigation API with helper methods
  - `SkipBackward`/`SkipForward`: Skip over combining marks (LB9 rule)
  - `FindForward`/`FindBackward`: Search for target classes
  - `MatchSequence`: Pattern matching for rule sequences
- 59 Named rule functions: Each Unicode rule (LB4, LB5, LB8, LB21, etc.) becomes a named, testable function
- Declarative rule chains: First-match-wins strategy with clear precedence
- Pair table fallback: Common cases handled by efficient 2,064-entry lookup table
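A first-match-wins rule chain with a pair-table fallback might look like the sketch below. All names here (`Action`, `Class`, `rule`, `decide`, the two rule functions, and the tiny `pairTable`) are hypothetical and deliberately incomplete; the actual package defines 59 rules and a 2,064-entry table.

```go
package main

import "fmt"

// Action is the outcome of a rule: no opinion, break allowed, or prohibited.
type Action int

const (
	NoMatch Action = iota
	Break
	NoBreak
)

// Class is a tiny hypothetical subset of the line break classes.
type Class uint8

const (
	AL Class = iota // alphabetic
	SP              // space
	GL              // non-breaking glue
)

// rule inspects the classes on either side of a candidate break position.
type rule func(before, after Class) Action

// ruleNoBreakBeforeSpace is modeled on LB7: do not break before spaces.
func ruleNoBreakBeforeSpace(_, after Class) Action {
	if after == SP {
		return NoBreak
	}
	return NoMatch
}

// ruleNoBreakAfterGlue is modeled on LB12: do not break after glue.
func ruleNoBreakAfterGlue(before, _ Class) Action {
	if before == GL {
		return NoBreak
	}
	return NoMatch
}

// chain lists rules in precedence order; the first match decides.
var chain = []rule{ruleNoBreakBeforeSpace, ruleNoBreakAfterGlue}

// pairTable is an illustrative fallback for pairs no rule claimed.
var pairTable = map[[2]Class]Action{
	{SP, AL}: Break,   // break after spaces (LB18)
	{AL, AL}: NoBreak, // keep alphabetic runs together (LB28)
}

// decide runs the chain, then the table, then the default (LB31).
func decide(before, after Class) Action {
	for _, r := range chain {
		if a := r(before, after); a != NoMatch {
			return a
		}
	}
	if a, ok := pairTable[[2]Class{before, after}]; ok {
		return a
	}
	return Break
}

func main() {
	fmt.Println(decide(AL, SP) == NoBreak, decide(SP, AL) == Break)
}
```

Keeping each rule as a separate function lets a test target one rule in isolation, and the order of `chain` makes precedence explicit.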
100% Conformance Fixes
Achieved perfect conformance by fixing these edge cases:
- French guillemet separators (`»word«` pattern)
  - Pattern: « SP ÷ AL when part of emphasis, not quotation
  - U+00AB/U+00BB require special break handling
- German quotes (`„..."` and `‚...'` patterns)
  - ClassQU_Pi acts as closing quote (not opening)
  - U+201E/U+201A (ClassOP) open, U+201C/U+2018 (ClassQU_Pi) close
- Hebrew MAQAF (U+05BE hyphen)
  - HL × HH ÷ HL pattern for Hebrew hyphen
  - New `ruleLB21_HH_Break` handles (HL | AL) × HH ÷ HL
- Regional indicators with combining marks
  - RI × CM × RI sequences
  - `ruleLB30a` now skips CM/ZWJ when counting RIs
- Extended pictographic × emoji modifier
  - Reserved emoji ranges (U+1F000-U+1FFFD)
  - `ruleLB30b` checks `isExtendedPictographic` for any base class
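The regional indicator fix above can be illustrated with a small sketch. LB30a prohibits a break between two regional indicators when an odd number of RIs precedes the position, so flags stay paired. The function and class names below are hypothetical, and the sketch assumes characters are already classified; the real `ruleLB30a` works through the LineBreakContext API.

```go
package main

import "fmt"

// Class is a hypothetical subset of line break classes.
type Class uint8

const (
	AL  Class = iota // alphabetic
	RI               // regional indicator
	CM               // combining mark
	ZWJ              // zero width joiner
)

// noBreakBetweenRI reports whether an LB30a-style rule prohibits a break
// before classes[i]. It counts the regional indicators immediately before
// position i, skipping CM and ZWJ (which attach to their base under LB9),
// and prohibits the break when that count is odd.
func noBreakBetweenRI(classes []Class, i int) bool {
	if i <= 0 || i >= len(classes) || classes[i] != RI {
		return false
	}
	count := 0
scan:
	for j := i - 1; j >= 0; j-- {
		switch classes[j] {
		case RI:
			count++
		case CM, ZWJ:
			// Attached to the preceding base; does not end the RI run.
		default:
			break scan
		}
	}
	return count%2 == 1
}

func main() {
	// RI × CM × RI: the combining mark no longer hides the first RI.
	fmt.Println(noBreakBetweenRI([]Class{RI, CM, RI}, 2))
	// RI RI ÷ RI: two RIs already form a pair, so a break is allowed.
	fmt.Println(noBreakBetweenRI([]Class{RI, RI, RI}, 2))
}
```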
📊 Test Results
Total tests: 19,338
Passed: 19,338 (100.0%)
Failed: 0 (0.0%)
🏗️ Architecture Benefits
Before (Original Implementation)
- 1,112-line monolithic function
- Complex inline conditionals
- Difficult to debug and extend
After (Rule-Based Implementation)
- Isolated, independently testable rule functions
- Direct spec mapping (ruleLB4, ruleLB21, etc.)
- Clear documentation with spec links
- Easy to add new rules without refactoring
- No massive conditional chains
⚡ Performance Impact
The rule-based implementation is roughly 2.4-2.75x slower in these benchmarks due to abstraction overhead:
| Text Length | Original | Rule-Based | Change |
|---|---|---|---|
| Short (10 chars) | 494 ns/op | 1,360 ns/op | 2.75x slower |
| Medium (64 chars) | 3,934 ns/op | 9,374 ns/op | 2.38x slower |
| Long (45 chars) | 2,138 ns/op | 5,209 ns/op | 2.44x slower |
Trade-off: Performance remains excellent for text layout (thousands of characters per millisecond). The maintainability benefits far outweigh the performance cost for this use case.
🐻 License Update
Updated to BearWare 1.0 - MIT License with bear emojis:
- Less corporate feel
- Easy to detect in the wild
- Shows we're weekend warriors, not a corporation
📦 New Files
- `uax14/context.go` - LineBreakContext abstraction (354 lines)
- `uax14/linebreak_rules.go` - Rule-based implementation (1,786 lines, 59 rule functions)
- `uax14/linebreak_rules_test.go` - Test suite with conformance tests
- `uax14/LINEBREAK_RULES.md` - Comprehensive rule documentation
- `LICENSE` - BearWare 1.0 license with bear emoji ASCII art
🔧 Breaking Changes
None - the original implementation remains available as FindLineBreakOpportunities. The new rule-based implementation is exposed via FindLineBreakOpportunitiesWithRules for testing and comparison.
🎓 What This Means
This architecture provides:
- Direct spec mapping: Rule functions named after Unicode spec rules
- Independent testing: Each rule can be tested and traced independently
- Clear debugging: Rule execution can be logged to understand break decisions
- Easy updates: New Unicode versions can add rules without refactoring
- Reduced complexity: No massive conditional chains or inline state tracking
This matches the successful pattern from UAX #29 v4.0.0, providing consistency across the codebase.
🙏 Acknowledgments
This release demonstrates rigorous engineering while maintaining a personal, accessible approach. Made with care by weekend warriors. 🐻
Full Changelog: v4.0.0...v5.0.0
v4.0.0: Rule-Based State Machine Architecture
This release focuses on code quality and maintainability through rule-based state machine architecture for all break detection algorithms.
New Features
- BreakContext abstractions: `GraphemeBreakContext`, `WordBreakContext`, `SentenceBreakContext` provide clean navigation APIs
- Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
- Declarative rule chains: Rules checked in order with first-match-wins strategy
- Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries
Code Organization
New files implementing the rule-based architecture:
- `context.go` - Break context abstractions with navigation methods (661 lines)
- `grapheme_rules.go` - Grapheme breaking rules (ruleGB3 through ruleGB12_13, 308 lines)
- `word_rules.go` - Word breaking rules (ruleWB3 through ruleWB15_16, 376 lines)
- `sentence_rules.go` - Sentence breaking rules (ruleSB3 through ruleSB11, 244 lines)
- `single_pass.go` - Cleaned up to use rule-based implementations (96 lines vs 574 lines)
Performance (Apple M4 Pro)
Rule-based grapheme breaking alone:
| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Speedup |
|---|---|---|---|
| Short (33 chars) | 1,882 ns/op | 1,183 ns/op | 1.59x |
| Medium (86 chars) | 8,759 ns/op | 3,041 ns/op | 2.88x |
| Long (467 chars) | 168,060 ns/op | 15,170 ns/op | 11.08x |
Single-Pass API:
| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Change |
|---|---|---|---|
| Short (33 chars) | 2,197 ns/op | 2,717 ns/op | 1.24x slower |
| Medium (86 chars) | 9,636 ns/op | 6,647 ns/op | 1.45x faster |
| Long (467 chars) | 188,982 ns/op | 32,200 ns/op | 5.87x faster |
Single-Pass vs Three Separate Passes (v4.0.0):
| Text Length | Single Pass | Three Separate | Speedup |
|---|---|---|---|
| Short (33 chars) | 2,717 ns/op | 3,380 ns/op | 1.24x |
| Medium (86 chars) | 6,647 ns/op | 14,312 ns/op | 2.15x |
| Long (467 chars) | 32,200 ns/op | 239,624 ns/op | 7.44x |
Key findings:
- Rule-based grapheme breaking provides 1.6-11x speedup over inline implementation
- Performance improvements increase dramatically with text length
- Single-pass API maintains significant advantage over three separate calls
- Medium and long texts benefit most from rule-based architecture
Benefits
- Readability: Rules directly match Unicode Standard specification
- Maintainability: Easy to understand, modify, and extend
- Debuggability: Each rule can be tested and traced independently
Conformance
100% conformance maintained on all official Unicode test suites:
- Grapheme: 766/766 tests passing
- Word: 1,944/1,944 tests passing
- Sentence: 512/512 tests passing
Installation
go get github.com/SCKelemen/unicode/[email protected]
v3.0.0: Hierarchical Break Detection
Performance Improvements
Version 3.0.0 implements hierarchical optimization for the single-pass FindAllBreaks() API introduced in v2.0.0.
Hierarchical Break Detection
Leverages the natural subset relationships between break types:
- Words ⊆ Graphemes: Word breaks only checked at grapheme cluster boundaries
- Sentences ⊆ Words: Sentence breaks only checked at word boundaries
This eliminates redundant checks and significantly improves performance.
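The pruning can be sketched with toy boundary predicates. Everything below is hypothetical and ASCII-only (`boundaryFunc`, `filter`, `findAll`, and the `toy*` predicates are not the library's API); it only shows how each level tests just the positions the previous level accepted.

```go
package main

import "fmt"

// boundaryFunc reports whether a boundary exists at byte offset pos.
type boundaryFunc func(text string, pos int) bool

// filter keeps the candidate offsets for which isBoundary returns true.
func filter(text string, candidates []int, isBoundary boundaryFunc) []int {
	var out []int
	for _, pos := range candidates {
		if isBoundary(text, pos) {
			out = append(out, pos)
		}
	}
	return out
}

// findAll applies the hierarchy: word checks run only at grapheme
// boundaries, sentence checks only at word boundaries.
func findAll(text string, grapheme, word, sentence boundaryFunc) (g, w, s []int) {
	all := make([]int, 0, len(text)+1)
	for pos := 0; pos <= len(text); pos++ {
		all = append(all, pos)
	}
	g = filter(text, all, grapheme)
	w = filter(text, g, word)
	s = filter(text, w, sentence)
	return g, w, s
}

func isLetter(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
}

// Toy ASCII-only predicates, far simpler than the UAX #29 rules.
func toyGrapheme(text string, pos int) bool { return true }

func toyWord(text string, pos int) bool {
	if pos == 0 || pos == len(text) {
		return true
	}
	return isLetter(text[pos-1]) != isLetter(text[pos])
}

func toySentence(text string, pos int) bool {
	if pos == 0 || pos == len(text) {
		return true
	}
	return pos >= 2 && text[pos-2] == '.' && text[pos-1] == ' '
}

func main() {
	g, w, s := findAll("Hi. Ok.", toyGrapheme, toyWord, toySentence)
	fmt.Println(len(g), "grapheme,", len(w), "word,", len(s), "sentence boundaries")
}
```

In this toy input the sentence predicate runs at five candidate offsets rather than all eight, and the gap widens as text gets longer.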
Benchmark Results
Performance on Apple M4 Pro comparing v3.0.0 single-pass vs three separate function calls:
| Text Length | v2.0.0 Three Passes | v3.0.0 Single Pass | Speedup |
|---|---|---|---|
| Short (33 chars) | 3,457 ns/op | 2,197 ns/op | 1.57x |
| Medium (86 chars) | 16,191 ns/op | 9,636 ns/op | 1.68x |
| Long (467 chars) | 423,491 ns/op | 188,982 ns/op | 2.24x |
Key benefits:
- Speedup increases with text length (hierarchical pruning more effective on longer text)
- Single UTF-8 decode and classification pass
- Pre-classified data reused across all three break types
- No additional memory allocations compared to v2.0.0
Conformance
Maintains 100% conformance on all official Unicode 17.0.0 test suites:
- Grapheme: 766/766 tests passing
- Word: 1,944/1,944 tests passing
- Sentence: 512/512 tests passing
Breaking Changes
None - all existing APIs remain backward compatible.
v2.0.0: Table-Driven O(log n) Architecture
Performance Improvements
Version 2.0.0 focuses on performance optimization while maintaining 100% conformance with Unicode standards.
Table-Driven Binary Search
All packages now use table-driven O(log n) binary search for character classification, replacing sequential O(n) checks:
- UAX #9: Bidi class lookup optimized with 3,060 precomputed ranges from `DerivedBidiClass.txt`
- UAX #29: Unified packed data structure with 4,673 ranges encoding all three break types (grapheme, word, sentence) in 16-bit format
Performance: Character classification now runs at ~60-100 ns/op with 0 allocations on Apple M4 Pro.
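A range-table lookup of this kind can be sketched with `sort.Search` from the standard library. The table contents, `rangeEntry`, and the class constants are hypothetical; the generated tables in the packages hold thousands of ranges and use their own layout.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeEntry maps an inclusive code point range to a class value.
type rangeEntry struct {
	lo, hi rune
	class  uint8
}

const (
	classOther uint8 = iota
	classDigit
	classUpper
	classLower
)

// table is a hypothetical three-entry example. Ranges must be sorted by
// lo and must not overlap for the binary search to be valid.
var table = []rangeEntry{
	{'0', '9', classDigit},
	{'A', 'Z', classUpper},
	{'a', 'z', classLower},
}

// classify finds the first range whose upper bound is >= r in O(log n)
// and then checks that r is not in the gap before that range.
func classify(r rune) uint8 {
	i := sort.Search(len(table), func(i int) bool { return table[i].hi >= r })
	if i < len(table) && table[i].lo <= r {
		return table[i].class
	}
	return classOther
}

func main() {
	for _, r := range "5Qz[" {
		fmt.Printf("%q -> %d\n", r, classify(r))
	}
}
```

Because the table is a package-level slice of fixed-size structs, a lookup allocates nothing, consistent with the 0 allocations reported above.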
Generated Unicode Data
All Unicode property data is now generated directly from official Unicode 17.0.0 data files:
- Download from unicode.org during build
- Parse property files (`DerivedBidiClass.txt`, `GraphemeBreakProperty.txt`, etc.)
- Generate optimized Go code with binary search tables
- Ensures correctness and synchronization with Unicode standard
Single-Pass API
UAX #29 provides a new FindAllBreaks() API that computes grapheme, word, and sentence boundaries in a single traversal.
Conformance
Maintains 100% Unicode conformance on all official test suites:
- UAX #9: 513,494/513,494 tests passing
- UAX #14: 19,338/19,338 tests passing
- UAX #29: 3,222/3,222 tests passing (766+1944+512)
- UTS #51: 5,223/5,223 tests passing
v1.0.0 - Unicode 17.0.0 Implementations
🎉 First stable release of Unicode Standard Annex implementations in Go!
📦 Packages
UAX #11: East Asian Width
- Character width classification for terminal emulators
- Context-aware width resolution for ambiguous characters
- Display width calculations for CJK text
- Unicode 17.0.0 conformance
UTS #51: Unicode Emoji
- Six emoji properties (Emoji, Emoji_Presentation, etc.)
- Terminal width calculation for emoji
- Sequence validation (keycap, tag, modifier, flag, ZWJ)
- 100% conformance (5,223/5,223 tests passing)
UAX #50: Vertical Text Layout
- Vertical orientation properties for East Asian typography
- Four orientation values (Upright, Rotated, Transformed Upright, Transformed Rotated)
- Mixed-script vertical text support
- Unicode 17.0.0 conformance
UAX #9: Bidirectional Algorithm
- Bidirectional text reordering for LTR/RTL scripts
- Full isolating run sequences (BD13)
- Bracket pair handling (N0 rule)
- 100% conformance (513,494/513,494 tests passing)
UAX #14: Line Breaking Algorithm
- Line break opportunity detection
- Three hyphenation modes (none, manual, auto)
- CJK ideographic text support
- 100% conformance (19,338/19,338 tests passing)
UAX #29: Text Segmentation
- Grapheme cluster boundaries (user-perceived characters)
- Word boundaries for text selection
- Sentence boundaries for text processing
- 100% conformance (3,222/3,222 tests passing)
🏆 Achievements
- 541,277/541,277 total tests passing across all packages
- 100% conformance on all testable specifications
- Zero external dependencies - standard library only
- Unicode 17.0.0 - latest Unicode version
- Clean commit history - logical progression from first principles
📜 License
BearWare 1.0 (MIT Compatible) - 🐻🌲🐻❄️ Help the bear. 🐻❄️🌲🐻
🙏 Acknowledgments
Unicode® is a registered trademark of Unicode, Inc.
All Unicode data files are copyright © Unicode, Inc.