Releases: SCKelemen/unicode

v6.0.0: Memory Optimization and ASCII Fast Paths

17 Dec 12:57

Version 6.0.0 focuses on memory optimization and ASCII fast paths to dramatically improve performance for common cases while maintaining 100% Unicode conformance.

🚀 Performance Improvements

ASCII Fast Paths (3.7x to 144x Speedups)

UTS #15 (Normalization):

  • ASCII NFC normalization: 129x faster (7.68 ns/op vs 995 ns/op)
  • ASCII NFKC normalization: 144x faster (7.72 ns/op vs 1,115 ns/op)
  • 🎯 ASCII text normalization now costs about 8 ns/op, the price of a single isASCII() check

UTS #39 (Security):

  • ASCII mixed-script check: 34x faster (4.18 ns/op vs 142 ns/op)
  • ASCII safe identifier check: 3.7x faster (74.7 ns/op vs 277 ns/op)

Unicode Text Improvements

UTS #39 (Security):

  • Skeleton algorithm: 2.5x faster (174 ns/op vs 430 ns/op)
  • Confusable detection: 1.7x faster (502 ns/op vs 874 ns/op)

UTS #15 (Normalization):

  • NFKC: 8% faster (5,390 ns/op vs 5,877 ns/op)
  • NFKD: 6% faster (3,135 ns/op vs 3,337 ns/op)

💾 Memory Improvements

Type Size Reductions

| Component | Before | After | Savings |
|---|---|---|---|
| combiningClassMap (UTS #15) | ~15.5 KB | ~7.75 KB | 50% (7.75 KB) |
| Script type (UAX #24) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |
| BreakClass type (UAX #14) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |

🎯 Every value of these types stored at runtime shrinks by 50-87.5%, so more entries fit in each CPU cache line.

🔧 Technical Changes

Type Reductions

  • UTS #15: combiningClassMap changed from map[rune]int to map[rune]uint8

    • Unicode combining classes range 0-240, fit perfectly in uint8 (0-255)
  • UAX #24: Script type changed from int to uint8

    • 176 Unicode scripts fit comfortably in uint8 (0-255)
  • UAX #14: BreakClass type changed from int to uint8

    • 66 break classes fit in uint8 (0-255)

ASCII Fast Paths

UTS #15 (Normalization):

  • Added isASCII() check to NFC, NFD, NFKC, NFKD functions
  • ASCII text is already normalized in all forms
  • Avoids expensive decomposition/composition operations
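
A minimal sketch of the fast-path idea follows. The release notes name isASCII() but do not show it, so the body here is an assumption, and normalizeWithFastPath with its slow callback is a hypothetical wrapper standing in for the real NFC/NFD/NFKC/NFKD entry points.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether s contains only bytes below 0x80.
// Assumed implementation; the library's own version may differ.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// normalizeWithFastPath (hypothetical) returns s unchanged when it is pure
// ASCII and otherwise defers to slow, a placeholder for the full algorithm.
func normalizeWithFastPath(s string, slow func(string) string) string {
	if isASCII(s) {
		return s // ASCII is already in NFC, NFD, NFKC, and NFKD
	}
	return slow(s)
}

func main() {
	slowCalls := 0
	slow := func(s string) string { slowCalls++; return s }

	normalizeWithFastPath("hello-world_42", slow)
	normalizeWithFastPath("café", slow)
	fmt.Println("slow path calls:", slowCalls)
}
```

Only the non-ASCII input reaches the slow path, which is why the ASCII benchmarks are dominated by the cost of the byte scan.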

UTS #39 (Security):

  • ASCII fast path in IsMixedScript() - ASCII contains only Latin and Common characters, so it can never be mixed-script
  • ASCII fast path in IsSafeIdentifier() - ASCII identifiers only need validation
  • Skips expensive script analysis for common identifiers

🌍 Real-World Impact

Typical web application (mostly ASCII identifiers):

  • Variable name validation: 34x faster
  • URL normalization: 129x faster
  • Username security checks: 3.7x faster

International text (mixed Unicode):

  • Skeleton computation: 2.5x faster
  • Text normalization: 6-8% faster
  • Confusable detection: 1.7x faster

✅ Conformance Maintained

100% conformance maintained on all official Unicode test suites:

  • UTS #15: 20,034/20,034 normalization tests passing
  • UAX #24: 159,866/159,866 script property tests passing
  • UTS #39: 6,565/6,565 confusable mappings verified
  • Total: All 207,333 tests passing

🎯 Key Benefits

ASCII normalization: 129-144x faster (essentially free)
ASCII security checks: 3.7-34x faster
Skeleton algorithm: 2.5x faster for all text
Confusable detection: 1.7x faster for all text
Memory footprint: ~15-20 KB saved, 50-87.5% reduction in type sizes
Conformance: 100% maintained

📝 Design Philosophy

The optimizations excel at what matters most:

  • Common case (ASCII) is blazingly fast (up to 144x for normalization)
  • Full Unicode support still provides solid improvements (1.7-2.5x)
  • 100% correctness maintained everywhere

🔨 Breaking Changes

None. All changes are backwards compatible.

📦 Installation

go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]

🙏 Benchmarks

All benchmarks run on Apple M4 Pro. See the README for detailed benchmark results and methodology.

v5.0.0 - Rule-Based Line Breaking Architecture 🐻

17 Dec 10:43

🎯 Major Achievement: 100% UAX #14 Conformance

This release extends the rule-based state machine architecture from UAX #29 (v4.0.0) to UAX #14 (Line Breaking Algorithm), achieving 100% conformance on all 19,338 official Unicode tests.

✨ What's New

Rule-Based Line Breaking Implementation

UAX #14 now uses a clean, rule-based architecture that directly maps to the Unicode Standard specification:

  • LineBreakContext abstraction: Clean navigation API with helper methods

    • SkipBackward/SkipForward: Skip over combining marks (LB9 rule)
    • FindForward/FindBackward: Search for target classes
    • MatchSequence: Pattern matching for rule sequences
  • 59 Named rule functions: Each Unicode rule (LB4, LB5, LB8, LB21, etc.) becomes a named, testable function

  • Declarative rule chains: First-match-wins strategy with clear precedence

  • Pair table fallback: Common cases handled by efficient 2,064-entry lookup table
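
The rule-chain idea can be illustrated with a simplified sketch. The types and signatures below are assumptions made for the example: the real rule functions operate on a LineBreakContext rather than on a bare pair of classes, and the fallback argument stands in for the pair-table lookup.

```go
package main

import "fmt"

// Decision is the outcome of evaluating one rule.
type Decision int

const (
	NoMatch Decision = iota // rule does not apply; try the next one
	Break                   // break opportunity (÷)
	NoBreak                 // break prohibited (×)
)

// Class is a tiny, hypothetical subset of the line-break classes.
type Class uint8

const (
	AL Class = iota // alphabetic
	SP              // space
	GL              // non-breaking glue
)

// rule inspects the classes on either side of a candidate position.
type rule func(before, after Class) Decision

// ruleLB7: do not break before spaces (× SP).
func ruleLB7(_, after Class) Decision {
	if after == SP {
		return NoBreak
	}
	return NoMatch
}

// ruleLB12: do not break after glue (GL ×).
func ruleLB12(before, _ Class) Decision {
	if before == GL {
		return NoBreak
	}
	return NoMatch
}

// ruleLB18: break after spaces (SP ÷).
func ruleLB18(before, _ Class) Decision {
	if before == SP {
		return Break
	}
	return NoMatch
}

// chain lists the rules in spec order; the first rule that matches wins.
var chain = []rule{ruleLB7, ruleLB12, ruleLB18}

// decide runs the chain and uses fallback when no rule matches.
func decide(before, after Class, fallback Decision) Decision {
	for _, r := range chain {
		if d := r(before, after); d != NoMatch {
			return d
		}
	}
	return fallback
}

func main() {
	fmt.Println(decide(SP, AL, NoBreak) == Break)   // LB18 applies
	fmt.Println(decide(SP, SP, Break) == NoBreak)   // LB7 outranks LB18
	fmt.Println(decide(AL, AL, NoBreak) == NoBreak) // no rule matched; fallback used
}
```

Because precedence is just slice order, inserting or reordering a rule is a one-line change, which is the maintainability benefit the release describes.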

100% Conformance Fixes

Achieved perfect conformance by fixing these edge cases:

  1. French guillemet separators (»word« pattern)

    • Pattern: « SP ÷ AL when part of emphasis, not quotation
    • U+00AB/U+00BB require special break handling
  2. German quotes („...“ and ‚...‘ patterns)

    • ClassQU_Pi acts as closing quote (not opening)
    • U+201E/U+201A (ClassOP) open, U+201C/U+2018 (ClassQU_Pi) close
  3. Hebrew MAQAF (U+05BE hyphen)

    • HL × HH ÷ HL pattern for Hebrew hyphen
    • New ruleLB21_HH_Break handles (HL | AL) × HH ÷ HL
  4. Regional indicators with combining marks

    • RI × CM × RI sequences
    • ruleLB30a now skips CM/ZWJ when counting RIs
  5. Extended pictographic × emoji modifier

    • Reserved emoji ranges (U+1F000-U+1FFFD)
    • ruleLB30b checks isExtendedPictographic for any base class
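
As an illustration of fix 4, here is a sketch of counting regional indicators while skipping CM and ZWJ. It works on a pre-classified slice rather than the library's LineBreakContext, and the function names are hypothetical. It also ignores the corner case of a combining mark with no preceding base.

```go
package main

import "fmt"

// Class is a hypothetical subset of the line-break classes.
type Class uint8

const (
	AL  Class = iota // alphabetic
	RI               // regional indicator
	CM               // combining mark
	ZWJ              // zero width joiner
)

// countRIBefore counts the run of regional indicators ending just before
// pos, skipping CM and ZWJ, which attach to the preceding base (LB9).
func countRIBefore(classes []Class, pos int) int {
	n := 0
	for i := pos - 1; i >= 0; i-- {
		switch classes[i] {
		case RI:
			n++
		case CM, ZWJ:
			continue // transparent: keep scanning backward
		default:
			return n
		}
	}
	return n
}

// noBreakLB30a reports whether LB30a prohibits a break before classes[pos]:
// an RI must not be split from an odd-length run of RIs preceding it.
func noBreakLB30a(classes []Class, pos int) bool {
	if pos <= 0 || pos >= len(classes) || classes[pos] != RI {
		return false
	}
	return countRIBefore(classes, pos)%2 == 1
}

func main() {
	flags := []Class{RI, RI, RI, RI} // two flag emoji
	for pos := 1; pos < len(flags); pos++ {
		fmt.Printf("pos %d: break prohibited = %v\n", pos, noBreakLB30a(flags, pos))
	}

	withMark := []Class{RI, CM, RI} // RI × CM × RI
	fmt.Println("with CM: break prohibited =", noBreakLB30a(withMark, 2))
}
```

Pairs of indicators stay together, a break is permitted between pairs, and an intervening combining mark does not reset the count.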

📊 Test Results

Total tests: 19,338
Passed: 19,338 (100.0%)
Failed: 0 (0.0%)

🏗️ Architecture Benefits

Before (Original Implementation)

  • 1,112-line monolithic function
  • Complex inline conditionals
  • Difficult to debug and extend

After (Rule-Based Implementation)

  • Isolated, independently testable rule functions
  • Direct spec mapping (ruleLB4, ruleLB21, etc.)
  • Clear documentation with spec links
  • Easy to add new rules without refactoring
  • No massive conditional chains

⚡ Performance Impact

The rule-based implementation is 2-3x slower due to abstraction overhead:

| Text Length | Original | Rule-Based | Change |
|---|---|---|---|
| Short (10 chars) | 494 ns/op | 1,360 ns/op | 2.75x slower |
| Medium (64 chars) | 3,934 ns/op | 9,374 ns/op | 2.38x slower |
| Long (45 chars) | 2,138 ns/op | 5,209 ns/op | 2.44x slower |

Trade-off: Performance remains excellent for text layout (thousands of characters per millisecond). The maintainability benefits far outweigh the performance cost for this use case.

🐻 License Update

Updated to BearWare 1.0 - MIT License with bear emojis:

  • Less corporate feel
  • Easy to detect in the wild
  • Shows we're weekend warriors, not a corporation

📦 New Files

  • uax14/context.go - LineBreakContext abstraction (354 lines)
  • uax14/linebreak_rules.go - Rule-based implementation (1,786 lines, 59 rule functions)
  • uax14/linebreak_rules_test.go - Test suite with conformance tests
  • uax14/LINEBREAK_RULES.md - Comprehensive rule documentation
  • LICENSE - BearWare 1.0 license with bear emoji ASCII art

🔧 Breaking Changes

None - the original implementation remains available as FindLineBreakOpportunities. The new rule-based implementation is exposed via FindLineBreakOpportunitiesWithRules for testing and comparison.

🎓 What This Means

This architecture provides:

  1. Direct spec mapping: Rule functions named after Unicode spec rules
  2. Independent testing: Each rule can be tested and traced independently
  3. Clear debugging: Rule execution can be logged to understand break decisions
  4. Easy updates: New Unicode versions can add rules without refactoring
  5. Reduced complexity: No massive conditional chains or inline state tracking

This matches the successful pattern from UAX #29 v4.0.0, providing consistency across the codebase.

🙏 Acknowledgments

This release demonstrates rigorous engineering while maintaining a personal, accessible approach. Made with care by weekend warriors. 🐻


Full Changelog: v4.0.0...v5.0.0

v4.0.0: Rule-Based State Machine Architecture

16 Dec 20:53

This release focuses on code quality and maintainability by moving all break detection algorithms to a rule-based state machine architecture.

New Features

  • BreakContext abstractions: GraphemeBreakContext, WordBreakContext, SentenceBreakContext provide clean navigation APIs
  • Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
  • Declarative rule chains: Rules checked in order with first-match-wins strategy
  • Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries

Code Organization

New files implementing the rule-based architecture:

  • context.go - Break context abstractions with navigation methods (661 lines)
  • grapheme_rules.go - Grapheme breaking rules (ruleGB3 through ruleGB12_13, 308 lines)
  • word_rules.go - Word breaking rules (ruleWB3 through ruleWB15_16, 376 lines)
  • sentence_rules.go - Sentence breaking rules (ruleSB3 through ruleSB11, 244 lines)
  • single_pass.go - Cleaned up to use rule-based implementations (96 lines vs 574 lines)

Performance (Apple M4 Pro)

Rule-based grapheme breaking alone:

| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Speedup |
|---|---|---|---|
| Short (33 chars) | 1,882 ns/op | 1,183 ns/op | 1.59x |
| Medium (86 chars) | 8,759 ns/op | 3,041 ns/op | 2.88x |
| Long (467 chars) | 168,060 ns/op | 15,170 ns/op | 11.08x |

Single-Pass API:

| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Change |
|---|---|---|---|
| Short (33 chars) | 2,197 ns/op | 2,717 ns/op | 1.24x slower |
| Medium (86 chars) | 9,636 ns/op | 6,647 ns/op | 1.45x faster |
| Long (467 chars) | 188,982 ns/op | 32,200 ns/op | 5.87x faster |

Single-Pass vs Three Separate Passes (v4.0.0):

| Text Length | Single Pass | Three Separate | Speedup |
|---|---|---|---|
| Short (33 chars) | 2,717 ns/op | 3,380 ns/op | 1.24x |
| Medium (86 chars) | 6,647 ns/op | 14,312 ns/op | 2.15x |
| Long (467 chars) | 32,200 ns/op | 239,624 ns/op | 7.44x |

Key findings:

  • Rule-based grapheme breaking provides 1.6-11x speedup over inline implementation
  • Performance improvements increase dramatically with text length
  • Single-pass API maintains significant advantage over three separate calls
  • Medium and long texts benefit most from rule-based architecture

Benefits

  • Readability: Rules directly match Unicode Standard specification
  • Maintainability: Easy to understand, modify, and extend
  • Debuggability: Each rule can be tested and traced independently

Conformance

100% conformance maintained on all official Unicode test suites:

  • Grapheme: 766/766 tests passing
  • Word: 1,944/1,944 tests passing
  • Sentence: 512/512 tests passing

Installation

go get github.com/SCKelemen/unicode/[email protected]

v3.0.0: Hierarchical Break Detection

16 Dec 20:20

Performance Improvements

Version 3.0.0 implements hierarchical optimization for the single-pass FindAllBreaks() API introduced in v2.0.0.

Hierarchical Break Detection

Leverages the natural subset relationships between break types:

  • Word boundaries ⊆ grapheme boundaries: word breaks are only checked at grapheme cluster boundaries
  • Sentence boundaries ⊆ word boundaries: sentence breaks are only checked at word boundaries

This eliminates redundant checks and significantly improves performance.
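
The pruning can be sketched as follows. This is an illustration of the idea only: findAll and the Breaks struct are hypothetical and take plain predicate functions, and they are not the actual FindAllBreaks() signature or result type.

```go
package main

import "fmt"

// Breaks collects the boundary positions of each kind (hypothetical type).
type Breaks struct {
	Graphemes, Words, Sentences []int
}

// findAll visits positions 1..n-1 and evaluates a costlier predicate only
// when the enclosing, cheaper boundary type has already matched.
func findAll(n int, isGrapheme, isWord, isSentence func(pos int) bool) Breaks {
	var b Breaks
	for pos := 1; pos < n; pos++ {
		if !isGrapheme(pos) {
			continue // not a grapheme boundary, so not a word or sentence boundary
		}
		b.Graphemes = append(b.Graphemes, pos)
		if !isWord(pos) {
			continue // not a word boundary, so not a sentence boundary
		}
		b.Words = append(b.Words, pos)
		if isSentence(pos) {
			b.Sentences = append(b.Sentences, pos)
		}
	}
	return b
}

func main() {
	wordCalls, sentenceCalls := 0, 0
	// Toy predicates: boundaries at multiples of 2, 4, and 12.
	b := findAll(13,
		func(pos int) bool { return pos%2 == 0 },
		func(pos int) bool { wordCalls++; return pos%4 == 0 },
		func(pos int) bool { sentenceCalls++; return pos%12 == 0 },
	)
	fmt.Println("graphemes:", b.Graphemes)
	fmt.Println("words:", b.Words, "word checks:", wordCalls)
	fmt.Println("sentences:", b.Sentences, "sentence checks:", sentenceCalls)
}
```

With 12 candidate positions, the word predicate runs 6 times and the sentence predicate 3 times, instead of 12 each.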

Benchmark Results

Performance on an Apple M4 Pro, comparing the v3.0.0 single pass against three separate v2.0.0 function calls:

| Text Length | v2.0.0 Three Passes | v3.0.0 Single Pass | Speedup |
|---|---|---|---|
| Short (33 chars) | 3,457 ns/op | 2,197 ns/op | 1.57x |
| Medium (86 chars) | 16,191 ns/op | 9,636 ns/op | 1.68x |
| Long (467 chars) | 423,491 ns/op | 188,982 ns/op | 2.24x |

Key benefits:

  • Speedup increases with text length (hierarchical pruning more effective on longer text)
  • Single UTF-8 decode and classification pass
  • Pre-classified data reused across all three break types
  • No additional memory allocations compared to v2.0.0

Conformance

Maintains 100% conformance on all official Unicode 17.0.0 test suites:

  • Grapheme: 766/766 tests passing
  • Word: 1,944/1,944 tests passing
  • Sentence: 512/512 tests passing

Breaking Changes

None - all existing APIs remain backward compatible.

v2.0.0: Table-Driven O(log n) Architecture

16 Dec 20:20

Performance Improvements

Version 2.0.0 focuses on performance optimization while maintaining 100% conformance with Unicode standards.

Table-Driven Binary Search

All packages now use table-driven O(log n) binary search for character classification, replacing sequential O(n) checks:

  • UAX #9: Bidi class lookup optimized with 3,060 precomputed ranges from DerivedBidiClass.txt
  • UAX #29: Unified packed data structure with 4,673 ranges encoding all three break types (grapheme, word, sentence) in 16-bit format

Performance: Character classification now runs at ~60-100 ns/op with 0 allocations on Apple M4 Pro.
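
A minimal sketch of a range-table lookup with packed 16-bit values follows. The 5/5/5 bit layout, the sample ranges, and the property numbers are assumptions chosen for illustration; the generated tables and their actual encoding are not shown in these notes.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeEntry maps an inclusive code point range to a packed property value.
type rangeEntry struct {
	lo, hi rune
	packed uint16
}

// pack stores three 5-bit property IDs in one uint16 (hypothetical layout).
func pack(grapheme, word, sentence uint16) uint16 {
	return grapheme | word<<5 | sentence<<10
}

func graphemeOf(p uint16) uint16 { return p & 0x1F }
func wordOf(p uint16) uint16     { return (p >> 5) & 0x1F }
func sentenceOf(p uint16) uint16 { return (p >> 10) & 0x1F }

// table must be sorted by lo and contain no overlapping ranges.
// The property IDs are arbitrary sample values.
var table = []rangeEntry{
	{0x0030, 0x0039, pack(0, 1, 2)}, // ASCII digits
	{0x0041, 0x005A, pack(0, 2, 3)}, // ASCII uppercase letters
	{0x0300, 0x036F, pack(1, 3, 4)}, // combining diacritical marks
}

// lookup finds the range containing r in O(log n) using binary search.
func lookup(t []rangeEntry, r rune) (uint16, bool) {
	i := sort.Search(len(t), func(i int) bool { return t[i].hi >= r })
	if i < len(t) && t[i].lo <= r {
		return t[i].packed, true
	}
	return 0, false
}

func main() {
	for _, r := range []rune{'A', ' ', 0x0301} {
		p, ok := lookup(table, r)
		fmt.Printf("U+%04X found=%v grapheme=%d word=%d sentence=%d\n",
			r, ok, graphemeOf(p), wordOf(p), sentenceOf(p))
	}
}
```

One lookup yields all three break properties at once, which is what lets a single classification pass feed grapheme, word, and sentence detection.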

Generated Unicode Data

All Unicode property data is now generated directly from official Unicode 17.0.0 data files:

  • Download from unicode.org during build
  • Parse property files (DerivedBidiClass.txt, GraphemeBreakProperty.txt, etc.)
  • Generate optimized Go code with binary search tables
  • Ensures correctness and synchronization with Unicode standard

Single-Pass API

UAX #29 provides a new FindAllBreaks() API that computes grapheme, word, and sentence boundaries in a single traversal.

Conformance

Maintains 100% Unicode conformance on all official test suites:

  • UAX #9: 513,494/513,494 tests passing
  • UAX #14: 19,338/19,338 tests passing
  • UAX #29: 3,222/3,222 tests passing (766+1944+512)
  • UTS #51: 5,223/5,223 tests passing

v1.0.0 - Unicode 17.0.0 Implementations

16 Dec 10:31

🎉 First stable release of Unicode Standard Annexes implementations in Go!

📦 Packages

UAX #11: East Asian Width

  • Character width classification for terminal emulators
  • Context-aware width resolution for ambiguous characters
  • Display width calculations for CJK text
  • Unicode 17.0.0 conformance

UTS #51: Unicode Emoji

  • Six emoji properties (Emoji, Emoji_Presentation, etc.)
  • Terminal width calculation for emoji
  • Sequence validation (keycap, tag, modifier, flag, ZWJ)
  • 100% conformance (5,223/5,223 tests passing)

UAX #50: Vertical Text Layout

  • Vertical orientation properties for East Asian typography
  • Four orientation values (U, R, Tu, Tr: upright, rotated, transformed-upright, transformed-rotated)
  • Mixed-script vertical text support
  • Unicode 17.0.0 conformance

UAX #9: Bidirectional Algorithm

  • Bidirectional text reordering for LTR/RTL scripts
  • Full isolating run sequences (BD13)
  • Bracket pair handling (N0 rule)
  • 100% conformance (513,494/513,494 tests passing)

UAX #14: Line Breaking Algorithm

  • Line break opportunity detection
  • Three hyphenation modes (none, manual, auto)
  • CJK ideographic text support
  • 100% conformance (19,338/19,338 tests passing)

UAX #29: Text Segmentation

  • Grapheme cluster boundaries (user-perceived characters)
  • Word boundaries for text selection
  • Sentence boundaries for text processing
  • 100% conformance (3,222/3,222 tests passing)

🏆 Achievements

  • 541,277/541,277 total tests passing across all packages
  • 100% conformance on all testable specifications
  • Zero external dependencies - standard library only
  • Unicode 17.0.0 - latest Unicode version
  • Clean commit history - logical progression from first principles

📜 License

BearWare 1.0 (MIT Compatible) - 🐻🌲🐻‍❄️ Help the bear. 🐻‍❄️🌲🐻

🙏 Acknowledgments

Unicode® is a registered trademark of Unicode, Inc.
All Unicode data files are copyright © Unicode, Inc.