17 Dec 12:57

4da5b54

v6.0.0: Memory Optimization and ASCII Fast Paths Latest

Latest

Version 6.0.0: Memory Optimization and ASCII Fast Paths

Version 6.0.0 focuses on memory optimization and ASCII fast paths to dramatically improve performance for common cases while maintaining 100% Unicode conformance.

🚀 Performance Improvements

ASCII Fast Paths (100x+ Speedups)

UTS #15 (Normalization):

ASCII NFC normalization: 129x faster (7.68 ns/op vs 995 ns/op)
ASCII NFKC normalization: 144x faster (7.72 ns/op vs 1,115 ns/op)
🎯 ASCII text normalization is essentially FREE (single isASCII() check)

UTS #39 (Security):

ASCII mixed-script check: 34x faster (4.18 ns/op vs 142 ns/op)
ASCII safe identifier check: 3.7x faster (74.7 ns/op vs 277 ns/op)

Unicode Text Improvements

UTS #39 (Security):

Skeleton algorithm: 2.5x faster (174 ns/op vs 430 ns/op)
Confusable detection: 1.7x faster (502 ns/op vs 874 ns/op)

UTS #15 (Normalization):

NFKC: 8% faster (5,390 ns/op vs 5,877 ns/op)
NFKD: 6% faster (3,135 ns/op vs 3,337 ns/op)

💾 Memory Improvements

Type Size Reductions

Component	Before	After	Savings
combiningClassMap (UTS #15)	~15.5 KB	~7.75 KB	50% (7.75 KB)
Script type (UAX #24)	8 bytes/value	1 byte/value	87.5% (7 bytes)
BreakClass type (UAX #14)	8 bytes/value	1 byte/value	87.5% (7 bytes)

🎯 All runtime structures using these types are 50-87.5% smaller with better CPU cache density.

🔧 Technical Changes

Type Reductions

UTS #15: combiningClassMap changed from map[rune]int to map[rune]uint8
- Unicode combining classes range 0-240, fit perfectly in uint8 (0-255)
UAX #24: Script type changed from int to uint8
- 176 Unicode scripts fit comfortably in uint8 (0-255)
UAX #14: BreakClass type changed from int to uint8
- 66 break classes fit in uint8 (0-255)

ASCII Fast Paths

UTS #15 (Normalization):

Added isASCII() check to NFC, NFD, NFKC, NFKD functions
ASCII text is already normalized in all forms
Avoids expensive decomposition/composition operations

UTS #39 (Security):

ASCII fast paths in IsMixedScript() - ASCII is single-script (Latin)
ASCII fast paths in IsSafeIdentifier() - ASCII identifiers only need validation
Skips expensive script analysis for common identifiers

🌍 Real-World Impact

Typical web application (mostly ASCII identifiers):

Variable name validation: 34x faster
URL normalization: 129x faster
Username security checks: 3.7x faster

International text (mixed Unicode):

Confusable detection: 2.5x faster
Text normalization: 3-8% faster
Security validation: 1.7x faster

✅ Conformance Maintained

100% conformance maintained on all official Unicode test suites:

UTS #15: 20,034/20,034 normalization tests passing
UAX #24: 159,866/159,866 script property tests passing
UTS #39: 6,565/6,565 confusable mappings verified
Total: All 207,333 tests passing

🎯 Key Benefits

✅ ASCII normalization: 129-144x faster (essentially free)
✅ ASCII security checks: 34x faster
✅ Skeleton algorithm: 2.5x faster for all text
✅ Confusable detection: 1.7x faster for all text
✅ Memory footprint: ~15-20 KB saved, 50-87.5% reduction in type sizes
✅ Conformance: 100% maintained

📝 Design Philosophy

The optimizations excel at what matters most:

Common case (ASCII) is blazingly fast (100x+ speedups)
Full Unicode support still provides solid improvements (1.7-2.5x)
100% correctness maintained everywhere

🔨 Breaking Changes

None. All changes are backwards compatible.

📦 Installation

go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]
go get github.com/SCKelemen/unicode/[email protected]

🙏 Benchmarks

All benchmarks run on Apple M4 Pro. See the README for detailed benchmark results and methodology.

Assets 2

17 Dec 10:43

SCKelemen

v5.0.0

c90048f

v5.0.0 - Rule-Based Line Breaking Architecture 🐻

v5.0.0 - Rule-Based Line Breaking Architecture

🎯 Major Achievement: 100% UAX #14 Conformance

This release extends the rule-based state machine architecture from UAX #29 (v4.0.0) to UAX #14 (Line Breaking Algorithm), achieving 100% conformance on all 19,338 official Unicode tests.

✨ What's New

Rule-Based Line Breaking Implementation

UAX #14 now uses a clean, rule-based architecture that directly maps to the Unicode Standard specification:

LineBreakContext abstraction: Clean navigation API with helper methods
- SkipBackward/SkipForward: Skip over combining marks (LB9 rule)
- FindForward/FindBackward: Search for target classes
- MatchSequence: Pattern matching for rule sequences
59 Named rule functions: Each Unicode rule (LB4, LB5, LB8, LB21, etc.) becomes a named, testable function
Declarative rule chains: First-match-wins strategy with clear precedence
Pair table fallback: Common cases handled by efficient 2,064-entry lookup table

100% Conformance Fixes

Achieved perfect conformance by fixing these edge cases:

French guillemet separators (»word« pattern)
- Pattern: « SP ÷ AL when part of emphasis, not quotation
- U+00AB/U+00BB require special break handling
German quotes („..." and ‚...' patterns)
- ClassQU_Pi acts as closing quote (not opening)
- U+201E/U+201A (ClassOP) open, U+201C/U+2018 (ClassQU_Pi) close
Hebrew MAQAF (U+05BE hyphen)
- HL × HH ÷ HL pattern for Hebrew hyphen
- New ruleLB21_HH_Break handles (HL | AL) × HH ÷ HL
Regional indicators with combining marks
- RI × CM × RI sequences
- ruleLB30a now skips CM/ZWJ when counting RIs
Extended pictographic × emoji modifier
- Reserved emoji ranges (U+1F000-U+1FFFD)
- ruleLB30b checks isExtendedPictographic for any base class

📊 Test Results

Total tests: 19,338
Passed: 19,338 (100.0%)
Failed: 0 (0.0%)

🏗️ Architecture Benefits

Before (Original Implementation)

1,112-line monolithic function
Complex inline conditionals
Difficult to debug and extend

After (Rule-Based Implementation)

Isolated, independently testable rule functions
Direct spec mapping (ruleLB4, ruleLB21, etc.)
Clear documentation with spec links
Easy to add new rules without refactoring
No massive conditional chains

⚡ Performance Impact

The rule-based implementation is 2-3x slower due to abstraction overhead:

Text Length	Original	Rule-Based	Change
Short (10 chars)	494 ns/op	1,360 ns/op	2.75x slower
Medium (64 chars)	3,934 ns/op	9,374 ns/op	2.38x slower
Long (45 chars)	2,138 ns/op	5,209 ns/op	2.44x slower

Trade-off: Performance remains excellent for text layout (thousands of characters per millisecond). The maintainability benefits far outweigh the performance cost for this use case.

🐻 License Update

Updated to BearWare 1.0 - MIT License with bear emojis:

Less corporate feel
Easy to detect in the wild
Shows we're weekend warriors, not a corporation

📦 New Files

uax14/context.go - LineBreakContext abstraction (354 lines)
uax14/linebreak_rules.go - Rule-based implementation (1,786 lines, 59 rule functions)
uax14/linebreak_rules_test.go - Test suite with conformance tests
uax14/LINEBREAK_RULES.md - Comprehensive rule documentation
LICENSE - BearWare 1.0 license with bear emoji ASCII art

🔧 Breaking Changes

None - the original implementation remains available as FindLineBreakOpportunities. The new rule-based implementation is exposed via FindLineBreakOpportunitiesWithRules for testing and comparison.

🎓 What This Means

This architecture provides:

Direct spec mapping: Rule functions named after Unicode spec rules
Independent testing: Each rule can be tested and traced independently
Clear debugging: Rule execution can be logged to understand break decisions
Easy updates: New Unicode versions can add rules without refactoring
Reduced complexity: No massive conditional chains or inline state tracking

This matches the successful pattern from UAX #29 v4.0.0, providing consistency across the codebase.

🔗 References

🙏 Acknowledgments

This release demonstrates rigorous engineering while maintaining a personal, accessible approach. Made with care by weekend warriors. 🐻

Full Changelog: v4.0.0...v5.0.0

Assets 2

16 Dec 20:53

SCKelemen

v4.0.0

ac2f800

v4.0.0: Rule-Based State Machine Architecture

Version 4.0.0: Rule-Based State Machine Architecture

This release focuses on code quality and maintainability through rule-based state machine architecture for all break detection algorithms.

New Features

BreakContext abstractions: GraphemeBreakContext, WordBreakContext, SentenceBreakContext provide clean navigation APIs
Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
Declarative rule chains: Rules checked in order with first-match-wins strategy
Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries

Code Organization

New files implementing the rule-based architecture:

context.go - Break context abstractions with navigation methods (661 lines)
grapheme_rules.go - Grapheme breaking rules (ruleGB3 through ruleGB12_13, 308 lines)
word_rules.go - Word breaking rules (ruleWB3 through ruleWB15_16, 376 lines)
sentence_rules.go - Sentence breaking rules (ruleSB3 through ruleSB11, 244 lines)
single_pass.go - Cleaned up to use rule-based implementations (96 lines vs 574 lines)

Performance (Apple M4 Pro)

Rule-based grapheme breaking alone:

Text Length	v3.0.0 Inline	v4.0.0 Rule-Based	Speedup
Short (33 chars)	1,882 ns/op	1,183 ns/op	1.59x
Medium (86 chars)	8,759 ns/op	3,041 ns/op	2.88x
Long (467 chars)	168,060 ns/op	15,170 ns/op	11.08x

Single-Pass API:

Text Length	v3.0.0 Inline	v4.0.0 Rule-Based	Change
Short (33 chars)	2,197 ns/op	2,717 ns/op	1.24x slower
Medium (86 chars)	9,636 ns/op	6,647 ns/op	1.45x faster
Long (467 chars)	188,982 ns/op	32,200 ns/op	5.87x faster

Single-Pass vs Three Separate Passes (v4.0.0):

Text Length	Single Pass	Three Separate	Speedup
Short (33 chars)	2,717 ns/op	3,380 ns/op	1.24x
Medium (86 chars)	6,647 ns/op	14,312 ns/op	2.15x
Long (467 chars)	32,200 ns/op	239,624 ns/op	7.44x

Key findings:

Rule-based grapheme breaking provides 1.6-11x speedup over inline implementation
Performance improvements increase dramatically with text length
Single-pass API maintains significant advantage over three separate calls
Medium and long texts benefit most from rule-based architecture

Benefits

Readability: Rules directly match Unicode Standard specification
Maintainability: Easy to understand, modify, and extend
Debuggability: Each rule can be tested and traced independently

Conformance

100% conformance maintained on all official Unicode test suites:

Grapheme: 766/766 tests passing
Word: 1,944/1,944 tests passing
Sentence: 512/512 tests passing

Installation

go get github.com/SCKelemen/unicode/[email protected]

Assets 2

16 Dec 20:20

SCKelemen

v3.0.0

d3577e3

v3.0.0: Hierarchical Break Detection

Performance Improvements

Version 3.0.0 implements hierarchical optimization for the single-pass FindAllBreaks() API introduced in v2.0.0.

Hierarchical Break Detection

Leverages the natural subset relationships between break types:

Words ⊆ Graphemes: Word breaks only checked at grapheme cluster boundaries
Sentences ⊆ Words: Sentence breaks only checked at word boundaries

This eliminates redundant checks and significantly improves performance.

Benchmark Results

Performance on Apple M4 Pro comparing v3.0.0 single-pass vs three separate function calls:

Text Length	v2.0.0 Three Passes	v3.0.0 Single Pass	Speedup
Short (33 chars)	3,457 ns/op	2,197 ns/op	1.57x
Medium (86 chars)	16,191 ns/op	9,636 ns/op	1.68x
Long (467 chars)	423,491 ns/op	188,982 ns/op	2.24x

Key benefits:

Speedup increases with text length (hierarchical pruning more effective on longer text)
Single UTF-8 decode and classification pass
Pre-classified data reused across all three break types
No additional memory allocations compared to v2.0.0

Conformance

Maintains 100% conformance on all official Unicode 17.0.0 test suites:

Grapheme: 766/766 tests passing
Word: 1,944/1,944 tests passing
Sentence: 512/512 tests passing

Breaking Changes

None - all existing APIs remain backward compatible.

Assets 2

16 Dec 20:20

SCKelemen

v2.0.0

539a623

v2.0.0: Table-Driven O(log n) Architecture

Performance Improvements

Version 2.0.0 focuses on performance optimization while maintaining 100% conformance with Unicode standards.

Table-Driven Binary Search

All packages now use table-driven O(log n) binary search for character classification, replacing sequential O(n) checks:

UAX #9: Bidi class lookup optimized with 3,060 precomputed ranges from DerivedBidiClass.txt
UAX #29: Unified packed data structure with 4,673 ranges encoding all three break types (grapheme, word, sentence) in 16-bit format

Performance: Character classification now runs at ~60-100 ns/op with 0 allocations on Apple M4 Pro.

Generated Unicode Data

All Unicode property data is now generated directly from official Unicode 17.0.0 data files:

Download from unicode.org during build
Parse property files (DerivedBidiClass.txt, GraphemeBreakProperty.txt, etc.)
Generate optimized Go code with binary search tables
Ensures correctness and synchronization with Unicode standard

Single-Pass API

UAX #29 provides a new FindAllBreaks() API that computes grapheme, word, and sentence boundaries in a single traversal.

Conformance

Maintains 100% Unicode conformance on all official test suites:

UAX #9: 513,494/513,494 tests passing
UAX #14: 19,338/19,338 tests passing
UAX #29: 3,222/3,222 tests passing (766+1944+512)
UTS #51: 5,223/5,223 tests passing

Assets 2

16 Dec 10:31

SCKelemen

v1.0.0

79703da

v1.0.0 - Unicode 17.0.0 Implementations

🎉 First stable release of Unicode Standard Annexes implementations in Go!

📦 Packages

UAX #11: East Asian Width

Character width classification for terminal emulators
Context-aware width resolution for ambiguous characters
Display width calculations for CJK text
Unicode 17.0.0 conformance

UTS #51: Unicode Emoji

Six emoji properties (Emoji, Emoji_Presentation, etc.)
Terminal width calculation for emoji
Sequence validation (keycap, tag, modifier, flag, ZWJ)
100% conformance (5,223/5,223 tests passing)

UAX #50: Vertical Text Layout

Vertical orientation properties for East Asian typography
Four orientation values (Rotated, Upright, Transformed)
Mixed-script vertical text support
Unicode 17.0.0 conformance

UAX #9: Bidirectional Algorithm

Bidirectional text reordering for LTR/RTL scripts
Full isolating run sequences (BD13)
Bracket pair handling (N0 rule)
100% conformance (513,494/513,494 tests passing)

UAX #14: Line Breaking Algorithm

Line break opportunity detection
Three hyphenation modes (none, manual, auto)
CJK ideographic text support
100% conformance (19,338/19,338 tests passing)

UAX #29: Text Segmentation

Grapheme cluster boundaries (user-perceived characters)
Word boundaries for text selection
Sentence boundaries for text processing
100% conformance (3,222/3,222 tests passing)

🏆 Achievements

541,277/541,277 total tests passing across all packages
100% conformance on all testable specifications
Zero external dependencies - standard library only
Unicode 17.0.0 - latest Unicode version
Clean commit history - logical progression from first principles

📜 License

BearWare 1.0 (MIT Compatible) - 🐻🌲🐻‍❄️ Help the bear. 🐻‍❄️🌲🐻

🙏 Acknowledgments

Assets 2

Releases: SCKelemen/unicode

v6.0.0: Memory Optimization and ASCII Fast Paths

Version 6.0.0: Memory Optimization and ASCII Fast Paths

🚀 Performance Improvements

ASCII Fast Paths (100x+ Speedups)

Unicode Text Improvements

💾 Memory Improvements

Type Size Reductions

🔧 Technical Changes

Type Reductions

ASCII Fast Paths

🌍 Real-World Impact

✅ Conformance Maintained

🎯 Key Benefits

📝 Design Philosophy

🔨 Breaking Changes

📦 Installation

🙏 Benchmarks

Uh oh!

v5.0.0 - Rule-Based Line Breaking Architecture 🐻

v5.0.0 - Rule-Based Line Breaking Architecture

🎯 Major Achievement: 100% UAX #14 Conformance

✨ What's New

Rule-Based Line Breaking Implementation

100% Conformance Fixes

📊 Test Results

🏗️ Architecture Benefits

Before (Original Implementation)

After (Rule-Based Implementation)

⚡ Performance Impact

🐻 License Update

📦 New Files

🔧 Breaking Changes

🎓 What This Means

🔗 References

🙏 Acknowledgments

Uh oh!

v4.0.0: Rule-Based State Machine Architecture

Version 4.0.0: Rule-Based State Machine Architecture

New Features

Code Organization

Performance (Apple M4 Pro)

Rule-based grapheme breaking alone:

Single-Pass API:

Single-Pass vs Three Separate Passes (v4.0.0):

Benefits

Conformance

Installation

Uh oh!

v3.0.0: Hierarchical Break Detection

Performance Improvements

Hierarchical Break Detection

Benchmark Results

Conformance

Breaking Changes

Uh oh!

v2.0.0: Table-Driven O(log n) Architecture

Performance Improvements

Table-Driven Binary Search

Generated Unicode Data

Single-Pass API

Conformance

Uh oh!

v1.0.0 - Unicode 17.0.0 Implementations

📦 Packages

UAX #11: East Asian Width

UTS #51: Unicode Emoji

UAX #50: Vertical Text Layout

UAX #9: Bidirectional Algorithm

UAX #14: Line Breaking Algorithm

UAX #29: Text Segmentation

🏆 Achievements

📜 License

🙏 Acknowledgments

Uh oh!