v4.0.0: Rule-Based State Machine Architecture

@SCKelemen released this 16 Dec 20:53

Version 4.0.0: Rule-Based State Machine Architecture

This release focuses on code quality and maintainability: all break detection algorithms (grapheme, word, and sentence) now use a rule-based state machine architecture.

New Features

  • BreakContext abstractions: GraphemeBreakContext, WordBreakContext, SentenceBreakContext provide clean navigation APIs
  • Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
  • Declarative rule chains: Rules checked in order with first-match-wins strategy
  • Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries
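
As an illustration of the named-rule and first-match-wins ideas above, here is a minimal sketch. It is not the library's code: `breakContext`, `Decision`, `decide`, and the reduced `Class` set are hypothetical stand-ins, and only GB3, GB4, GB5, GB9 (Extend only, ZWJ omitted) and the GB999 fallback from UAX #29 are modelled.

```go
package main

import "fmt"

// Class is a simplified stand-in for the Grapheme_Cluster_Break property.
type Class int

const (
	Other Class = iota
	CR
	LF
	Control
	Extend
)

// Decision is what a single rule reports at a candidate boundary.
type Decision int

const (
	NoMatch Decision = iota // rule does not apply; fall through to the next rule
	Break                   // a boundary is allowed here
	NoBreak                 // a boundary is forbidden here
)

// breakContext is a hypothetical, reduced analogue of GraphemeBreakContext.
// pos is the candidate boundary between classes[pos-1] and classes[pos].
type breakContext struct {
	classes []Class
	pos     int
}

func (c breakContext) prev() Class { return c.classes[c.pos-1] }
func (c breakContext) next() Class { return c.classes[c.pos] }

func isControlLike(k Class) bool { return k == Control || k == CR || k == LF }

// ruleGB3: do not break between CR and LF.
func ruleGB3(c breakContext) Decision {
	if c.prev() == CR && c.next() == LF {
		return NoBreak
	}
	return NoMatch
}

// ruleGB4: break after Control, CR, or LF.
func ruleGB4(c breakContext) Decision {
	if isControlLike(c.prev()) {
		return Break
	}
	return NoMatch
}

// ruleGB5: break before Control, CR, or LF.
func ruleGB5(c breakContext) Decision {
	if isControlLike(c.next()) {
		return Break
	}
	return NoMatch
}

// ruleGB9: do not break before Extend (ZWJ is omitted in this sketch).
func ruleGB9(c breakContext) Decision {
	if c.next() == Extend {
		return NoBreak
	}
	return NoMatch
}

type rule struct {
	name  string
	apply func(breakContext) Decision
}

// The chain is declarative: order in the slice is order of evaluation.
var graphemeRules = []rule{
	{"GB3", ruleGB3},
	{"GB4", ruleGB4},
	{"GB5", ruleGB5},
	{"GB9", ruleGB9},
}

// decide walks the chain in order; the first rule that matches wins.
// If nothing matches, GB999 (break everywhere else) applies.
func decide(c breakContext) (Decision, string) {
	for _, r := range graphemeRules {
		if d := r.apply(c); d != NoMatch {
			return d, r.name
		}
	}
	return Break, "GB999"
}

func main() {
	ctx := breakContext{classes: []Class{CR, LF, Other, Extend, Other}}
	for ctx.pos = 1; ctx.pos < len(ctx.classes); ctx.pos++ {
		d, name := decide(ctx)
		fmt.Printf("boundary %d: rule %s, break=%v\n", ctx.pos, name, d == Break)
	}
}
```

Because each rule is an ordinary function, it can be unit-tested in isolation, and the returned rule name gives a cheap trace of why a boundary was or was not placed.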

Code Organization

New files implementing the rule-based architecture:

  • context.go - Break context abstractions with navigation methods (661 lines)
  • grapheme_rules.go - Grapheme breaking rules (ruleGB3 through ruleGB12_13, 308 lines)
  • word_rules.go - Word breaking rules (ruleWB3 through ruleWB15_16, 376 lines)
  • sentence_rules.go - Sentence breaking rules (ruleSB3 through ruleSB11, 244 lines)
  • single_pass.go - Cleaned up to use the rule-based implementations (96 lines, down from 574)

Performance (Apple M4 Pro)

Rule-based grapheme breaking alone:

Text Length          v3.0.0 Inline    v4.0.0 Rule-Based    Speedup
Short (33 chars)       1,882 ns/op          1,183 ns/op      1.59x
Medium (86 chars)      8,759 ns/op          3,041 ns/op      2.88x
Long (467 chars)     168,060 ns/op         15,170 ns/op     11.08x

Single-Pass API:

Text Length          v3.0.0 Inline    v4.0.0 Rule-Based    Change
Short (33 chars)       2,197 ns/op          2,717 ns/op    1.24x slower
Medium (86 chars)      9,636 ns/op          6,647 ns/op    1.45x faster
Long (467 chars)     188,982 ns/op         32,200 ns/op    5.87x faster

Single-Pass vs Three Separate Passes (v4.0.0):

Text Length          Single Pass      Three Separate    Speedup
Short (33 chars)     2,717 ns/op         3,380 ns/op      1.24x
Medium (86 chars)    6,647 ns/op        14,312 ns/op      2.15x
Long (467 chars)    32,200 ns/op       239,624 ns/op      7.44x

Key findings:

  • Rule-based grapheme breaking is 1.59x to 11.08x faster than the v3.0.0 inline implementation
  • The speedup grows with text length: 1.59x at 33 chars, 2.88x at 86 chars, 11.08x at 467 chars
  • The single-pass API is 1.24x to 7.44x faster than three separate calls
  • Medium and long texts benefit most from the rule-based architecture; the single-pass API is 1.24x slower than v3.0.0 on the short (33-char) input
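
The single-pass numbers are consistent with the hierarchical ordering this release keeps (words checked only at grapheme boundaries, sentences only at word boundaries). The sketch below shows that control flow under stated assumptions: `boundaryFunc`, `Breaks`, `singlePass`, and the three toy predicates are hypothetical and far simpler than real UAX #29 rules. They exist only to make the nesting runnable.

```go
package main

import (
	"fmt"
	"unicode"
)

// boundaryFunc reports whether a break is allowed before text[i].
type boundaryFunc func(text []rune, i int) bool

// Breaks holds the boundary offsets (in runes) found at each level.
type Breaks struct {
	Grapheme, Word, Sentence []int
}

// singlePass visits each interior position once. The word check runs only
// at grapheme boundaries, and the sentence check only at word boundaries.
func singlePass(text []rune, grapheme, word, sentence boundaryFunc) Breaks {
	var b Breaks
	for i := 1; i < len(text); i++ {
		if !grapheme(text, i) {
			continue
		}
		b.Grapheme = append(b.Grapheme, i)
		if !word(text, i) {
			continue
		}
		b.Word = append(b.Word, i)
		if sentence(text, i) {
			b.Sentence = append(b.Sentence, i)
		}
	}
	return b
}

// Toy predicate: never break before a nonspacing combining mark.
func toyGrapheme(text []rune, i int) bool {
	return !unicode.Is(unicode.Mn, text[i])
}

func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.Is(unicode.Mn, r)
}

// Toy predicate: break where word-like and non-word-like runes meet.
func toyWord(text []rune, i int) bool {
	return isWordRune(text[i-1]) != isWordRune(text[i])
}

// Toy predicate: break after a full stop followed by one space.
func toySentence(text []rune, i int) bool {
	return i >= 2 && text[i-2] == '.' && text[i-1] == ' '
}

func main() {
	text := []rune("He\u0301. Go") // 'e' followed by a combining acute accent
	b := singlePass(text, toyGrapheme, toyWord, toySentence)
	fmt.Println("grapheme:", b.Grapheme)
	fmt.Println("word:    ", b.Word)
	fmt.Println("sentence:", b.Sentence)
}
```

In this arrangement the more expensive higher-level checks are skipped at every position a lower level has already ruled out, which is one plausible reason a single pass can beat three independent scans.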

Benefits

  • Readability: Each rule function corresponds directly to a numbered rule in the Unicode specification (UAX #29)
  • Maintainability: Easy to understand, modify, and extend
  • Debuggability: Each rule can be tested and traced independently

Conformance

100% conformance maintained on all official Unicode test suites:

  • Grapheme: 766/766 tests passing
  • Word: 1,944/1,944 tests passing
  • Sentence: 512/512 tests passing
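
For context, the official Unicode break test files encode each case as a line of hex code points separated by ÷ (break) and × (no break), with a trailing # comment. The sketch below parses one such line into runes and expected boundary offsets. `parseBreakTestLine` is a hypothetical helper written for illustration, not part of this package's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBreakTestLine parses one line in the Unicode break test format, e.g.
//   ÷ 0020 × 0308 ÷ 0020 ÷	# comment
// It returns the code points and the rune offsets at which a break is expected.
func parseBreakTestLine(line string) (runes []rune, breaks []int, err error) {
	// Drop the trailing comment, if any.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	for _, tok := range strings.Fields(line) {
		switch tok {
		case "÷":
			// A break is expected before the next code point.
			breaks = append(breaks, len(runes))
		case "×":
			// No break expected here; nothing to record.
		default:
			cp, perr := strconv.ParseUint(tok, 16, 32)
			if perr != nil {
				return nil, nil, fmt.Errorf("bad code point %q: %w", tok, perr)
			}
			runes = append(runes, rune(cp))
		}
	}
	return runes, breaks, nil
}

func main() {
	runes, breaks, err := parseBreakTestLine("÷ 0020 × 0308 ÷ 0020 ÷\t# example")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%U\n", runes)
	fmt.Println(breaks) // prints [0 2 3]
}
```

A conformance run would then feed `string(runes)` to the segmenter under test and compare its boundary offsets against `breaks`.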

Installation

go get github.com/SCKelemen/unicode/uax29@v4.0.0