| 1 | +# uax29 - Unicode Text Segmentation |
| 2 | + |
| 3 | +Implementation of [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/) in Go. |
| 4 | + |
| 5 | +**Status:** Implemented with official Unicode test vectors (Unicode 17.0) |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +This package provides algorithms for breaking text into meaningful units: |
| 10 | +- **Grapheme clusters**: User-perceived characters (what users think of as "characters") |
| 11 | +- **Words**: Linguistic word boundaries for text selection and cursor movement |
| 12 | +- **Sentences**: Sentence boundaries for text processing |
| 13 | + |
| 14 | +## Planned Features |
| 15 | + |
| 16 | +### Grapheme Cluster Boundaries |
| 17 | +- Proper handling of combining marks (e.g., `e` + `́` = `é`) |
| 18 | +- Hangul syllable composition |
| 19 | +- Emoji sequences with Zero Width Joiner (ZWJ) |
| 20 | +- Regional indicator sequences (flag emojis) |
| 21 | +- Variation selectors |
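
To illustrate why grapheme clusters matter, here is a minimal sketch that uses only the Go standard library, not this package. The helper `countNonMarks` is a hypothetical name introduced for illustration; it only approximates the "do not break before a combining mark" part of the rules and ignores Hangul, emoji, and regional indicators.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// countNonMarks counts runes that are not in general category M (Mark).
// Hypothetical helper: a rough stand-in for counting grapheme clusters
// when the only multi-rune clusters are base letter + combining marks.
func countNonMarks(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.M, r) {
			n++
		}
	}
	return n
}

func main() {
	s := "e\u0301" // 'e' followed by U+0301 COMBINING ACUTE ACCENT
	// bytes, runes, approximate user-perceived characters
	fmt.Println(len(s), utf8.RuneCountInString(s), countNonMarks(s)) // prints: 3 2 1
}
```

Neither the byte length nor the rune count matches what a user sees as one character, which is the gap the grapheme cluster rules close.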
| 22 | + |
| 23 | +### Word Boundaries |
| 24 | +- Alphabetic and numeric sequences |
| 25 | +- Proper handling of contractions (don't, can't) |
| 26 | +- Punctuation boundaries |
| 27 | +- CJK word segmentation (requires dictionary) |
| 28 | +- Hyphenated words |
| 29 | + |
| 30 | +### Sentence Boundaries |
| 31 | +- Period, question mark, exclamation handling |
| 32 | +- Abbreviation detection (Dr., Mrs., etc.; requires tailoring beyond the default rules) |
| 33 | +- Quote and parenthesis handling |
| 34 | +- Whitespace rules |
| 35 | +- Multiple punctuation handling (e.g., `...`, `?!`) |
| 36 | + |
| 37 | +## Use Cases |
| 38 | + |
| 39 | +- Text editors: cursor movement, selection, deletion |
| 40 | +- Search: tokenization and indexing |
| 41 | +- Natural language processing |
| 42 | +- Text-to-speech: proper phrase boundaries |
| 43 | +- Terminal UIs: text selection and wrapping |
| 44 | + |
| 45 | +## Implementation Status |
| 46 | + |
| 47 | +### Grapheme Cluster Boundaries ✅ (100.0% pass rate - 766/766) 🎉 |
| 48 | +- **COMPLETE** implementation with Unicode 17.0 test vectors |
| 49 | +- Handles combining marks, Hangul syllables, all emoji sequences |
| 50 | +- Regional indicator pairs (flag emojis) working correctly |
| 51 | +- Prepend characters properly supported |
| 52 | +- Emoji modifiers (skin tones) correctly classified as Extend |
| 53 | +- GB11: Emoji ZWJ sequences fully implemented |
| 54 | +- GB9c: Indic conjunct sequences for 10+ scripts (Devanagari, Bengali, Gujarati, Oriya, Telugu, Malayalam, Myanmar, Balinese, Sundanese, Khmer) |
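
The regional indicator pairing mentioned above (rules GB12/GB13) can be sketched in isolation. This is an assumption-laden illustration, not this package's code: `isRegionalIndicator` and `splitFlags` are hypothetical names, and every rune that is not a regional indicator is simply emitted as its own segment.

```go
package main

import "fmt"

// isRegionalIndicator reports whether r lies in U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// splitFlags groups consecutive regional indicators into pairs, so a break
// occurs between two indicators only when an even number precedes it.
// Hypothetical helper: all other grapheme rules are ignored.
func splitFlags(s string) []string {
	var out []string
	runes := []rune(s)
	for i := 0; i < len(runes); {
		if isRegionalIndicator(runes[i]) && i+1 < len(runes) && isRegionalIndicator(runes[i+1]) {
			out = append(out, string(runes[i:i+2])) // one flag
			i += 2
			continue
		}
		out = append(out, string(runes[i]))
		i++
	}
	return out
}

func main() {
	// Regional indicators D, E, F, R: the flags for DE and FR.
	s := "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
	fmt.Println(len(splitFlags(s))) // prints: 2
}
```

An odd run of indicators leaves the last one unpaired, which is why the rule has to count indicators from the start of the run rather than look at adjacent pairs only.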
| 55 | + |
| 56 | +### Word Boundaries ✅ (100.0% pass rate - 1944/1944) 🎉 |
| 57 | +- **COMPLETE** implementation with Unicode 17.0 test vectors |
| 58 | +- Handles all alphabetic/numeric sequences, contractions, punctuation |
| 59 | +- Regional indicator pairs with ZWJ transparency |
| 60 | +- Hebrew letter handling with single/double quotes |
| 61 | +- Katakana sequences and ExtendNumLet |
| 62 | +- Emoji sequences with modifiers and ZWJ |
| 63 | +- Proper handling of Format character exceptions |
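
As a rough sketch of the contraction handling listed above (rules WB6/WB7), the following ASCII-only splitter keeps an apostrophe inside a word when letters surround it. It is a simplification under stated assumptions, not this package's implementation: `splitWords` and its helpers are hypothetical names, input is assumed to be ASCII, and rules for numbers with separators, Katakana, Hebrew, emoji, and Format characters are left out.

```go
package main

import "fmt"

func isLetter(c byte) bool { return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' }
func isDigit(c byte) bool  { return c >= '0' && c <= '9' }
func isAlnum(c byte) bool  { return isLetter(c) || isDigit(c) }

// splitWords splits ASCII text into runs of letters/digits and single
// non-alphanumeric bytes. An apostrophe between two letters does not break
// the word (WB6/WB7). Hypothetical helper for illustration only.
func splitWords(s string) []string {
	var out []string
	for i := 0; i < len(s); {
		j := i + 1
		if isAlnum(s[i]) {
			for j < len(s) {
				if isAlnum(s[j]) {
					j++
					continue
				}
				if s[j] == '\'' && isLetter(s[j-1]) && j+1 < len(s) && isLetter(s[j+1]) {
					j += 2 // keep the apostrophe and the following letter
					continue
				}
				break
			}
		}
		out = append(out, s[i:j])
		i = j
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitWords("Hello, world!")) // prints: ["Hello" "," " " "world" "!"]
	fmt.Printf("%q\n", splitWords("don't stop"))    // prints: ["don't" " " "stop"]
}
```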
| 64 | + |
| 65 | +### Sentence Boundaries ✅ (100.0% pass rate - 512/512) 🎉 |
| 66 | +- **COMPLETE** implementation with Unicode 17.0 test vectors |
| 67 | +- Handles all sentence terminators (., ?, !, and many script-specific terminators) |
| 68 | +- ATerm handling per SB6-SB8 (e.g., no break inside `3.14` or `U.S`, or before a lowercase continuation) |
| 69 | +- Complex Close* Sp* sequences correctly processed |
| 70 | +- SB8: Lowercase handling after ATerm Close* Sp* |
| 71 | +- SB8a: SContinue and sentence terminal sequences |
| 72 | +- SB9/SB10: Close and space handling after terminators |
| 73 | +- SB11: Breaking after sentence terminal sequences |
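
The interplay of SB8, SB8a, and SB9 to SB11 can be approximated for plain ASCII text. The sketch below is not this package's code: `splitSentences` and `isTerminator` are hypothetical names, and the function deliberately ignores Close characters, paragraph separators, SB6/SB7 (so `3.14` would be split), and all non-ASCII input.

```go
package main

import "fmt"

// isTerminator reports whether c is an ASCII sentence terminator.
func isTerminator(c byte) bool { return c == '.' || c == '?' || c == '!' }

// splitSentences is an ASCII-only approximation of SB8, SB8a and SB9-SB11.
// Hypothetical helper for illustration only.
func splitSentences(s string) []string {
	var out []string
	start := 0
	for i := 0; i < len(s); i++ {
		if !isTerminator(s[i]) {
			continue
		}
		j := i + 1
		for j < len(s) && s[j] == ' ' { // trailing spaces stay with the sentence
			j++
		}
		if j == len(s) {
			break // end of text: the tail is emitted below
		}
		if isTerminator(s[j]) {
			continue // SB8a: no break before another terminator, as in "?!"
		}
		if s[i] == '.' && s[j] >= 'a' && s[j] <= 'z' {
			continue // SB8 (simplified): a lowercase letter continues the sentence
		}
		out = append(out, s[start:j]) // SB11: break after terminator and spaces
		start = j
		i = j - 1
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitSentences("Hello Dr. Smith. How are you?"))
	// prints: ["Hello Dr. " "Smith. " "How are you?"]
}
```

Because an uppercase letter follows `Dr. `, the rules place a boundary there; recognizing abbreviations needs tailoring on top of the default rules.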
| 74 | + |
| 75 | +## Examples (Planned) |
| 76 | + |
| 77 | +```go |
| 78 | +// Grapheme clusters |
| 79 | +family := "👨\u200d👩\u200d👧\u200d👦" // Family emoji: four emoji joined by ZWJ (U+200D) |
| 80 | +graphemes := uax29.Graphemes(family) |
| 81 | +// Returns 1 grapheme cluster |
| 82 | + |
| 83 | +// Words |
| 84 | +greeting := "Hello, world!" |
| 85 | +words := uax29.Words(greeting) |
| 86 | +// Returns: ["Hello", ",", " ", "world", "!"] |
| 87 | + |
| 88 | +// Sentences |
| 89 | +text := "Hello Dr. Smith. How are you?" |
| 90 | +sentences := uax29.Sentences(text) |
| 91 | +// Returns: ["Hello Dr. ", "Smith. ", "How are you?"] (default rules break after "Dr. "; abbreviations need tailoring) |
| 92 | +``` |
| 93 | + |
| 94 | +## Dependencies |
| 95 | + |
| 96 | +This package depends on: |
| 97 | +- **UTS #51 (Unicode Emoji)**: Provides authoritative emoji property data (Extended_Pictographic, Regional_Indicator, Emoji_Modifier, ZeroWidthJoiner constant) |
| 98 | + |
| 99 | +## Integration with Other Standards |
| 100 | + |
| 101 | +- **UTS #51 (Unicode Emoji)**: Emoji sequences are treated as single grapheme clusters per UAX #29 GB11 |
| 102 | +- **UAX #14 (Line Breaking)**: UAX #29 word boundaries inform line break decisions |
| 103 | +- **UAX #9 (Bidirectional)**: Both needed for proper text layout |
| 104 | +- **UAX #11 (East Asian Width)**: Terminal cursor movement should respect grapheme boundaries |
| 105 | + |
| 106 | +## References |
| 107 | + |
| 108 | +- [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/) |
| 109 | +- [Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) |
| 110 | +- [Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries) |
| 111 | +- [Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries) |
| 112 | + |
| 113 | +## License |
| 114 | + |
| 115 | +MIT |