Commit 79703da

Add UAX #29 (Text Segmentation)
Implements UAX #29 for breaking text into grapheme clusters, words, and sentences. Essential for text editors, search engines, and NLP applications.

Dependencies:
- UTS #51 (Unicode Emoji)

Key features:
- Grapheme cluster boundaries (user-perceived characters)
- Word boundaries for text selection and cursor movement
- Sentence boundaries for text processing
- Emoji ZWJ sequences and regional indicators
- Indic conjunct sequences for 10+ scripts
- 100% conformance (3,222/3,222 test cases passing)
- Unicode 17.0.0 support
1 parent f7b39d3 commit 79703da

17 files changed: 11,379 additions, 0 deletions

README.md

Lines changed: 34 additions & 0 deletions
@@ -184,6 +184,40 @@ for i := 1; i < len(breaks); i++ {
[Full Documentation →](./uax14/README.md)

### UAX #29: Text Segmentation (`uax29`)

[![Go Reference](https://pkg.go.dev/badge/github.com/SCKelemen/unicode/uax29.svg)](https://pkg.go.dev/github.com/SCKelemen/unicode/uax29)

Implementation of [UAX #29 (Unicode Text Segmentation)](https://www.unicode.org/reports/tr29/) for breaking text into grapheme clusters, words, and sentences.

**Dependencies:** UTS #51 (Unicode Emoji)

**Key Features:**
- Grapheme cluster boundaries (user-perceived characters)
- Word boundaries (text selection, cursor movement)
- Sentence boundaries (text processing)
- Emoji sequence handling with ZWJ
- Regional indicator sequences (flag emojis)
- Indic conjunct sequences
- 100% conformance (3,222/3,222 test cases passing)

**Quick Example:**
```go
package main

import (
	"fmt"

	"github.com/SCKelemen/unicode/uax29"
)

func main() {
	// Break text into grapheme clusters
	text := "👨‍👩‍👧‍👦 Hello"
	clusters := uax29.Graphemes(text)

	// Find word boundaries
	words := uax29.Words("Hello, world!")

	// Segment sentences
	sentences := uax29.Sentences("Hello Dr. Smith. How are you?")

	fmt.Println(clusters, words, sentences)
}
```

[Full Documentation →](./uax29/README.md)

## References

### Metastandards

uax29/GraphemeBreakTest.txt

Lines changed: 796 additions & 0 deletions
Large diffs are not rendered by default.

uax29/README.md

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# uax29 - Unicode Text Segmentation

Implementation of [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/) in Go.

**Status:** Implemented with official Unicode test vectors (Unicode 17.0)

## Overview

This package provides algorithms for breaking text into meaningful units:
- **Grapheme clusters**: User-perceived characters (what users think of as "characters")
- **Words**: Linguistic word boundaries for text selection and cursor movement
- **Sentences**: Sentence boundaries for text processing

## Features

### Grapheme Cluster Boundaries
- Proper handling of combining marks (e.g., `e` + `́` = `é`)
- Hangul syllable composition
- Emoji sequences with Zero Width Joiner (ZWJ)
- Regional indicator sequences (flag emojis)
- Variation selectors
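
The combining-mark case can be illustrated with the standard library alone. The sketch below is not this package's implementation: `attachMarks` is a hypothetical helper that approximates only the Extend part of rule GB9 by gluing runes of general category Mn or Me onto the preceding rune. It ignores Hangul, emoji, ZWJ, and every other rule.

```go
package main

import (
	"fmt"
	"unicode"
)

// attachMarks is a hypothetical helper, not part of uax29. It groups each
// combining mark (general category Mn or Me) with the rune before it and
// leaves every other rune as its own segment.
func attachMarks(s string) []string {
	var clusters []string
	for _, r := range s {
		isMark := unicode.In(r, unicode.Mn, unicode.Me)
		if isMark && len(clusters) > 0 {
			// A mark never starts a new cluster when something precedes it.
			clusters[len(clusters)-1] += string(r)
			continue
		}
		clusters = append(clusters, string(r))
	}
	return clusters
}

func main() {
	// "e" followed by U+0301 (combining acute accent), then "a".
	for _, c := range attachMarks("e\u0301a") {
		fmt.Printf("%q has %d code point(s)\n", c, len([]rune(c)))
	}
}
```

A real implementation works from the Grapheme_Cluster_Break property rather than general categories, so this should be read only as an illustration of why `e` + `́` stays together.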

### Word Boundaries
- Alphabetic and numeric sequences
- Proper handling of contractions (don't, can't)
- Punctuation boundaries
- CJK word segmentation (requires dictionary)
- Hyphenated words
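
As a rough illustration of how a few of the word rules interact, here is a standalone sketch restricted to ASCII input. It is an assumption-laden approximation, not the package's code: `splitWords` and its helpers are hypothetical names, and only WB3d, WB5 through WB10 (without the MidNum cases), and the apostrophe case of WB6/WB7 are modeled.

```go
package main

import "fmt"

// isAlpha and isAlnum are deliberately ASCII-only to keep the sketch small.
func isAlpha(r rune) bool {
	return (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
}

func isAlnum(r rune) bool {
	return isAlpha(r) || (r >= '0' && r <= '9')
}

// noBreak reports whether the boundary between rs[i-1] and rs[i] is suppressed.
func noBreak(rs []rune, i int) bool {
	prev, cur := rs[i-1], rs[i]
	switch {
	case isAlnum(prev) && isAlnum(cur): // WB5, WB8, WB9, WB10
		return true
	case prev == ' ' && cur == ' ': // WB3d: keep runs of spaces together
		return true
	case cur == '\'' && isAlpha(prev) && i+1 < len(rs) && isAlpha(rs[i+1]): // WB6
		return true
	case prev == '\'' && isAlpha(cur) && i >= 2 && isAlpha(rs[i-2]): // WB7
		return true
	}
	return false
}

// splitWords is a hypothetical helper that cuts s at every position
// where noBreak does not suppress the boundary.
func splitWords(s string) []string {
	rs := []rune(s)
	var out []string
	start := 0
	for i := 1; i <= len(rs); i++ {
		if i < len(rs) && noBreak(rs, i) {
			continue
		}
		out = append(out, string(rs[start:i]))
		start = i
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitWords("Hello, world!"))
	fmt.Printf("%q\n", splitWords("don't stop"))
}
```

The full algorithm is driven by the Word_Break property and also covers Format characters, Katakana, Hebrew letters, and emoji, none of which appear here.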

### Sentence Boundaries
- Period, question mark, exclamation handling
- Abbreviation handling (Dr., Mrs., etc.), which the default UAX #29 rules do not provide and which requires tailoring
- Quote and parenthesis handling
- Whitespace rules
- Multiple punctuation handling (e.g., `...`, `?!`)
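
To make the period handling concrete, the sketch below approximates rules SB8 and SB11 for plain ASCII text. It is a simplification written for illustration, not the package's implementation: `splitSentences` is a hypothetical helper, it looks only at the single rune after the spaces, and it never breaks when no space follows the terminator.

```go
package main

import (
	"fmt"
	"unicode"
)

// splitSentences is a hypothetical helper. It breaks after a terminator
// (., ?, !) and any spaces that follow it (roughly SB11), unless the
// terminator is a period and the next rune is lowercase (roughly SB8).
func splitSentences(s string) []string {
	rs := []rune(s)
	var out []string
	start := 0
	for i := 0; i < len(rs); i++ {
		if rs[i] != '.' && rs[i] != '?' && rs[i] != '!' {
			continue
		}
		// Skip the spaces after the terminator; they belong to this sentence.
		j := i + 1
		for j < len(rs) && rs[j] == ' ' {
			j++
		}
		if j == i+1 && j < len(rs) {
			continue // simplification: no break without a following space
		}
		if rs[i] == '.' && j < len(rs) && unicode.IsLower(rs[j]) {
			continue // a lowercase letter follows, so the sentence continues
		}
		out = append(out, string(rs[start:j]))
		start = j
		i = j - 1
	}
	if start < len(rs) {
		out = append(out, string(rs[start:]))
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitSentences("Hello Dr. Smith. How are you?"))
	fmt.Printf("%q\n", splitSentences("See fig. two. Done."))
}
```

Because "Smith" starts with an uppercase letter, the sketch breaks after "Dr. " just as the untailored rules do; recognizing such abbreviations needs extra data beyond the rules themselves.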

## Use Cases

- Text editors: cursor movement, selection, deletion
- Search: tokenization and indexing
- Natural language processing
- Text-to-speech: proper phrase boundaries
- Terminal UIs: text selection and wrapping

## Implementation Status

### Grapheme Cluster Boundaries ✅ (100.0% pass rate - 766/766) 🎉
- **COMPLETE** implementation with Unicode 17.0 test vectors
- Handles combining marks, Hangul syllables, all emoji sequences
- Regional indicator pairs (flag emojis) working correctly
- Prepend characters properly supported
- Emoji modifiers (skin tones) correctly classified as Extend
- GB11: Emoji ZWJ sequences fully implemented
- GB9c: Indic conjunct sequences for 10+ scripts (Devanagari, Bengali, Gujarati, Oriya, Telugu, Malayalam, Myanmar, Balinese, Sundanese, Khmer)
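
The regional-indicator pairing (rules GB12 and GB13) is easy to get wrong, so a small standalone sketch may help. It is not taken from this package: `pairFlags` and `isRI` are hypothetical names, and every rune that is not a regional indicator is simply emitted on its own.

```go
package main

import "fmt"

// isRI reports whether r is a Regional Indicator symbol (U+1F1E6..U+1F1FF).
func isRI(r rune) bool { return r >= 0x1F1E6 && r <= 0x1F1FF }

// pairFlags is a hypothetical helper that groups regional indicators two
// at a time, so four indicators in a row form two flags rather than one
// long cluster.
func pairFlags(s string) []string {
	var out []string
	open := false // true when the last segment is a single, unpaired indicator
	for _, r := range s {
		if isRI(r) && open {
			out[len(out)-1] += string(r) // complete the pair
			open = false
			continue
		}
		out = append(out, string(r))
		open = isRI(r)
	}
	return out
}

func main() {
	// Regional indicators U S G B: two flags back to back.
	flags := "\U0001F1FA\U0001F1F8\U0001F1EC\U0001F1E7"
	fmt.Println(len(pairFlags(flags)), "segments")
}
```

The key point is that the break decision depends on whether an odd or even number of indicators precedes the current position, which is why a simple "never break between two indicators" check is not enough.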

### Word Boundaries ✅ (100.0% pass rate - 1944/1944) 🎉
- **COMPLETE** implementation with Unicode 17.0 test vectors
- Handles all alphabetic/numeric sequences, contractions, punctuation
- Regional indicator pairs with ZWJ transparency
- Hebrew letter handling with single/double quotes
- Katakana sequences and ExtendNumLet
- Emoji sequences with modifiers and ZWJ
- Proper handling of Format character exceptions
64+
65+
### Sentence Boundaries ✅ (100.0% pass rate - 512/512) 🎉
66+
- **COMPLETE** implementation with Unicode 17.0 test vectors
67+
- Handles all sentence terminators (., ?, !, and many script-specific terminators)
68+
- Proper handling of abbreviations with ATerm
69+
- Complex Close* Sp* sequences correctly processed
70+
- SB8: Lowercase handling after ATerm Close* Sp*
71+
- SB8a: SContinue and sentence terminal sequences
72+
- SB9/SB10: Close and space handling after terminators
73+
- SB11: Breaking after sentence terminal sequences
74+
75+
## Examples (Planned)
76+
77+
```go
78+
// Grapheme clusters
79+
text := "👨‍👩‍👧‍👦" // Family emoji (multiple codepoints)
80+
graphemes := uax29.Graphemes(text)
81+
// Returns 1 grapheme cluster
82+
83+
// Words
84+
text := "Hello, world!"
85+
words := uax29.Words(text)
86+
// Returns: ["Hello", ",", " ", "world", "!"]
87+
88+
// Sentences
89+
text := "Hello Dr. Smith. How are you?"
90+
sentences := uax29.Sentences(text)
91+
// Returns: ["Hello Dr. Smith. ", "How are you?"]
92+
```
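
For context on why the family emoji is a useful test case, the following standard-library sketch counts its bytes and code points. It does not call this package; `counts` is a hypothetical helper, and the string is spelled out with escapes so the ZWJ (U+200D) joiners are visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// family is man, woman, girl, boy joined by three ZWJ code points.
const family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

// counts is a hypothetical helper returning the UTF-8 byte length and the
// number of code points in s.
func counts(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts(family)
	fmt.Println("bytes:", b)       // prints "bytes: 25"
	fmt.Println("code points:", r) // prints "code points: 7"
}
```

Neither number matches what a user sees as one character, which is the gap grapheme cluster segmentation is meant to close.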
93+
94+
## Dependencies
95+
96+
This package depends on:
97+
- **UTS #51 (Unicode Emoji)**: Provides authoritative emoji property data (Extended_Pictographic, Regional_Indicator, Emoji_Modifier, ZeroWidthJoiner constant)
98+
99+
## Integration with Other Standards
100+
101+
- **UTS #51 (Unicode Emoji)**: Emoji sequences are treated as single grapheme clusters per UAX #29 GB11
102+
- **UAX #14 (Line Breaking)**: UAX #29 word boundaries inform line break decisions
103+
- **UAX #9 (Bidirectional)**: Both needed for proper text layout
104+
- **UAX #11 (East Asian Width)**: Terminal cursor movement should respect grapheme boundaries
105+
106+
## References
107+
108+
- [UAX #29: Unicode Text Segmentation](https://www.unicode.org/reports/tr29/)
109+
- [Grapheme Cluster Boundaries](https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
110+
- [Word Boundaries](https://www.unicode.org/reports/tr29/#Word_Boundaries)
111+
- [Sentence Boundaries](https://www.unicode.org/reports/tr29/#Sentence_Boundaries)
112+
113+
## License
114+
115+
MIT
