BasicTokenizer: Multiple tokenization issues (hyphens, ordinals, number formatting, Korean script) #236

@JoeKarow

Description

The BasicTokenizer has several edge cases that result in incorrect tokenization behavior. These issues affect real-world social media text processing.

Issues Discovered

1. Hyphens are not preserved in compound words

Current Behavior:

tokenizer.tokenize("self-aware")
# Result: ["self", "aware"]  # Hyphen lost

Expected Behavior:

# Result: ["self-aware"]  # Hyphen preserved

Root Cause:
LATIN_WORD_PATTERN in services/tokenizer/basic/patterns.py:102 only handles periods and apostrophes:

LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?|[a-zA-Z]+(?:\'[a-zA-Z]+)*"

Missing: hyphen support for compound words like "self-aware", "co-founder", "twenty-one"
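A minimal sketch of one way the pattern could be extended (the alternative ordering and the combined `[-']` class are assumptions, not the actual fix in `patterns.py`):

```python
import re

# Hypothetical extension of LATIN_WORD_PATTERN: the (?:[-'][a-zA-Z]+)+ branch
# keeps hyphenated and apostrophe compounds together. Alternatives are ordered
# longest-first so compounds are tried before the bare-word fallback.
LATIN_WORD_PATTERN = (
    r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?"   # abbreviations like "e.g."
    r"|[a-zA-Z]+(?:[-'][a-zA-Z]+)+"   # compounds like "self-aware", "isn't"
    r"|[a-zA-Z]+"                     # plain words
)

print(re.findall(LATIN_WORD_PATTERN, "self-aware co-founder isn't"))
# → ['self-aware', 'co-founder', "isn't"]
```

One caveat with this approach: it would also glue constructs like "well-known-but-rare" into a single token, which may or may not be desirable.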


2. Ordinal numbers lose their numeric prefix

Current Behavior:

tokenizer.tokenize("6th amendment right")
# Result: ["th", "amendment", "right"]  # "6" is lost

Expected Behavior:

# Result: ["6th", "amendment", "right"]

Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 includes ordinals:

r"\d+(?:st|nd|rd|th)?"  # Ordinals

However, pattern matching priority appears to be causing the split: either an earlier alternative consumes the "6" so that LATIN_WORD_PATTERN matches "th" separately, or a postprocessing filter drops the numeric prefix.

Investigation needed: Check pattern order in get_comprehensive_pattern() and verify ordinal tokens aren't being split during postprocessing.
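A minimal sketch of one way the reported split could arise (simplified alternatives, not the real `get_comprehensive_pattern()`): Python's regex alternation is leftmost-first, not longest-match, so a plain `\d+` listed before the ordinal alternative eats the "6" and strands "th".

```python
import re

# "wrong" lists plain \d+ before the ordinal alternative; "right" reverses it.
# If numeric tokens are then filtered in postprocessing, only "th" survives,
# matching the reported ["th", "amendment", "right"].
wrong = re.compile(r"\d+|\d+(?:st|nd|rd|th)|[a-zA-Z]+")
right = re.compile(r"\d+(?:st|nd|rd|th)|\d+|[a-zA-Z]+")

print(wrong.findall("6th amendment"))  # ['6', 'th', 'amendment']
print(right.findall("6th amendment"))  # ['6th', 'amendment']
```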


3. Large numbers with separators are truncated

Current Behavior:

tokenizer.tokenize("200,000 Fraudulent Ballots")
# Result: ["000", "fraudulent", "ballots"]  # "200," is lost

Expected Behavior:

# Result: ["200,000", "fraudulent", "ballots"]
# Or: ["200000", "fraudulent", "ballots"]  # Normalized

Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 has:

r"\d+[.,]\d+|"  # Numbers with comma/period separators

This pattern matches only one separator pair (e.g., "123,456" or "3.14"); a number with multiple separators, like "1,234,567.89", cannot match in full. Note that "200,000" contains only a single separator and should match this alternative, so pattern ordering (see issue 2) is likely also involved.

Fix needed: Update pattern to handle multiple thousand separators:

r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?|"  # Thousand separators + optional decimal
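A quick check of the proposed alternative (assuming it is placed before the single-separator and plain `\d+` alternatives, which is not yet confirmed against the real pattern order):

```python
import re

# Proposed thousand-separator alternative, tried first, with the existing
# single-separator and plain-digit alternatives as fallbacks.
NUM = re.compile(r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?|\d+[.,]\d+|\d+")

print(NUM.findall("200,000 ballots and 1,234,567.89 total"))
# → ['200,000', '1,234,567.89']
```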

4. Korean text is broken into characters instead of words

Current Behavior:

tokenizer.tokenize("안녕하세요 세계")  # "Hello world" in Korean
# Result: ["안", "녕", "하", "세", "요", "세", "계"]  # Character-level

Expected Behavior:

# Result: ["안녕하세요", "세계"]  # Space-separated words

Root Cause:
Korean (Hangul) is incorrectly classified as a character-level script in BasicTokenizer._is_char_level_script() (tokenizer.py:75-88):

def _is_char_level_script(self, char: str) -> bool:
    code_point = ord(char)
    return (
        ...
        or (0xAC00 <= code_point <= 0xD7AF)  # Hangul Syllables ❌ WRONG!
        ...
    )

Issue: Korean (Hangul) is a space-separated language like English and Arabic, NOT a scriptio continua language like Chinese/Japanese/Thai.

Fix needed:

  1. Remove Hangul from _is_char_level_script()
  2. Add Hangul detection to space-separated word tokenization logic
  3. Update _get_char_script() to return "korean" or "hangul" for proper handling
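The steps above can be sketched as follows; `is_char_level_script` and `get_char_script` here are simplified stand-ins for the real methods in `tokenizer.py`, and the Unicode ranges shown are only the main blocks, not the full set the tokenizer checks:

```python
# Sketch of the proposed change: treat Hangul as a space-separated script.
def is_char_level_script(char: str) -> bool:
    cp = ord(char)
    # CJK ideographs stay character-level; the Hangul Syllables range
    # (0xAC00-0xD7AF) is deliberately no longer listed here.
    return 0x4E00 <= cp <= 0x9FFF

def get_char_script(char: str) -> str:
    cp = ord(char)
    if 0xAC00 <= cp <= 0xD7AF:
        return "hangul"   # space-separated; tokenize like Latin words
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"      # character-level
    return "other"

# Once reclassified, Hangul text tokenizes on whitespace:
print("안녕하세요 세계".split())  # ['안녕하세요', '세계']
```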

Impact

Affected Files

  1. services/tokenizer/basic/patterns.py - Pattern definitions
  2. services/tokenizer/basic/tokenizer.py - Script detection logic
  3. services/tokenizer/core/types.py - Language family definitions (may need KOREAN added)

Proposed Solution

Priority 1 (Critical): Korean Script Fix

  1. Remove Hangul from character-level script detection
  2. Add Korean/Hangul to space-separated language handling
  3. Add test cases for Korean tokenization

Priority 2 (High): Number and Ordinal Fixes

  1. Fix numeric pattern to support multiple thousand separators
  2. Verify ordinal pattern matching order and priority
  3. Add test cases for large numbers and ordinals

Priority 3 (Medium): Hyphen Support

  1. Add hyphen support to LATIN_WORD_PATTERN
  2. Ensure hyphens are preserved in compound words
  3. Add test cases for hyphenated words

Test Cases Needed

# Korean (space-separated)
def test_korean_space_separated():
    tokenizer = BasicTokenizer()
    text = "안녕하세요 세계"
    result = tokenizer.tokenize(text)
    expected = ["안녕하세요", "세계"]
    assert result == expected

# Hyphens
def test_hyphenated_words():
    tokenizer = BasicTokenizer()
    text = "self-aware co-founder"
    result = tokenizer.tokenize(text)
    expected = ["self-aware", "co-founder"]
    assert result == expected

# Ordinals
def test_ordinal_numbers():
    tokenizer = BasicTokenizer()
    text = "6th amendment 21st century"
    result = tokenizer.tokenize(text)
    expected = ["6th", "amendment", "21st", "century"]
    assert result == expected

# Large numbers with separators
def test_large_numbers_with_separators():
    tokenizer = BasicTokenizer()
    text = "200,000 fraudulent ballots and 1,234,567 votes"
    result = tokenizer.tokenize(text)
    expected = ["200,000", "fraudulent", "ballots", "and", "1,234,567", "votes"]
    assert result == expected

Additional Context

These issues were discovered during real-world usage with social media data containing:

  • Political discussion (ordinals, large numbers)
  • Korean language content (multilingual support)
  • Technical terminology (hyphenated compounds)

References

  • Code locations identified in investigation
  • Existing test suite: services/tokenizer/basic/test_basic_tokenizer.py
  • Pattern definitions: services/tokenizer/basic/patterns.py:41-108
  • Script detection: services/tokenizer/basic/tokenizer.py:75-124
