BasicTokenizer: Multiple tokenization issues (hyphens, ordinals, number formatting, Korean script) #236

@JoeKarow

Description

The BasicTokenizer has several edge cases that result in incorrect tokenization behavior. These issues affect real-world social media text processing.

Issues Discovered

1. Hyphens are not preserved in compound words

Current Behavior:

tokenizer.tokenize("self-aware")
# Result: ["self", "aware"]  # Hyphen lost

Expected Behavior:

# Result: ["self-aware"]  # Hyphen preserved

Root Cause:
LATIN_WORD_PATTERN in services/tokenizer/basic/patterns.py:102 only handles periods and apostrophes:

LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?|[a-zA-Z]+(?:\'[a-zA-Z]+)*"

Missing: hyphen support for compound words like "self-aware", "co-founder", "twenty-one"
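A minimal sketch of one way the pattern could be extended (the alternative ordering and the combined `[-']` class are assumptions, not the actual fix in `patterns.py`):

```python
import re

# Hypothetical extension of LATIN_WORD_PATTERN: the (?:[-'][a-zA-Z]+)+ branch
# keeps hyphenated and apostrophe compounds together. Alternatives are ordered
# longest-first so compounds are tried before the bare-word fallback.
LATIN_WORD_PATTERN = (
    r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?"   # abbreviations like "e.g."
    r"|[a-zA-Z]+(?:[-'][a-zA-Z]+)+"   # compounds like "self-aware", "isn't"
    r"|[a-zA-Z]+"                     # plain words
)

print(re.findall(LATIN_WORD_PATTERN, "self-aware co-founder isn't"))
# → ['self-aware', 'co-founder', "isn't"]
```

One caveat with this approach: it would also glue constructs like "well-known-but-rare" into a single token, which may or may not be desirable.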


2. Ordinal numbers lose their numeric prefix

Current Behavior:

tokenizer.tokenize("6th amendment right")
# Result: ["th", "amendment", "right"]  # "6" is lost

Expected Behavior:

# Result: ["6th", "amendment", "right"]

Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 includes ordinals:

r"\d+(?:st|nd|rd|th)?"  # Ordinals

However, pattern matching priority appears to be causing the split: either an earlier alternative consumes the "6" so that LATIN_WORD_PATTERN matches "th" separately, or a postprocessing filter drops the numeric prefix.

Investigation needed: Check pattern order in get_comprehensive_pattern() and verify ordinal tokens aren't being split during postprocessing.
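A minimal sketch of one way the reported split could arise (simplified alternatives, not the real `get_comprehensive_pattern()`): Python's regex alternation is leftmost-first, not longest-match, so a plain `\d+` listed before the ordinal alternative eats the "6" and strands "th".

```python
import re

# "wrong" lists plain \d+ before the ordinal alternative; "right" reverses it.
# If numeric tokens are then filtered in postprocessing, only "th" survives,
# matching the reported ["th", "amendment", "right"].
wrong = re.compile(r"\d+|\d+(?:st|nd|rd|th)|[a-zA-Z]+")
right = re.compile(r"\d+(?:st|nd|rd|th)|\d+|[a-zA-Z]+")

print(wrong.findall("6th amendment"))  # ['6', 'th', 'amendment']
print(right.findall("6th amendment"))  # ['6th', 'amendment']
```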


3. Large numbers with separators are truncated

Current Behavior:

tokenizer.tokenize("200,000 Fraudulent Ballots")
# Result: ["000", "fraudulent", "ballots"]  # "200," is lost

Expected Behavior:

# Result: ["200,000", "fraudulent", "ballots"]
# Or: ["200000", "fraudulent", "ballots"]  # Normalized

Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 has:

r"\d+[.,]\d+|"  # Numbers with comma/period separators

This pattern matches only one separator pair (e.g., "123,456" or "3.14"); a number with multiple separators, like "1,234,567.89", cannot match in full. Note that "200,000" contains only a single separator and should match this alternative, so pattern ordering (see issue 2) is likely also involved.

Fix needed: Update pattern to handle multiple thousand separators:

r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?|"  # Thousand separators + optional decimal
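A quick check of the proposed alternative (assuming it is placed before the single-separator and plain `\d+` alternatives, which is not yet confirmed against the real pattern order):

```python
import re

# Proposed thousand-separator alternative, tried first, with the existing
# single-separator and plain-digit alternatives as fallbacks.
NUM = re.compile(r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?|\d+[.,]\d+|\d+")

print(NUM.findall("200,000 ballots and 1,234,567.89 total"))
# → ['200,000', '1,234,567.89']
```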

4. Korean text is broken into characters instead of words

Current Behavior:

tokenizer.tokenize("안녕하세요 세계")  # "Hello world" in Korean
# Result: ["안", "녕", "하", "세", "요", "세", "계"]  # Character-level

Expected Behavior:

# Result: ["안녕하세요", "세계"]  # Space-separated words

Root Cause:
Korean (Hangul) is incorrectly classified as a character-level script in BasicTokenizer._is_char_level_script() (tokenizer.py:75-88):

def _is_char_level_script(self, char: str) -> bool:
    code_point = ord(char)
    return (
        ...
        or (0xAC00 <= code_point <= 0xD7AF)  # Hangul Syllables ❌ WRONG!
        ...
    )

Issue: Korean (Hangul) is a space-separated language like English and Arabic, NOT a scriptio continua language like Chinese/Japanese/Thai.

Fix needed:

  1. Remove Hangul from _is_char_level_script()
  2. Add Hangul detection to space-separated word tokenization logic
  3. Update _get_char_script() to return "korean" or "hangul" for proper handling
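The steps above can be sketched as follows; `is_char_level_script` and `get_char_script` here are simplified stand-ins for the real methods in `tokenizer.py`, and the Unicode ranges shown are only the main blocks, not the full set the tokenizer checks:

```python
# Sketch of the proposed change: treat Hangul as a space-separated script.
def is_char_level_script(char: str) -> bool:
    cp = ord(char)
    # CJK ideographs stay character-level; the Hangul Syllables range
    # (0xAC00-0xD7AF) is deliberately no longer listed here.
    return 0x4E00 <= cp <= 0x9FFF

def get_char_script(char: str) -> str:
    cp = ord(char)
    if 0xAC00 <= cp <= 0xD7AF:
        return "hangul"   # space-separated; tokenize like Latin words
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"      # character-level
    return "other"

# Once reclassified, Hangul text tokenizes on whitespace:
print("안녕하세요 세계".split())  # ['안녕하세요', '세계']
```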

Impact

Affected Files

  1. services/tokenizer/basic/patterns.py - Pattern definitions
  2. services/tokenizer/basic/tokenizer.py - Script detection logic
  3. services/tokenizer/core/types.py - Language family definitions (may need KOREAN added)

Proposed Solution

Priority 1 (Critical): Korean Script Fix

  1. Remove Hangul from character-level script detection
  2. Add Korean/Hangul to space-separated language handling
  3. Add test cases for Korean tokenization

Priority 2 (High): Number and Ordinal Fixes

  1. Fix numeric pattern to support multiple thousand separators
  2. Verify ordinal pattern matching order and priority
  3. Add test cases for large numbers and ordinals

Priority 3 (Medium): Hyphen Support

  1. Add hyphen support to LATIN_WORD_PATTERN
  2. Ensure hyphens are preserved in compound words
  3. Add test cases for hyphenated words

Test Cases Needed

# Korean (space-separated)
def test_korean_space_separated():
    tokenizer = BasicTokenizer()
    text = "안녕하세요 세계"
    result = tokenizer.tokenize(text)
    expected = ["안녕하세요", "세계"]
    assert result == expected

# Hyphens
def test_hyphenated_words():
    tokenizer = BasicTokenizer()
    text = "self-aware co-founder"
    result = tokenizer.tokenize(text)
    expected = ["self-aware", "co-founder"]
    assert result == expected

# Ordinals
def test_ordinal_numbers():
    tokenizer = BasicTokenizer()
    text = "6th amendment 21st century"
    result = tokenizer.tokenize(text)
    expected = ["6th", "amendment", "21st", "century"]
    assert result == expected

# Large numbers with separators
def test_large_numbers_with_separators():
    tokenizer = BasicTokenizer()
    text = "200,000 fraudulent ballots and 1,234,567 votes"
    result = tokenizer.tokenize(text)
    expected = ["200,000", "fraudulent", "ballots", "and", "1,234,567", "votes"]
    assert result == expected

Additional Context

These issues were discovered during real-world usage with social media data containing:

  • Political discussion (ordinals, large numbers)
  • Korean language content (multilingual support)
  • Technical terminology (hyphenated compounds)

References

  • Code locations identified in investigation
  • Existing test suite: services/tokenizer/basic/test_basic_tokenizer.py
  • Pattern definitions: services/tokenizer/basic/patterns.py:41-108
  • Script detection: services/tokenizer/basic/tokenizer.py:75-124
