Description
The BasicTokenizer has several edge cases that result in incorrect tokenization behavior. These issues affect real-world social media text processing.
Issues Discovered
1. Hyphens are not preserved in compound words
Current Behavior:
```python
tokenizer.tokenize("self-aware")
# Result: ["self", "aware"]  # Hyphen lost
```
Expected Behavior:
```python
# Result: ["self-aware"]  # Hyphen preserved
```
Root Cause:
LATIN_WORD_PATTERN in services/tokenizer/basic/patterns.py:102 only handles periods and apostrophes:
```python
LATIN_WORD_PATTERN = r"[a-zA-Z]+(?:\.[a-zA-Z]+)+\.?|[a-zA-Z]+(?:\'[a-zA-Z]+)*"
```
Missing: hyphen support for compound words like "self-aware", "co-founder", and "twenty-one".
2. Ordinal numbers lose their numeric prefix
Current Behavior:
```python
tokenizer.tokenize("6th amendment right")
# Result: ["th", "amendment", "right"]  # "6" is lost
```
Expected Behavior:
```python
# Result: ["6th", "amendment", "right"]
```
Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 includes ordinals:
```python
r"\d+(?:st|nd|rd|th)?"  # Ordinals
```
However, pattern matching priority may be causing issues: LATIN_WORD_PATTERN may be matching "th" separately, or there may be a filtering issue in postprocessing.
Investigation needed: check the pattern order in get_comprehensive_pattern() and verify that ordinal tokens aren't being split during postprocessing.
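The splitting half of the symptom can be reproduced in isolation with simplified stand-in patterns (assumptions for illustration, not the real patterns.py contents). If the numeric branch that wins at "6" lacks the ordinal suffix, "6" and "th" come apart, and a later postprocessing filter could then plausibly drop the bare "6":

```python
import re

# Stand-in patterns: numeric branch with and without the ordinal suffix.
with_suffix = re.compile(r"\d+(?:st|nd|rd|th)?|[a-zA-Z]+")
without_suffix = re.compile(r"\d+|[a-zA-Z]+")

print(with_suffix.findall("6th amendment right"))     # ['6th', 'amendment', 'right']
print(without_suffix.findall("6th amendment right"))  # ['6', 'th', 'amendment', 'right']
```

Since Python's `re` tries alternation branches left to right, whichever branch matches "6" first determines whether the suffix is consumed with it.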
3. Large numbers with separators are truncated
Current Behavior:
```python
tokenizer.tokenize("200,000 Fraudulent Ballots")
# Result: ["000", "fraudulent", "ballots"]  # "200," is lost
```
Expected Behavior:
```python
# Result: ["200,000", "fraudulent", "ballots"]
# Or: ["200000", "fraudulent", "ballots"]  # Normalized
```
Root Cause:
NUMERIC_PATTERN in patterns.py:41-48 has:
```python
r"\d+[.,]\d+|"  # Numbers with comma/period separators
```
This pattern only matches a single separator (e.g., "123,456" or "3.14"), but NOT multiple separators like "200,000" or "1,234,567.89".
Fix needed: update the pattern to handle multiple thousand separators:
```python
r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?|"  # Thousand separators + optional decimal
```
4. Korean text is broken into characters instead of words
Current Behavior:
```python
tokenizer.tokenize("안녕하세요 세계")  # "Hello world" in Korean
# Result: ["안", "녕", "하", "세", "요", "세", "계"]  # Character-level
```
Expected Behavior:
```python
# Result: ["안녕하세요", "세계"]  # Space-separated words
```
Root Cause:
Korean (Hangul) is incorrectly classified as a character-level script in BasicTokenizer._is_char_level_script() (tokenizer.py:75-88):
```python
def _is_char_level_script(self, char: str) -> bool:
    code_point = ord(char)
    return (
        ...
        or (0xAC00 <= code_point <= 0xD7AF)  # Hangul Syllables ❌ WRONG!
        ...
    )
```
Issue: Korean (Hangul) is a space-separated language like English and Arabic, NOT a scriptio continua language like Chinese/Japanese/Thai.
Fix needed:
- Remove Hangul from _is_char_level_script()
- Add Hangul detection to the space-separated word tokenization logic
- Update _get_char_script() to return "korean" or "hangul" for proper handling
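A minimal sketch of the detection side of the fix (the helper name `is_hangul` is hypothetical; the code-point ranges are the standard Unicode Hangul blocks):

```python
def is_hangul(char: str) -> bool:
    """Hypothetical helper: identify Hangul so it can be routed to
    space-separated word tokenization instead of character splitting."""
    cp = ord(char)
    return (
        0xAC00 <= cp <= 0xD7AF     # Hangul Syllables
        or 0x1100 <= cp <= 0x11FF  # Hangul Jamo
        or 0x3130 <= cp <= 0x318F  # Hangul Compatibility Jamo
    )

# Once Hangul is treated as space-separated, plain whitespace splitting
# already yields the expected result for the example in this issue:
print("안녕하세요 세계".split())  # ['안녕하세요', '세계']
```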
Impact
- Issue #1 (Hyphens): Affects compound words, hyphenated names, and multi-word hashtags
- Issue #2 (Ordinals): Affects dates, rankings, amendments, and numbered lists
- Issue #3 (Large numbers): Affects vote counts, statistics, and financial data
- Issue #4 (Korean): Breaks all Korean language support, making text unintelligible
Affected Files
- services/tokenizer/basic/patterns.py - Pattern definitions
- services/tokenizer/basic/tokenizer.py - Script detection logic
- services/tokenizer/core/types.py - Language family definitions (may need KOREAN added)
Proposed Solution
Priority 1 (Critical): Korean Script Fix
- Remove Hangul from character-level script detection
- Add Korean/Hangul to space-separated language handling
- Add test cases for Korean tokenization
Priority 2 (High): Number and Ordinal Fixes
- Fix numeric pattern to support multiple thousand separators
- Verify ordinal pattern matching order and priority
- Add test cases for large numbers and ordinals
Priority 3 (Medium): Hyphen Support
- Add hyphen support to LATIN_WORD_PATTERN
- Ensure hyphens are preserved in compound words
- Add test cases for hyphenated words
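As a sanity check before wiring it into patterns.py, the thousand-separator branch proposed under issue 3 can be exercised in isolation (the other two branches here are simplified assumptions, not the full NUMERIC_PATTERN):

```python
import re

NUMERIC = re.compile(
    r"\d{1,3}(?:[.,]\d{3})+(?:\.\d+)?"  # proposed: thousand separators + optional decimal
    r"|\d+[.,]\d+"                      # existing: single-separator numbers
    r"|\d+(?:st|nd|rd|th)?"             # existing: integers and ordinals
)

print(NUMERIC.findall("200,000 ballots and 1,234,567.89 total"))
# ['200,000', '1,234,567.89']
```

The new branch must come first in the alternation; otherwise the single-separator branch would consume "200,000" as before and "1,234,567.89" piecewise.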
Test Cases Needed
```python
# Korean (space-separated)
def test_korean_space_separated():
    tokenizer = BasicTokenizer()
    text = "안녕하세요 세계"
    result = tokenizer.tokenize(text)
    expected = ["안녕하세요", "세계"]
    assert result == expected

# Hyphens
def test_hyphenated_words():
    tokenizer = BasicTokenizer()
    text = "self-aware co-founder"
    result = tokenizer.tokenize(text)
    expected = ["self-aware", "co-founder"]
    assert result == expected

# Ordinals
def test_ordinal_numbers():
    tokenizer = BasicTokenizer()
    text = "6th amendment 21st century"
    result = tokenizer.tokenize(text)
    expected = ["6th", "amendment", "21st", "century"]
    assert result == expected

# Large numbers with separators
def test_large_numbers_with_separators():
    tokenizer = BasicTokenizer()
    text = "200,000 fraudulent ballots and 1,234,567 votes"
    result = tokenizer.tokenize(text)
    expected = ["200,000", "fraudulent", "ballots", "and", "1,234,567", "votes"]
    assert result == expected
```
Additional Context
These issues were discovered during real-world usage with social media data containing:
- Political discussion (ordinals, large numbers)
- Korean language content (multilingual support)
- Technical terminology (hyphenated compounds)
References
- Code locations identified in investigation
- Existing test suite: services/tokenizer/basic/test_basic_tokenizer.py
- Pattern definitions: services/tokenizer/basic/patterns.py:41-108
- Script detection: services/tokenizer/basic/tokenizer.py:75-124