Improve StaccatoTokenizer #3673

Merged
alanakbik merged 4 commits into master from staccato_mod on Jun 11, 2025
@alanakbik
Collaborator

This pull request refines the StaccatoTokenizer to improve its handling of special text constructs. The tokenizer's regular expression logic was updated to correctly process words containing diacritics, such as German umlauts, ensuring they are not split apart. Additionally, the tokenizer now properly identifies multi-part abbreviations like "e.g." or "U.S." as single tokens, while correctly separating single-word abbreviations from sentence-ending periods. Unit tests have been added to verify this new behavior.
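
To illustrate the behavior described above, here is a minimal, self-contained sketch of a regex-based tokenizer. It is not the pattern used in `StaccatoTokenizer`; the names `TOKEN_PATTERN` and `sketch_tokenize` are hypothetical and exist only for this example. It assumes the input uses precomposed characters (for instance `ü` as a single code point), since Python's `\w` does not match standalone combining marks.

```python
import re

# Hypothetical pattern for illustration only; NOT the regex used by
# Flair's StaccatoTokenizer. Alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[^\W\d_]\.){2,}   # multi-part abbreviations such as e.g. or U.S.
    | [^\W\d_]+          # runs of Unicode letters, including umlauts
    | \d+                # runs of digits
    | [^\w\s]            # any single punctuation character
    | _                  # underscore, which is neither letter nor punctuation here
    """,
    re.VERBOSE,
)


def sketch_tokenize(text: str) -> list[str]:
    """Split text into tokens using the illustrative pattern above."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    # Words with diacritics stay whole; the final period is its own token.
    print(sketch_tokenize("Die Straße ist schön."))
    # Multi-part abbreviations stay whole; "etc" is split from its period.
    print(sketch_tokenize("He moved to the U.S. last year, e.g. for work etc."))
```

Under these assumptions, `[^\W\d_]` matches any Unicode letter, so `schön` and `Straße` are not split at the non-ASCII characters. The first alternative requires at least two letter-plus-period pairs, which is why `U.S.` is kept together while `etc.` falls through to the letter-run alternative and leaves its period as a separate token.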

@alanakbik alanakbik merged commit 5d04a9b into master Jun 11, 2025
2 checks passed
@alanakbik alanakbik deleted the staccato_mod branch June 11, 2025 16:16