Improve StaccatoTokenizer by alanakbik · Pull Request #3673 · flairNLP/flair

alanakbik · 2025-06-11T12:24:06Z

This pull request refines the StaccatoTokenizer to improve its handling of special text constructs. The tokenizer's regular expression logic was updated to correctly process words containing diacritics, such as German umlauts, ensuring they are not split apart. Additionally, the tokenizer now properly identifies multi-part abbreviations like "e.g." or "U.S." as single tokens, while correctly separating single-word abbreviations from sentence-ending periods. Unit tests have been added to verify this new behavior.
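To make the described behavior concrete, here is a minimal, self-contained sketch of a regex-based tokenizer that keeps words with diacritics intact, treats multi-part abbreviations as single tokens, and splits a trailing period off single-word abbreviations. It is not the code from this pull request: the names `TOKEN_PATTERN` and `sketch_tokenize`, the pattern itself, and the NFC normalization step are all assumptions chosen for illustration, and the actual StaccatoTokenizer may work differently.

```python
import re
import unicodedata

# Hypothetical pattern for illustration only; not the regex used in flair.
# Alternatives are tried left to right, so abbreviations win over plain words.
TOKEN_PATTERN = re.compile(
    r"(?:[^\W\d_]\.){2,}"  # multi-part abbreviations such as "e.g." or "U.S."
    r"|[^\W\d_]+"          # runs of Unicode letters, including umlauts
    r"|\d+"                # runs of digits
    r"|[^\w\s]|_"          # any other single non-space character
)


def sketch_tokenize(text: str) -> list[str]:
    """Split text into tokens using the illustrative pattern above."""
    # Assumption: compose characters first (NFC) so that a letter followed by
    # a combining diaeresis becomes one precomposed letter and is not split.
    normalized = unicodedata.normalize("NFC", text)
    return TOKEN_PATTERN.findall(normalized)


print(sketch_tokenize("Die Bücher sind schön."))
# ['Die', 'Bücher', 'sind', 'schön', '.']
print(sketch_tokenize("Use e.g. the U.S. office etc."))
# ['Use', 'e.g.', 'the', 'U.S.', 'office', 'etc', '.']
```

In this sketch the abbreviation alternative requires at least two letter-plus-period pairs, which is why "etc." is split into "etc" and "." while "e.g." stays whole.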

alanakbik added 4 commits June 10, 2025 18:42

Add diacritics

0b69d12

Add handling for abbreviations

4f2b858

Add tests

88ef4e4

Black formatting

6e39cfc

alanakbik merged commit 5d04a9b into master Jun 11, 2025
2 checks passed

alanakbik deleted the staccato_mod branch June 11, 2025 16:16

alanakbik merged 4 commits into master from staccato_mod
