GH-3655: Tokenization on predict by alanakbik · Pull Request #3668 · flairNLP/flair

alanakbik · 2025-06-04T12:15:34Z

This PR changes how tokenization works during prediction. Models can now remember the specific tokenizer they were trained with. This lets us ensure that, when predicting tags, the model and the Sentence objects follow the same tokenization scheme. In principle, this should improve prediction accuracy, since the model sees the same tokenization at prediction time that it saw during training.

Specifically:

Models that inherit from DefaultClassifier, if trained with a specific tokenizer, now set this tokenizer on any Sentence object that is passed to them during prediction. The same applies to the SequenceTagger.
The Sentence object now remembers the tokenizer used to generate its Tokens. There is a new setter that allows setting a different tokenizer on an already created Sentence. If a different tokenizer is set and tokens are then requested, this triggers a retokenization with the new tokenization scheme.
All Tokenizers are now serializable and have equality defined.
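
The behavior described in the three points above can be illustrated with a minimal, self-contained sketch. This is not the flair implementation and does not import flair: `ToyTokenizer`, `WhitespaceTokenizer`, `PunctuationTokenizer`, `ToySentence`, `ToyModel`, and the `to_dict`/`from_dict` helpers are all hypothetical names chosen for illustration only. It assumes lazy tokenization, a tokenizer setter that invalidates cached tokens, and tokenizer equality based on serialized configuration.

```python
import re
from typing import Dict, List, Optional, Tuple


class ToyTokenizer:
    """Hypothetical base class: serializable, with value-based equality."""

    def tokenize(self, text: str) -> List[str]:
        raise NotImplementedError

    def to_dict(self) -> Dict[str, str]:
        # Serialize only what is needed to rebuild the tokenizer.
        return {"class": type(self).__name__}

    @staticmethod
    def from_dict(config: Dict[str, str]) -> "ToyTokenizer":
        return TOKENIZERS[config["class"]]()

    def __eq__(self, other: object) -> bool:
        # Two tokenizers are equal if they serialize to the same config.
        return isinstance(other, ToyTokenizer) and self.to_dict() == other.to_dict()

    def __hash__(self) -> int:
        return hash(type(self).__name__)


class WhitespaceTokenizer(ToyTokenizer):
    def tokenize(self, text: str) -> List[str]:
        return text.split()


class PunctuationTokenizer(ToyTokenizer):
    def tokenize(self, text: str) -> List[str]:
        # Words and single punctuation marks become separate tokens.
        return re.findall(r"\w+|[^\w\s]", text)


TOKENIZERS = {cls.__name__: cls for cls in (WhitespaceTokenizer, PunctuationTokenizer)}


class ToySentence:
    """Remembers its tokenizer and tokenizes lazily on first access."""

    def __init__(self, text: str, tokenizer: ToyTokenizer) -> None:
        self.text = text
        self._tokenizer = tokenizer
        self._tokens: Optional[List[str]] = None  # filled on demand

    @property
    def tokenizer(self) -> ToyTokenizer:
        return self._tokenizer

    @tokenizer.setter
    def tokenizer(self, new_tokenizer: ToyTokenizer) -> None:
        if new_tokenizer != self._tokenizer:
            self._tokenizer = new_tokenizer
            self._tokens = None  # invalidate; retokenize on next access

    @property
    def tokens(self) -> List[str]:
        if self._tokens is None:
            self._tokens = self._tokenizer.tokenize(self.text)
        return self._tokens


class ToyModel:
    """Stands in for a model that remembers its training tokenizer."""

    def __init__(self, tokenizer: Optional[ToyTokenizer] = None) -> None:
        self.tokenizer = tokenizer

    def predict(self, sentence: ToySentence) -> List[Tuple[str, str]]:
        if self.tokenizer is not None:
            sentence.tokenizer = self.tokenizer  # align with training
        return [(token, "O") for token in sentence.tokens]  # dummy tags


sentence = ToySentence("Hello, world!", WhitespaceTokenizer())
print(sentence.tokens)  # ['Hello,', 'world!']

model = ToyModel(tokenizer=PunctuationTokenizer())
model.predict(sentence)
print(sentence.tokens)  # ['Hello', ',', 'world', '!']
```

In this sketch, setting an equal tokenizer leaves the cached tokens untouched, so retokenization only happens when the scheme actually changes. Whether flair applies the same optimization is not stated in this PR description.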

Closes #3655

alanakbik added 9 commits June 3, 2025 21:25

Save and load tokenizer in Model

9210806

Add test

e30927e

Fix recursion error

215fbaa

Working solution with non-lazy tokenization

4889571

Enable lazy tokenization | Equality checks for tokenizers

e33014e

Mypy errors fixed

fa4e14e

Remove printline

7113a73

Fix mypy errors

fa26fec

Print full text

d35c319

alanakbik merged commit e2d30b9 into master Jun 4, 2025
2 checks passed

alanakbik merged 9 commits into master from GH-3655-tokenization-on-predict
