GH-3655: Tokenization on predict #3668

Merged
alanakbik merged 9 commits into master from GH-3655-tokenization-on-predict
Jun 4, 2025
@alanakbik alanakbik commented Jun 4, 2025

This PR changes how tokenization works during prediction. Models can now remember the specific tokenizer they were trained with. This lets us ensure that, when predicting tags, the model and the Sentence objects it receives follow the same tokenization scheme. In theory, this should improve prediction accuracy, since the model sees text split into tokens the same way as during training.

Specifically:

  • Models that inherit from DefaultClassifier, if trained with a specific tokenizer, now set this tokenizer on any Sentence object that is passed to them during prediction. The same applies to the SequenceTagger.
  • The Sentence object now remembers the tokenization used to generate Tokens. There is a new setter that allows setting a different tokenizer for an already created Sentence. If a different tokenizer is set and tokens are requested, this triggers a retokenization with the new tokenization scheme.
  • All Tokenizers are now serializable and have equality defined.
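
To make the three points above concrete, here is a minimal, standard-library-only sketch of the mechanism. It is an illustration under stated assumptions, not the code from this PR: `RegexTokenizer`, `ToySentence`, `ToyTagger`, and the `to_dict`/`from_dict` method names are hypothetical stand-ins and do not reflect Flair's actual classes or signatures.

```python
import json
import re


class RegexTokenizer:
    """Toy tokenizer defined entirely by a regex (hypothetical, not a Flair class)."""

    def __init__(self, pattern):
        self.pattern = pattern

    def tokenize(self, text):
        return re.findall(self.pattern, text)

    # Equality: two tokenizers are equal if they split text the same way.
    def __eq__(self, other):
        return isinstance(other, RegexTokenizer) and self.pattern == other.pattern

    def __hash__(self):
        return hash(self.pattern)

    # Serialization: enough state to rebuild the tokenizer alongside a saved model.
    def to_dict(self):
        return {"pattern": self.pattern}

    @classmethod
    def from_dict(cls, data):
        return cls(data["pattern"])


class ToySentence:
    """Toy sentence that remembers its tokenizer and tokenizes lazily."""

    def __init__(self, text, tokenizer):
        self.text = text
        self._tokenizer = tokenizer
        self._tokens = None  # computed on first access

    @property
    def tokenizer(self):
        return self._tokenizer

    @tokenizer.setter
    def tokenizer(self, new_tokenizer):
        # Only a *different* tokenizer invalidates the cached tokens.
        if new_tokenizer != self._tokenizer:
            self._tokenizer = new_tokenizer
            self._tokens = None

    @property
    def tokens(self):
        # Retokenization happens here, the next time tokens are requested.
        if self._tokens is None:
            self._tokens = self._tokenizer.tokenize(self.text)
        return self._tokens


class ToyTagger:
    """Toy model that remembers the tokenizer it was 'trained' with."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def predict(self, sentence):
        # Align the sentence with the model's tokenization before tagging.
        sentence.tokenizer = self.tokenizer
        return [(tok, "CAP" if tok[:1].isupper() else "O") for tok in sentence.tokens]


WHITESPACE = RegexTokenizer(r"\S+")
PUNCT_AWARE = RegexTokenizer(r"\w+|[^\w\s]")

# The sentence is created with a tokenizer that differs from the model's.
sentence = ToySentence("Berlin is nice.", WHITESPACE)
tokens_before = list(sentence.tokens)

# Round-trip the model's tokenizer through JSON, as if loading a saved model.
restored = RegexTokenizer.from_dict(json.loads(json.dumps(PUNCT_AWARE.to_dict())))
tagger = ToyTagger(restored)

tags = tagger.predict(sentence)
tokens_after = list(sentence.tokens)

print(tokens_before)  # ['Berlin', 'is', 'nice.']
print(tokens_after)   # ['Berlin', 'is', 'nice', '.']
```

In this sketch, the equality check in the setter is what keeps retokenization from happening when the sentence already uses the model's tokenizer, which is one plausible reason for defining equality on tokenizers.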

Closes #3655

@alanakbik alanakbik merged commit e2d30b9 into master Jun 4, 2025
2 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature]: Make Tokenization part of Flair models
