The tokenizer's logic is fairly involved, so there is no single list of abbreviations anywhere, but the relevant data lives in tokenizer_exceptions.py. The linked file is the English one; other languages have their own settings.

If you want to customize the tokenizer, see the tokenization docs.
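
As a rough sketch of both points, assuming the library in question is spaCy v3 (the file name suggests this, but it is an assumption), you can inspect the special-case rules that a language's tokenizer loads and add one of your own. The helper names and the `approx.` abbreviation are made up for illustration. Filtering on a trailing period is only a heuristic, since the exceptions also cover contractions and other special cases.

```python
import spacy
from spacy.symbols import ORTH


def abbreviation_like_rules(nlp):
    """Return special-case strings ending in a period (heuristic, not an official list)."""
    # tokenizer.rules maps each special-case string to its token attributes
    return sorted(s for s in nlp.tokenizer.rules if s.endswith("."))


def tokenize(nlp, text):
    """Return the token texts for a string."""
    return [token.text for token in nlp(text)]


# A blank English pipeline is enough; no trained model download is needed
nlp = spacy.blank("en")

candidates = abbreviation_like_rules(nlp)
print(len(candidates), candidates[:10])

# Register a hypothetical abbreviation so it stays a single token
nlp.tokenizer.add_special_case("approx.", [{ORTH: "approx."}])
print(tokenize(nlp, "It takes approx. two hours"))
```

Swapping `"en"` for another language code should show that language's exceptions instead, provided spaCy ships data for it.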

Answer selected by pulkitmehtawork