It's probably not possible to 100% avoid this problem for random/gibberish input using spaCy's default tokenizer, because it can end up with very long single tokens that get split into many transformer wordpiece tokens.

In practice, with relatively sensible natural-language input, you can probably avoid most of these cases by removing the url_match tokenizer pattern and letting the tokenizer split long URLs on punctuation, which brings the spaCy tokenization closer to the wordpiece tokenization.
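
For a quick runtime check, a minimal sketch looks something like this (a blank English pipeline and a made-up URL, just for illustration):

```python
import spacy

nlp = spacy.blank("en")

# Drop the url_match pattern so long URLs are split on punctuation
# instead of being kept as single very long tokens.
nlp.tokenizer.url_match = None

doc = nlp("See https://example.com/a/very/long/path?q=1&page=2 for details")
print([t.text for t in doc])
```

With url_match removed, the prefix/suffix/infix punctuation rules take over and break the URL into smaller pieces.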

Docs on customizing the tokenizer for training: https://spacy.io/usage/training#custom-tokenizer
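
For training, roughly following those docs, one option is to register a tokenizer factory that wraps the default tokenizer and disables url_match, then point the config's `[nlp.tokenizer]` block at it. This is an untested sketch; the registered name `no_url_match_tokenizer.v1` is just a placeholder, and the file would be passed to `spacy train` via `--code`:

```python
import spacy
from spacy.util import registry

# Reference this in the training config as:
#   [nlp.tokenizer]
#   @tokenizers = "no_url_match_tokenizer.v1"
@registry.tokenizers("no_url_match_tokenizer.v1")
def create_no_url_match_tokenizer():
    def create_tokenizer(nlp):
        # Build the language's default tokenizer first...
        default_factory = registry.tokenizers.get("spacy.Tokenizer.v1")
        tokenizer = default_factory()(nlp)
        # ...then drop the url_match pattern so URLs split on punctuation.
        tokenizer.url_match = None
        return tokenizer
    return create_tokenizer
```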

But if this warning is rare, it is probably fine to ignore it. Since it only affects a singl…
