Train warning: Token indices sequence length is longer than the specified maximum #13032
I am trying to train a new textcat component and I am receiving the warning in the title ("Token indices sequence length is longer than the specified maximum"). Some other discussion (e.g. #9277) seems to suggest that this warning comes from the libraries underlying spaCy rather than spaCy itself, and that it can be ignored because the transformer component in spaCy uses strided spans. Nevertheless, I would like to fix my training data to avoid the problem if possible (e.g. by removing long URLs). Any advice?

Here is output from the training:
Here is my config:
And the spacy info:
Replies: 1 comment
It's probably not possible to 100% avoid this problem for random/gibberish input using spaCy's default tokenizer, because it can end up with very long single tokens that are split into many transformer wordpiece tokens.

In practice, with relatively sensible natural language input, you can probably avoid most of these cases by removing the `url_match` tokenizer pattern and letting the tokenizer split long URLs on punctuation, which will bring the spaCy tokenization closer to the wordpiece tokenization.

Docs on customizing the tokenizer for training: https://spacy.io/usage/training#custom-tokenizer
But if this warning is rare, it is probably fine to ignore it. Since it only affects a single span within the doc, the effect on the resulting annotation for textcat is probably small.