Pretrained model with custom tokenizer / word vectors #8160
-
Hello! I have two questions that may be a little stupid. In our project, we have custom word vectors and a custom tokenizer, and for training we also want to use pretrained models (transformers) to improve accuracy. My questions are:
Thanks in advance for your answers!
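For context, here is roughly what a setup like this looks like in a spaCy v3 training config: the custom tokenizer is plugged in under `[nlp.tokenizer]` via a registered function, alongside a `transformer` component. This is only a sketch of a config fragment; `custom_tokenizer.v1` is a hypothetical registered name, not a built-in.

```ini
# Sketch of a spaCy v3 config fragment (not a complete config).
# "custom_tokenizer.v1" is a hypothetical name for a tokenizer you have
# registered yourself with @spacy.registry.tokenizers.
[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[nlp.tokenizer]
@tokenizers = "custom_tokenizer.v1"

[components.transformer]
factory = "transformer"
```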
Replies: 1 comment 4 replies
-
If you are using only `transformer` and not `tok2vec` (either as a separate component or internal to a component with an architecture using some form of `HashEmbed`), then the custom tokenizer will mainly affect the tokenization that you see in the resulting spacy `Doc`. During training, the `transformer` component is using the transformer tokenizer internally and not the spacy tokenization. However, if there are a lot of alignment issues between your gold annotation and the predicted tokenization from the custom tokenizer, this would affect the training and evaluation for a component like `ner`, but this is true for both `transformer` and `tok2vec`.

I think you could also potentially run into poor results if your custom token boundaries frequently don't align with the transformer tokenizer boundaries at all, because it would make it hard to align the transformer output with the spacy tokens, but that would usually be extremely odd tokenization from the custom tokenizer. It shouldn't fail to train, but the results might be worse. (This is probably mainly a theoretical edge case, like if your custom tokenizer split …)

If you have custom word vectors, they are not used with the …