Pretrained model with custom tokenizer / word vectors #8160
-
Hello! I have two questions that may be a little stupid. In our project, we have custom word vectors and a custom tokenizer, and for training we also want to use pretrained models (transformers) to improve accuracy. My questions are:
Thanks in advance for your answers!
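For context, here is roughly what a setup like this looks like in a spaCy v3 training config: the custom tokenizer is plugged in under `[nlp.tokenizer]` via a registered function, alongside a `transformer` component. This is only a sketch of a config fragment; `custom_tokenizer.v1` is a hypothetical registered name, not a built-in.

```ini
# Sketch of a spaCy v3 config fragment (not a complete config).
# "custom_tokenizer.v1" is a hypothetical name for a tokenizer you have
# registered yourself with @spacy.registry.tokenizers.
[nlp]
lang = "en"
pipeline = ["transformer", "ner"]

[nlp.tokenizer]
@tokenizers = "custom_tokenizer.v1"

[components.transformer]
factory = "transformer"
```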
Replies: 1 comment 4 replies
-
If you are using only `transformer` and not `tok2vec` (either as a separate component or internal to a component with an architecture using some form of `HashEmbed`), then the custom tokenizer will mainly affect the tokenization that you see in the resulting spacy `Doc`. During training, the `transformer` component is using the transformer tokenizer internally and not the spacy tokenization. However, if there are a lot of alignment issues between your gold annotation and the predicted tokenization from the custom tokenizer, this would affect the training and evaluation for a component like `ner`, but this is true for both `transformer` and `tok2vec`.

I think you could also potentially run into poor results if your custom token boundaries frequently don't align with the transformer tokenizer boundaries at all, because it would make it hard to align the transformer output with the spacy tokens, but that would usually be extremely odd tokenization from the custom tokenizer. It shouldn't fail to train, but the results might be worse. (This is probably mainly a theoretical edge case, like if your custom tokenizer split …)

If you have custom word vectors, they are not used with the …