Using methods in spacy config file #11490
-
Sometimes there is a need to add a special token to the tokenizer, for instance when using gpt2 in a spaCy config via `[components.transformer.model]` and `[components.transformer.model.tokenizer_config]`. However, once a `pad_token` is added, the tokenizer's vocabulary size no longer matches the model's embedding matrix. In plain code, calling `model.resize_token_embeddings(len(tokenizer))` on the line after instantiating the model rectifies the issue. With a spaCy config file, can `model.resize_token_embeddings(len(tokenizer))` be added anywhere so that we can still use the CLI to train a gpt2 model?
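For context, a config along these lines is presumably what the question has in mind (a sketch; the `pad_token = "<pad>"` entry is an assumption about how the special token was added, since `tokenizer_config` entries are forwarded to the Hugging Face tokenizer):

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "gpt2"

[components.transformer.model.tokenizer_config]
use_fast = true
pad_token = "<pad>"
```

Outside of spaCy, the fix the question refers to is the standard Hugging Face pattern:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# gpt2 ships without a pad token, so one has to be added explicitly.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = GPT2LMHeadModel.from_pretrained("gpt2")
# The new token grows the vocabulary, so the embedding matrix must be
# resized to match before the new id can be used.
model.resize_token_embeddings(len(tokenizer))
```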
-
I haven't tested this, but in general I think the right place to do this would be in the `after_init` callback. I haven't done much training/testing with gpt2 models, but I can see that in our tests we just use `"<|endoftext|>"` as the pad token, which is already in the vocab.
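An untested sketch of what that could look like. The registration pattern (`@spacy.registry.callbacks` plus an `[initialize.after_init]` block) is standard spaCy; the callback name is made up, and the attribute paths used to reach the wrapped Hugging Face model and tokenizer (`trf.model.transformer`, `trf.model.tokenizer`) are assumptions about spacy-transformers internals that should be verified against the installed version:

```python
# functions.py -- loaded with the --code flag so the callback gets registered
import spacy

@spacy.registry.callbacks("resize_transformer_embeddings")
def make_resize_callback():
    def after_init(nlp):
        trf = nlp.get_pipe("transformer")
        # Assumption: the pipe's Thinc model exposes the underlying
        # Hugging Face model and tokenizer under these attributes.
        hf_model = trf.model.transformer
        hf_tokenizer = trf.model.tokenizer
        # Resize the embedding matrix to cover any added special tokens.
        hf_model.resize_token_embeddings(len(hf_tokenizer))

    return after_init
```

with the callback wired up in the config:

```ini
[initialize.after_init]
@callbacks = "resize_transformer_embeddings"
```

Training then stays on the CLI: `python -m spacy train config.cfg --code functions.py`.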