I haven't tested this, but in general I think the right place to do this would be in the `after_init` callback.
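
A minimal sketch of what that could look like, assuming a spaCy v3 config-driven pipeline with a `transformer` pipe; the callback name and the way the Hugging Face tokenizer is reached through the pipe's model are my own assumptions and may differ by spacy-transformers version:

```python
import spacy
from spacy.language import Language


@spacy.registry.callbacks("set_gpt2_pad_token")
def create_set_gpt2_pad_token():
    def set_gpt2_pad_token(nlp: Language) -> Language:
        trf = nlp.get_pipe("transformer")
        # Assumption: the Hugging Face tokenizer is reachable on the pipe's
        # model; the exact attribute path can differ between
        # spacy-transformers versions.
        hf_tokenizer = trf.model.tokenizer
        if hf_tokenizer.pad_token is None:
            # Reuse the existing end-of-text token rather than adding a new
            # entry to the vocab.
            hf_tokenizer.pad_token = hf_tokenizer.eos_token  # "<|endoftext|>"
        return nlp

    return set_gpt2_pad_token
```

The registered name would then be referenced from the config's `[initialize.after_init]` block via `@callbacks = "set_gpt2_pad_token"`.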

I haven't done much training/testing with GPT-2 models, but I can see that in our tests we just use `"<|endoftext|>"` as the pad token, which is already in the vocab.
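
For reference, a sketch of doing the same thing with the Hugging Face tokenizer on its own (outside spaCy), since the base GPT-2 tokenizer ships without a pad token:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 has no pad token by default; reuse the existing end-of-text token
# so no new entry has to be added to the vocab / embedding matrix.
tokenizer.pad_token = tokenizer.eos_token  # "<|endoftext|>"
```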
