Huggingface tokenizer provides incorrect model_max_length #7393
-
How to reproduce the behaviour

I'm using a Hugging Face transformer model with spaCy, and its tokenizer reports an incorrect `model_max_length`. How can I set the correct value?

Your Environment

Info about spaCy
Replies: 2 comments
-
There are some models where this setting is just missing from the config and some where the saved value is incorrect. The solution is to save a local copy of the model with the updated setting:

```python
from transformers import AutoTokenizer, AutoModel

name = "nlpaueb/legal-bert-base-uncased"
local_path = "/path/to/legal-bert-base-uncased"

model = AutoModel.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# add the setting (note that you can modify tokenizer.model_max_length on the fly,
# but frustratingly this change isn't saved as part of the saved config)
tokenizer.init_kwargs["model_max_length"] = 512

# save
tokenizer.save_pretrained(local_path)
model.save_pretrained(local_path)
```

Then point your spaCy config at `local_path` instead of `name`.
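To see why editing `init_kwargs` matters rather than the live attribute, here is a rough stdlib-only sketch of the save behaviour described above (`ToyTokenizer` is a made-up stand-in, a simplification of what `save_pretrained` does, not the real transformers implementation): only the init kwargs reach the serialized `tokenizer_config.json`, so an attribute changed on the fly never hits disk.

```python
import json
import os
import tempfile

# Toy stand-in for a tokenizer (hypothetical, for illustration only):
# a live attribute plus the kwargs that get serialized on save.
class ToyTokenizer:
    def __init__(self, **kwargs):
        self.init_kwargs = kwargs
        self.model_max_length = kwargs.get("model_max_length", int(1e30))

    def save_pretrained(self, path):
        # Only init_kwargs are written out -- mirrors why changing
        # tokenizer.model_max_length alone is not persisted.
        with open(os.path.join(path, "tokenizer_config.json"), "w") as f:
            json.dump(self.init_kwargs, f)

tok = ToyTokenizer()
tok.model_max_length = 512  # changed on the fly: lost on save
with tempfile.TemporaryDirectory() as d:
    tok.save_pretrained(d)
    saved = json.load(open(os.path.join(d, "tokenizer_config.json")))
    print("model_max_length" in saved)  # False -- the change was not saved

tok.init_kwargs["model_max_length"] = 512  # the fix from this reply
with tempfile.TemporaryDirectory() as d:
    tok.save_pretrained(d)
    saved = json.load(open(os.path.join(d, "tokenizer_config.json")))
    print(saved["model_max_length"])  # 512 -- now it persists
```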
-
Update: a much simpler solution is to set this in the spaCy config:

```
[components.transformer.model.tokenizer_config]
use_fast = true
model_max_length = 512
```
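For context, a sketch of where that fragment sits in a fuller transformer component config; the `factory` and `@architectures` values are the usual spacy-transformers defaults and are an assumption here, not taken from this thread (only the tokenizer_config block and model name are):

```
[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "nlpaueb/legal-bert-base-uncased"

[components.transformer.model.tokenizer_config]
use_fast = true
model_max_length = 512
```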