System Info
transformers 5.2.0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
My Hugging Face tokenizer, isaacus/kanon-2-tokenizer, no longer loads in Transformers v5. I see the following error when attempting to load it:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 3
1 from transformers import AutoTokenizer
----> 3 tok = AutoTokenizer.from_pretrained("isaacus/kanon-2-tokenizer")
File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py:712, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
709 if tokenizer_class is None:
710 tokenizer_class = TokenizersBackend
--> 712 return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
713 elif getattr(config, "tokenizer_class", None):
714 _class = config.tokenizer_class
File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1712, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
1709 if file_id not in resolved_vocab_files:
1710 continue
-> 1712 return cls._from_pretrained(
1713 resolved_vocab_files,
1714 pretrained_model_name_or_path,
1715 init_configuration,
1716 *init_inputs,
1717 token=token,
1718 cache_dir=cache_dir,
1719 local_files_only=local_files_only,
1720 _commit_hash=commit_hash,
1721 _is_local=is_local,
1722 trust_remote_code=trust_remote_code,
1723 **kwargs,
1724 )
File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1839, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
1837 continue # User-provided kwargs take precedence
1838 if isinstance(value, dict) and key != "extra_special_tokens":
-> 1839 value = AddedToken(**value, special=True)
1840 elif key == "extra_special_tokens" and isinstance(value, list):
1841 # Merge list tokens, converting dicts to AddedToken
1842 existing = list(init_kwargs.get("extra_special_tokens") or [])
TypeError: tokenizers.AddedToken() got multiple values for keyword argument 'special'

Expected behavior
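The traceback suggests that the serialized added-token dict resolved from the tokenizer config already contains a "special" key, so `AddedToken(**value, special=True)` passes `special` twice. A minimal sketch of that failure mode in plain Python (the stand-in function and dict contents here are hypothetical, not the actual transformers internals):

```python
def added_token(content, special=False, **extra):
    """Stand-in for the tokenizers.AddedToken constructor signature."""
    return {"content": content, "special": special, **extra}

# A serialized token dict that already carries a "special" key,
# as my tokenizer's config on the Hub appears to.
value = {"content": "<|endoftext|>", "special": True}

try:
    # Mirrors the failing call: AddedToken(**value, special=True).
    # The key from **value collides with the explicit keyword.
    added_token(**value, special=True)
except TypeError as e:
    print(e)  # ... got multiple values for keyword argument 'special'
```

If this is the cause, one possible fix on the transformers side might be to merge the flag into the dict before expanding it, e.g. `AddedToken(**{**value, "special": True})`, though that is a guess at the intended semantics.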
The tokenizer should load correctly, as it did in previous versions.