TypeError: tokenizers.AddedToken() got multiple values for keyword argument 'special' #44062

@umarbutler

Description

System Info

transformers v5.2.0

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

My Hugging Face tokenizer no longer loads in Transformers v5. The tokenizer is isaacus/kanon-2-tokenizer. I am seeing this error when attempting to load it:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 3
      1 from transformers import AutoTokenizer
----> 3 tok = AutoTokenizer.from_pretrained("isaacus/kanon-2-tokenizer")

File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py:712, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    709     if tokenizer_class is None:
    710         tokenizer_class = TokenizersBackend
--> 712     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    713 elif getattr(config, "tokenizer_class", None):
    714     _class = config.tokenizer_class

File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1712, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   1709     if file_id not in resolved_vocab_files:
   1710         continue
-> 1712 return cls._from_pretrained(
   1713     resolved_vocab_files,
   1714     pretrained_model_name_or_path,
   1715     init_configuration,
   1716     *init_inputs,
   1717     token=token,
   1718     cache_dir=cache_dir,
   1719     local_files_only=local_files_only,
   1720     _commit_hash=commit_hash,
   1721     _is_local=is_local,
   1722     trust_remote_code=trust_remote_code,
   1723     **kwargs,
   1724 )

File ~/isaacus/cookbooks/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1839, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   1837     continue  # User-provided kwargs take precedence
   1838 if isinstance(value, dict) and key != "extra_special_tokens":
-> 1839     value = AddedToken(**value, special=True)
   1840 elif key == "extra_special_tokens" and isinstance(value, list):
   1841     # Merge list tokens, converting dicts to AddedToken
   1842     existing = list(init_kwargs.get("extra_special_tokens") or [])

TypeError: tokenizers.AddedToken() got multiple values for keyword argument 'special'
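The mechanism behind the error can be reproduced without `tokenizers` installed. If the serialized token dict in `tokenizer_config.json` already carries a `special` key, then the call `AddedToken(**value, special=True)` on line 1839 passes `special` twice. A minimal sketch using a stand-in function (`added_token` and the example dict are illustrative, not the real API):

```python
def added_token(content="", special=False, **kwargs):
    """Stand-in for tokenizers.AddedToken, for illustration only."""
    return {"content": content, "special": special, **kwargs}

# Hypothetical serialized token dict that already stores the 'special' flag,
# as some tokenizer_config.json files do:
value = {"content": "<|endoftext|>", "special": True}

try:
    # Mirrors the failing call in tokenization_utils_base.py:
    # AddedToken(**value, special=True)
    added_token(**value, special=True)
except TypeError as e:
    # Python raises because 'special' arrives both via **value and explicitly
    print(e)
```

Any dict unpacked with `**` that shares a key with an explicit keyword argument triggers this same `TypeError`, which is why the load fails before the tokenizer is ever constructed.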

Expected behavior

The tokenizer should correctly load as it did previously.
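One possible fix sketch (not a confirmed patch, just the shape a guard could take) is to drop a `special` key that is already present in the serialized dict before unpacking it, so the explicit `special=True` keyword no longer collides:

```python
def strip_special_key(value: dict) -> dict:
    """Remove a 'special' key from a serialized added-token dict so it
    cannot collide with an explicit special=True keyword argument.
    Hypothetical helper, not part of transformers."""
    return {k: v for k, v in value.items() if k != "special"}

value = {"content": "<|endoftext|>", "lstrip": False, "special": True}
safe = strip_special_key(value)
# AddedToken(**safe, special=True) would now receive 'special' exactly once
print(safe)
```

Since the token is being treated as special either way, discarding the duplicate flag should preserve the pre-v5 loading behavior for configs that serialize it inline.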
