LlamaTokenizer class issue #40

@SJayYangNN

Description

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.

Hi! I'm running the LLM Tuner UI and ran into this warning, which was solved in another issue: https://github.com/huggingface/transformers/issues/22222#issuecomment-1477171703. However, whenever I simply change the tokenizer class name to LlamaTokenizer in tokenizer_config.json in the Hugging Face cache (~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf), another error pops up when I run the app.
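For reference, the fix from the linked transformers issue amounts to renaming the tokenizer class in that cached tokenizer_config.json, leaving the rest of the file as the checkpoint ships it:

    - "tokenizer_class": "LLaMATokenizer"
    + "tokenizer_class": "LlamaTokenizer"

With that edit in place, the app gets past the warning but then crashes while converting the slow tokenizer to a fast one: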

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 33/33 [00:13<00:00,  2.52it/s]
Traceback (most recent call last):
  File "llm_tuner/app.py", line 147, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "llm_tuner/app.py", line 119, in main
    prepare_base_model(Config.default_base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 262, in prepare_base_model
    Global.new_base_model_that_is_ready_to_be_used = get_new_base_model(
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 80, in get_new_base_model
    tokenizer = get_tokenizer(base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 156, in get_tokenizer
    raise e
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 143, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 700, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
    super().__init__(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1288, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.

Any idea on how to tackle this so that the model and tokenizer match up properly? And any insight on whether fine-tuning results are affected by the earlier runs where I didn't match up the class names?
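In case it's useful context, the workaround I'm experimenting with (just an untested sketch on my end) is to sidestep the convert_slow_tokenizer() call from the traceback by loading the slow tokenizer directly, and to pin protobuf below 4.x since the final TypeError is the protobuf 4.x descriptor check:

    # Untested sketch; the model name is the checkpoint from above.
    #
    # The TypeError at the bottom of the traceback is raised by protobuf 4.x
    # when it meets generated code from an older protoc, so one option is to
    # downgrade first:  pip install "protobuf==3.20.3"
    from transformers import AutoTokenizer

    # use_fast=False loads the slow SentencePiece-based LlamaTokenizer
    # directly, skipping convert_slow_tokenizer(), which is where the
    # crash happens.
    tokenizer = AutoTokenizer.from_pretrained(
        "decapoda-research/llama-7b-hf",
        use_fast=False,
    )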
