LlamaTokenizer class issue #40

@SJayYangNN

Description

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. The class this function is called from is 'LlamaTokenizer'.

Hi! I'm running the LLM Tuner UI and ran into this warning, which was solved in another issue: https://github.com/huggingface/transformers/issues/22222#issuecomment-1477171703. However, whenever I simply change the tokenizer class name to LlamaTokenizer in tokenizer_config.json in the Hugging Face cache (~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf), another error pops up when I run the app.
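For reference, the fix from the linked transformers issue amounts to renaming the tokenizer class in that cached tokenizer_config.json, leaving the rest of the file as the checkpoint ships it:

    - "tokenizer_class": "LLaMATokenizer"
    + "tokenizer_class": "LlamaTokenizer"

With that edit in place, the app gets past the warning but then crashes while converting the slow tokenizer to a fast one: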

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████| 33/33 [00:13<00:00,  2.52it/s]
Traceback (most recent call last):
  File "llm_tuner/app.py", line 147, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "llm_tuner/app.py", line 119, in main
    prepare_base_model(Config.default_base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 262, in prepare_base_model
    Global.new_base_model_that_is_ready_to_be_used = get_new_base_model(
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 80, in get_new_base_model
    tokenizer = get_tokenizer(base_model_name)
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 156, in get_tokenizer
    raise e
  File "/home/gcpuser/sky_workdir/llm_tuner/llama_lora/models.py", line 143, in get_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 700, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
    super().__init__(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 1288, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/llm-tuner/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.

Any idea on how to tackle this so that the model and tokenizer match up properly? And any insight on whether fine-tuning results are affected by the earlier runs where I didn't match up the class names?
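In case it's useful context, the workaround I'm experimenting with (just an untested sketch on my end) is to sidestep the convert_slow_tokenizer() call from the traceback by loading the slow tokenizer directly, and to pin protobuf below 4.x since the final TypeError is the protobuf 4.x descriptor check:

    # Untested sketch; the model name is the checkpoint from above.
    #
    # The TypeError at the bottom of the traceback is raised by protobuf 4.x
    # when it meets generated code from an older protoc, so one option is to
    # downgrade first:  pip install "protobuf==3.20.3"
    from transformers import AutoTokenizer

    # use_fast=False loads the slow SentencePiece-based LlamaTokenizer
    # directly, skipping convert_slow_tokenizer(), which is where the
    # crash happens.
    tokenizer = AutoTokenizer.from_pretrained(
        "decapoda-research/llama-7b-hf",
        use_fast=False,
    )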
