Skip to content

Conversation

@davidmroth
Copy link

@davidmroth davidmroth commented Jan 26, 2025

Thank you for the great work that you're doing!! Much appreciated!

Updated llama_token_is_eog to use llama_vocab_is_eog instead. Fixes this error:

2025-01-26 17:09:39.200 | INFO     | __main__:__main__:55 - Model error: module 'llama_cpp.llama_cpp' has no attribute 'llama_token_is_eog'

llama_token_is_eog was deprecated two weeks ago by this merge:

ggml-org/llama.cpp@afa8a9e#diff-dd7c3bc82b7d2728f1fe661e7820aebbfd53f558a5fdd85bbd6dd618213a118d

With your PR, I don't get the dredded: llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen' error, but now I get no output. Still researching...

Using this code to test

"""
Test the Llama model using the libllama library.

Requires the following to be installed:
- Nvidia container toolkit for GPU acceleration
"""

import sys
import subprocess
import llama_cpp

from llama_cpp import Llama

try:
    from loguru import logger
except ImportError:

    def install_package(package_name):
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])

    install_package("loguru")
    from loguru import logger

WEIGHTS_DIR = "/weights"
llm = None


def __main__() -> Llama | None:
    global llm

    try:
        model_path = f"{WEIGHTS_DIR}/Qwen2.5-32B-Instruct-Q5_K_S.gguf"
        logger.info(f"\n\n>>>>>>>>>>>>>>> Loading: {model_path} using llama_cpp version: {llama_cpp.__version__}\n\n ")

        # Initialize the model
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_gpu_layers=-1,
            n_threads=8,
            last_n_tokens_size=64,
            max_tokens=2048,
            temperature=0.1,
            top_p=0.95,
            repeat_penalty=1.3,
            top_k=50,
        )

        response = llm("What is the capital of France?", max_tokens=64)

        logger.info(f"\n\n[*] Output: {response}\n\n")

    except Exception as exc:
        llm = None
        logger.info(f"Model error: {exc}")


__main__()

Output

>>>>>>>>>>>>>>> Loading: /weights/DeepSeek-R1-Distill-Qwen-7B-f16.gguf using llama_cpp version: 0.3.6

 
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23064 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 339 tensors from /weights/DeepSeek-R1-Distill-Qwen-7B-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - kv  25:                          general.file_type u32              = 1
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type  f16:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 14.19 GiB (16.00 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = DeepSeek R1 Distill Qwen 7B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 148848 'ÄĬ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: tensor 'token_embd.weight' (f16) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size = 13486.77 MiB
load_tensors:   CPU_Mapped model buffer size =  1039.50 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init:      CUDA0 KV buffer size =   112.00 MiB
llama_init_from_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   304.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_init_from_model: graph nodes  = 986
llama_init_from_model: graph splits = 2
CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
Model metadata: {'quantize.imatrix.entries_count': '196', 'quantize.imatrix.dataset': '/training_dir/calibration_datav3.txt', 'quantize.imatrix.chunks_count': '128', 'quantize.imatrix.file': '/models_out/DeepSeek-R1-Distill-Qwen-7B-GGUF/DeepSeek-R1-Distill-Qwen-7B.imatrix', 'general.file_type': '1', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.bos_token_id': '151646', 'general.architecture': 'qwen2', 'tokenizer.ggml.padding_token_id': '151643', 'general.basename': 'DeepSeek-R1-Distill-Qwen', 'qwen2.embedding_length': '3584', 'tokenizer.ggml.pre': 'deepseek-r1-qwen', 'general.name': 'DeepSeek R1 Distill Qwen 7B', 'qwen2.block_count': '28', 'general.type': 'model', 'general.size_label': '7B', 'qwen2.context_length': '131072', 'tokenizer.chat_template': "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin |><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}", 'qwen2.attention.head_count_kv': '4', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'qwen2.feed_forward_length': '18944', 'qwen2.attention.layer_norm_rms_epsilon': '0.000001', 'qwen2.attention.head_count': '28', 'tokenizer.ggml.eos_token_id': '151643', 'qwen2.rope.freq_base': '10000.000000'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}
Using chat eos_token: <|end▁of▁sentence|>
Using chat bos_token: <|begin▁of▁sentence|>
2025-01-26 16:08:28.410 | INFO     | __main__:__main__:51 - 

[*] Output: 

abetlen#1901

@JamePeng
Copy link
Owner

LGTM

@JamePeng JamePeng merged commit 017e2a6 into JamePeng:main Jan 27, 2025
@davidmroth davidmroth deleted the fix-deprecated branch January 28, 2025 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants