Bug: " Tool calls support from mainline" patch causes VRAM overflow for models that worked before #756

@Lissanro

Description

What happened?

I can no longer load Kimi K2 after the "Tool calls support from mainline" (0f9ecae) patch:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 44063.76 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf'
 ERR [              load_model] unable to load model | tid="131404752666624" timestamp=1756967078 model="/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf"
free(): invalid pointer

I use this command:

numactl --cpunodebind=0 --interleave=all /home/lissanro/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf \
--ctx-size 131072 --n-gpu-layers 62 --tensor-split 15,25,30,30 -mla 3 -fa -ctk q8_0 -amb 512 -fmoe -b 4096 -ub 4096 \
-ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0, blk\.3\.ffn_down_exps=CUDA0" \
-ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1, blk\.4\.ffn_down_exps=CUDA1" \
-ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2, blk\.5\.ffn_down_exps=CUDA2" \
-ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3, blk\.6\.ffn_down_exps=CUDA3" \
-ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
--threads 64 --host 0.0.0.0 --port 5000 \
--slot-save-path /var/cache/ik_llama.cpp/k2
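
For anyone reproducing this, per-GPU memory during model load can be watched with something like the following sketch (the 1-second refresh interval is arbitrary):

# show per-GPU VRAM usage once per second while llama-server loads the model
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv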

Reverting patch 0f9ecae fixes the issue. On top of the revert, I also had to add the following to src/llama-vocab.cpp to fix a minor compile error caused by missing logging defines:

LLAMA_ATTRIBUTE_FORMAT(2, 3)
void llama_log_internal        (ggml_log_level level, const char * format, ...);
void llama_log_callback_default(ggml_log_level level, const char * text, void * user_data);

#define LLAMA_LOG_INFO(...)  llama_log_internal(GGML_LOG_LEVEL_INFO , __VA_ARGS__)
#define LLAMA_LOG_DEBUG(...) llama_log_internal(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LLAMA_LOG_WARN(...)  llama_log_internal(GGML_LOG_LEVEL_WARN , __VA_ARGS__)
#define LLAMA_LOG_ERROR(...) llama_log_internal(GGML_LOG_LEVEL_ERROR, __VA_ARGS__)
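
For completeness, the revert itself is just something like this minimal sketch (assuming 0f9ecae is a regular, non-merge commit; the checkout and build paths match the command above):

cd /home/lissanro/pkgs/ik_llama.cpp
git revert 0f9ecae      # undo "Tool calls support from mainline"
cmake --build build -j  # rebuild llama-server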

Name and Version

latest git

What operating system are you seeing the problem on?

Linux

Relevant log output
