Bug: " Tool calls support from mainline" patch causes VRAM overflow for models that worked before #756

@Lissanro

Description

What happened?

I can no longer load Kimi K2 after the "Tool calls support from mainline" (0f9ecae) patch:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 44063.76 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf'
 ERR [              load_model] unable to load model | tid="131404752666624" timestamp=1756967078 model="/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf"
free(): invalid pointer

I use this command:

numactl --cpunodebind=0 --interleave=all /home/lissanro/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ4_XS.gguf \
--ctx-size 131072 --n-gpu-layers 62 --tensor-split 15,25,30,30 -mla 3 -fa -ctk q8_0 -amb 512 -fmoe -b 4096 -ub 4096 \
-ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0, blk\.3\.ffn_down_exps=CUDA0" \
-ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1, blk\.4\.ffn_down_exps=CUDA1" \
-ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2, blk\.5\.ffn_down_exps=CUDA2" \
-ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3, blk\.6\.ffn_down_exps=CUDA3" \
-ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
--threads 64 --host 0.0.0.0 --port 5000 \
--slot-save-path /var/cache/ik_llama.cpp/k2
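
For anyone reproducing this, per-GPU memory during model load can be watched with something like the following sketch (the 1-second refresh interval is arbitrary):

# show per-GPU VRAM usage once per second while llama-server loads the model
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv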

Reverting patch 0f9ecae fixes the issue. On top of the revert, I also had to add the following to src/llama-vocab.cpp to fix a minor compile error caused by missing logging defines:

LLAMA_ATTRIBUTE_FORMAT(2, 3)
void llama_log_internal        (ggml_log_level level, const char * format, ...);
void llama_log_callback_default(ggml_log_level level, const char * text, void * user_data);

#define LLAMA_LOG_INFO(...)  llama_log_internal(GGML_LOG_LEVEL_INFO , __VA_ARGS__)
#define LLAMA_LOG_DEBUG(...) llama_log_internal(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)
#define LLAMA_LOG_WARN(...)  llama_log_internal(GGML_LOG_LEVEL_WARN , __VA_ARGS__)
#define LLAMA_LOG_ERROR(...) llama_log_internal(GGML_LOG_LEVEL_ERROR, __VA_ARGS__)
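
For completeness, the revert itself is just something like this minimal sketch (assuming 0f9ecae is a regular, non-merge commit; the checkout and build paths match the command above):

cd /home/lissanro/pkgs/ik_llama.cpp
git revert 0f9ecae      # undo "Tool calls support from mainline"
cmake --build build -j  # rebuild llama-server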

Name and Version

latest git

What operating system are you seeing the problem on?

Linux

Relevant log output
