Name and Version
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 7971 (5fa1c19)
built with GNU 14.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
EPYC 7763 + 1 TB RAM + 4x3090 GPUs
Models
Kimi K2.5
Problem description & steps to reproduce
First, run the model like this (the issue is likely reproducible with very small vision-enabled models too, but it is much more noticeable with large ones: when the model is big, prefilling 100K-200K tokens from scratch to return to a dialog or a Roo Code project takes a long time if the saved cache cannot be restored); a stripped-down sketch with a hypothetical small model is given after the full command:
numactl --cpunodebind=0 --interleave=all /home/lissanro/pkgs/llama.cpp/build/bin/llama-server \
--model /mnt/neuro/models/Kimi-K2.5-Q4_X-VL.gguf \
--mmproj /mnt/neuro/models/Kimi-K2.5/mmproj-Kimi-K2.5-F16.gguf \
--fit on --fit-ctx 262144 -b 4096 -ub 4096 -fa on \
--threads 64 --host 0.0.0.0 --port 5000 \
--jinja \
--slot-save-path /var/cache/llama.cpp/k2.5 --cache-ram 131072 --fit-target 512 \
--min-p 0.01 --top-p 0.95 --temp 1.0 --top-k 100
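As noted above, the same failure should be reproducible with a much smaller vision-enabled model. A minimal sketch of such a launch, using hypothetical model paths and keeping only the flags that matter for this report (--mmproj and --slot-save-path):
# Minimal sketch (hypothetical paths): any vision-enabled model loaded together
# with its mmproj file and --slot-save-path should hit the same code path.
mkdir -p /tmp/llama-slot-cache
./build/bin/llama-server \
  --model /path/to/small-vision-model.gguf \
  --mmproj /path/to/mmproj-small-vision-model.gguf \
  -fa on --jinja \
  --slot-save-path /tmp/llama-slot-cache \
  --host 0.0.0.0 --port 5000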
Then try to save the cache:
curl --header "Content-Type: application/json" --request POST --data '{"filename":"cache.bin"}' "http://localhost:5000/slots/3?action=save"
It fails to save the cache. Please refer to the "Relevant log output" section for the exact error message. For K2.5 support I used #19127, but based on the error message I think this issue is reproducible with any vision model.
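For convenience, a scripted version of the same check (a sketch, assuming the server launched above is running on port 5000 and slot 3 exists) that prints the HTTP status together with the body:
#!/usr/bin/env bash
# Send the slot-save request and capture body + HTTP status (status appended
# on its own line by -w).
resp=$(curl -s -w '\n%{http_code}' \
  --header "Content-Type: application/json" \
  --request POST \
  --data '{"filename":"cache.bin"}' \
  "http://localhost:5000/slots/3?action=save")
body=$(echo "$resp" | head -n -1)
code=$(echo "$resp" | tail -n 1)
echo "HTTP $code: $body"
# With an mmproj loaded this currently returns the 501 "not_supported_error"
# shown in "Relevant log output" instead of saving the cache.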
First Bad Commit
No response
Relevant log output
{"error":{"code":501,"message":"This feature is not supported by multimodal","type":"not_supported_error"}}