Description
Name and Version
federico@Sogliola:~$ llama-server --version
load_backend: loaded RPC backend from /home/federico/.local/share/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /home/federico/.local/share/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/federico/.local/share/llamacpp/libggml-cpu-zen4.so
version: 7966 (8872ad2)
built with GNU 11.4.0 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan
Hardware
AMD Ryzen AI 9 HX 370 (Strix Point, RADV GFX1150) with integrated Radeon 890M GPU
Models
https://huggingface.co/bartowski/Kimi-Linear-48B-Instruct-GGUF (model-00001-of-00002.gguf, Q8_0 quantization)
Problem description & steps to reproduce
Problem Description
When running Kimi-Linear-48B-Instruct Q8_0 with llama-server and Vulkan backend, the server crashes with a GGML_ASSERT failure during prompt processing. The assertion fails because the computed workgroup dimensions exceed the GPU's maxComputeWorkGroupCount limits.
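For context, the limits named in the assertion are standard Vulkan device properties; the spec only guarantees 65535 workgroups per dispatch dimension, so any dispatch derived from large batches has to stay under whatever the driver reports. A minimal standalone sketch (not llama.cpp code) that reads those limits with the plain Vulkan C API:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <vulkan/vulkan.h>

int main() {
    // Minimal instance; no layers or extensions are needed just to read limits.
    VkInstanceCreateInfo ici = {};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "vkCreateInstance failed\n");
        return 1;
    }

    uint32_t n = 0;
    vkEnumeratePhysicalDevices(instance, &n, nullptr);
    std::vector<VkPhysicalDevice> devs(n);
    vkEnumeratePhysicalDevices(instance, &n, devs.data());

    for (VkPhysicalDevice d : devs) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(d, &props);
        const uint32_t *wg = props.limits.maxComputeWorkGroupCount;
        // These are the same values the failing GGML_ASSERT compares
        // wg0/wg1/wg2 against in ggml-vulkan.cpp.
        std::printf("%s: maxComputeWorkGroupCount = [%u, %u, %u]\n",
                    props.deviceName, wg[0], wg[1], wg[2]);
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

The same values are also printed by vulkaninfo under the device limits section, if that is easier to check.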
Steps to Reproduce
- Start llama-server with these parameters:
  llama-server -c 65536 --context-shift -b 8192 -ub 2048 -fa on --no-mmap --jinja --host 0.0.0.0 --port 1234 -m /path/to/Kimi-Linear-48B-Instruct-Q8_0.gguf
- Load a moderately sized text file (~38 KB, approximately 13,609 tokens)
- The server crashes during prompt processing after 8192 tokens have been processed
Error Output
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 8192, progress = 0.601955
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6225: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
Additional Context
- The same test with Qwen3-VL-30B-8bit works perfectly with identical parameters
- The issue appears specific to the Kimi-Linear-48B model with large batch sizes (see the sketch after this list)
- The crash consistently happens at the same point during prompt processing
- The model works fine with small inputs (chat mode) and only fails with larger contexts
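To make the batch-size dependence concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, a kernel that dispatches a number of workgroups proportional to the token count; the rows_per_tok factor below is invented and the actual grid computation in ggml-vulkan.cpp may be organized differently, but the overflow mechanism would be the same:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical dispatch-size calculation, for illustration only;
    // the real ggml-vulkan shaders compute their grids differently.
    const uint32_t max_wg_count = 65535; // Vulkan's guaranteed minimum per dimension
    const uint32_t n_tokens     = 8192;  // physical batch size from the repro (-b 8192)
    const uint32_t rows_per_tok = 16;    // assumed per-token factor for a hypothetical kernel

    uint32_t wg0 = n_tokens * rows_per_tok; // 131072 workgroups on axis 0
    if (wg0 > max_wg_count) {
        // This is the condition the GGML_ASSERT at ggml-vulkan.cpp:6225 guards:
        // such a dispatch would need to be split or remapped across dimensions.
        std::printf("wg0 = %u exceeds limit %u -> would trip the assert\n",
                    wg0, max_wg_count);
    }
    return 0;
}
```

This would also be consistent with small chat inputs working: at small token counts the computed grid stays below the per-dimension limit.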
First Bad Commit
No response
Relevant log output
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 8192, progress = 0.601955
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6225: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6225: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed