
Vulkan: GGML_ASSERT failed on Kimi-Linear-48B with large context - maxComputeWorkGroupCount exceeded #19471

@iz0eyj

Description


Name and Version

federico@Sogliola:~$ llama-server --version
load_backend: loaded RPC backend from /home/federico/.local/share/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from /home/federico/.local/share/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/federico/.local/share/llamacpp/libggml-cpu-zen4.so
version: 7966 (8872ad2)
built with GNU 11.4.0 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

AMD Ryzen AI 9 HX370 (Strix Halo) with integrated Radeon 890M GPU

Models

https://huggingface.co/bartowski/Kimi-Linear-48B-Instruct-GGUF (model-00001-of-00002.gguf, Q8_0 quantization)

Problem description & steps to reproduce

Problem Description

When running Kimi-Linear-48B-Instruct Q8_0 with llama-server and the Vulkan backend, the server crashes with a GGML_ASSERT failure during prompt processing. The assertion fails because the computed workgroup dimensions exceed the GPU's maxComputeWorkGroupCount limits.
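
For reference, the assertion at ggml-vulkan.cpp:6225 (quoted in the log below) compares the three dispatch dimensions against the device's reported maxComputeWorkGroupCount; the Vulkan spec only guarantees 65535 per dimension. The limit reported by the RADV driver can be checked independently of llama.cpp, e.g. with vulkaninfo (look for maxComputeWorkGroupCount under the device limits), or with a minimal standalone program like the sketch below (a diagnostic illustration only, not llama.cpp code; build with something like g++ limits_query.cpp -lvulkan assuming the Vulkan headers and loader are installed):

// limits_query.cpp - hypothetical diagnostic, not part of llama.cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = {};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        // These are the limits the failing GGML_ASSERT checks against.
        printf("%s: maxComputeWorkGroupCount = [%u, %u, %u]\n",
               props.deviceName,
               props.limits.maxComputeWorkGroupCount[0],
               props.limits.maxComputeWorkGroupCount[1],
               props.limits.maxComputeWorkGroupCount[2]);
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}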

Steps to Reproduce

  1. Start llama-server with these parameters:
     llama-server -c 65536 --context-shift -b 8192 -ub 2048 -fa on --no-mmap --jinja --host 0.0.0.0 --port 1234 -m /path/to/Kimi-Linear-48B-Instruct-Q8_0.gguf
  2. Load a moderately sized text file (~38 KB, approximately 13609 tokens).

  3. The server crashes during prompt processing after 8192 tokens have been processed.

Error Output

slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 8192, progress = 0.601955
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6225: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed

Additional Context

  • The same test with Qwen3-VL-30B-8bit works perfectly with identical parameters
  • The issue appears specific to the Kimi-Linear-48B model with large batch sizes (see the dispatch-splitting sketch after this list)
  • The crash consistently happens at the same point during prompt processing
  • The model works fine with small inputs (chat mode) and only fails with larger contexts
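
A generic way to keep an oversized dispatch legal is to split it into several smaller dispatches that each fit within maxComputeWorkGroupCount. The standalone sketch below only demonstrates the chunking arithmetic under assumed numbers (65535 is the spec-mandated minimum limit and the total workgroup count is made up); it is not the actual ggml-vulkan code path:

// split_dispatch.cpp - illustration only; values and mapping are hypothetical
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t max_wg_count     = 65535;  // per-dimension limit guaranteed by the Vulkan spec
    const uint64_t total_workgroups = 131072; // hypothetical count produced by a large ubatch

    // Emit one dispatch per chunk; a real backend would pass the chunk's
    // starting offset to the shader (e.g. via push constants).
    for (uint64_t first = 0; first < total_workgroups; first += max_wg_count) {
        uint64_t chunk = std::min<uint64_t>(max_wg_count, total_workgroups - first);
        printf("dispatch workgroups [%llu, %llu)\n",
               (unsigned long long)first, (unsigned long long)(first + chunk));
    }
    return 0;
}

With these assumed numbers, the loop emits three dispatches of 65535, 65535, and 2 workgroups.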

First Bad Commit

No response

Relevant log output

slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 8192, progress = 0.601955
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:6225: GGML_ASSERT(wg0 <= ctx->device->properties.limits.maxComputeWorkGroupCount[0] && wg1 <= ctx->device->properties.limits.maxComputeWorkGroupCount[1] && wg2 <= ctx->device->properties.limits.maxComputeWorkGroupCount[2]) failed

