Description
Name and Version
bash llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 6700 (3df2244)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
VULKANINFO
Vulkan Instance Version: 1.3.275
========
GPU0:
apiVersion = 1.4.318
driverVersion = 25.2.3
vendorID = 0x1002
deviceID = 0x15e7
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon Graphics (RADV RENOIR)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 25.2.3 - kisak-mesa PPA
conformanceVersion = 1.4.0.0
deviceUUID = 00000000-0300-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
hostnamectl
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Architecture: x86-64
Hardware Vendor: GMKtec
Hardware Model: M5 PLUS
Firmware Version: M5 PLUS 1.03
Which llama.cpp modules do you know to be affected?
llama-server, llama-bench
Command line
Problem description & steps to reproduce
I have been using the Qwen3 A3B models (Coder, Instruct and Thinking) and I get 12 t/s token generation at Q8_0 quantization. I decided to give newer models a try that have A1B or A1.5B parameters, thinking token generation would be higher on them, but it is actually much lower. The llama-bench commands I ran are given below; the same is observed in llama-server.
For example, with Apriel-1.5-15b I thought I would get roughly double the token generation speed, but it is horribly slow.
Also, for some reason llama-bench detects it as "llama 34B Q8_0", and I only get 3.12 t/s generation compared to 12.79 t/s on Qwen3 A3B. Prompt processing is also significantly slower.
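For reference, here is a minimal sketch (assuming the gguf Python package from llama.cpp's gguf-py is installed) of how I could check what architecture metadata the file actually declares; the expert-count key names are guesses based on the "llama" label that llama-bench prints and may simply not be present:

```python
# Minimal sketch: dump a few GGUF metadata keys with the gguf Python package
# (pip install gguf). The key names below are guesses; missing keys are reported.
from gguf import GGUFReader, GGUFValueType

MODEL = "/home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf"

def field_value(field):
    # For scalar and string fields the payload is the part indexed by data[0].
    raw = field.parts[field.data[0]]
    if field.types and field.types[-1] == GGUFValueType.STRING:
        return bytes(raw).decode("utf-8")
    return raw[0]

reader = GGUFReader(MODEL)
for key in ("general.architecture", "general.name",
            "llama.expert_count", "llama.expert_used_count"):
    field = reader.fields.get(key)
    print(f"{key}: {field_value(field) if field is not None else '<not present>'}")
```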
Similarly, I tried granite-4.0-h-tiny. It is not as slow as Apriel-1.5-15b, but considering it is A1B with only 7B total parameters, it still has about the same token generation speed as Qwen3 A3B 30B. Granite Tiny has 1B active parameters, whereas Qwen3 A3B has three times as many, i.e. 3B.
I want to understand why this is. Is this how it is supposed to be, or will the implementation of these newer models in llama.cpp improve over time?
Excuse my ignorance if something obvious is going on; if so, please explain why.
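My rough back-of-envelope expectation (an assumption, not a measurement): if token generation is mostly memory-bandwidth bound, t/s should be roughly memory bandwidth divided by the bytes of active weights streamed per token. The bandwidth figure in the sketch below is a placeholder for this iGPU system, and since llama-bench reports 14.43 B params for Apriel with no expert count, I also include a hypothetical all-parameters-active line for comparison:

```python
# Back-of-envelope estimate, assuming token generation is memory-bandwidth bound.
# All numbers below are assumptions/placeholders, not measurements.

BYTES_PER_WEIGHT_Q8_0 = 1.0625   # Q8_0: 32 int8 weights + one fp16 scale per block

def est_tg_tps(active_params_billions: float, bandwidth_gb_s: float) -> float:
    """Estimated tokens/s if every token must stream the active weights from memory."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_WEIGHT_Q8_0
    return bandwidth_gb_s * 1e9 / bytes_per_token

ASSUMED_BW_GB_S = 45.0           # hypothetical bandwidth for this DDR4/DDR5 iGPU box

for name, active_b in [
    ("Qwen3 30B A3B (3B active)",               3.0),
    ("granite-4.0-h-tiny (1B active)",          1.0),
    ("Apriel-1.5-15b if all ~15B were active", 15.0),
]:
    print(f"{name:42s} ~{est_tg_tps(active_b, ASSUMED_BW_GB_S):5.1f} t/s")
```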
First Bad Commit
I am not sure whether this is actually a bug; I mainly want to inquire.
Relevant log output
bash llama-bench -m /home/tipu/AI/models/other/Qwen3-Coder-30B-A3B-Distill/Qwen3-30B-A3B-Instruct-Coder-480B-Distill-v2-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 94.84 ± 0.53 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.79 ± 0.09 |
build: 3df2244 (6700)
tipu-dev-machine ~/Applications/llamaserver 10:12:30
bash llama-bench -m /home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 50.11 ± 0.15 |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 3.12 ± 0.00 |
build: 3df2244 (6700)
tipu-dev-machine ~/Applications/llamaserver 10:17:17
bash llama-bench -m /home/tipu/AI/models/unsloth/Granite_4_tiny/granite-4.0-h-tiny-UD-Q8_K_XL.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 210.72 ± 1.30 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.53 ± 0.02 |