
Misc. bug: Slower performance on newer models Apriel-1.5-15b, granite-4.0-h-tiny #16454

@engrtipusultan

Description

Name and Version

$ llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 6700 (3df2244)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

VULKANINFO

Vulkan Instance Version: 1.3.275

========
GPU0:
apiVersion = 1.4.318
driverVersion = 25.2.3
vendorID = 0x1002
deviceID = 0x15e7
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon Graphics (RADV RENOIR)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 25.2.3 - kisak-mesa PPA
conformanceVersion = 1.4.0.0
deviceUUID = 00000000-0300-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000

$ hostnamectl
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Architecture: x86-64
Hardware Vendor: GMKtec
Hardware Model: M5 PLUS
Firmware Version: M5 PLUS 1.03

Which llama.cpp modules do you know to be affected?

llama-server, llama-bench

Command line

Problem description & steps to reproduce

I have been using the Qwen3 A3B models (coder, instruct and thinking) and I get about 12 t/s token generation at Q8_0 quantization. I decided to give newer models a try which have only 1B or 1.5B active parameters (A1B / A1.5B), expecting token generation to be higher on them, but it is actually much lower. The llama-bench commands I ran are given below; the same behaviour is observed in llama-server.

For example, with Apriel-1.5-15b I thought I would get roughly double the token generation speed, but it is horribly slow. Also, for some reason llama-bench detects it as llama 34B Q8_0, and I only get 3.12 t/s on generation compared to 12.79 t/s on Qwen3 A3B. Prompt processing is also significantly slower.
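For reference, this is a quick way to confirm which architecture string the GGUF metadata actually carries (a minimal sketch using the gguf-py package from the llama.cpp repo; the exact field-access details are my assumption and may differ between gguf-py versions):

```python
# Minimal sketch: print a few metadata fields from the GGUF file.
# Assumes the gguf-py package from the llama.cpp repo is installed (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("/home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf")

# String-valued metadata fields keep the value bytes in the last "part" of the field.
for key in ("general.architecture", "general.name"):
    field = reader.fields.get(key)
    if field is not None:
        print(key, "=", bytes(field.parts[-1]).decode("utf-8"))

# Tensor count as a rough sanity check on what was actually converted.
print("tensor count =", len(reader.tensors))
```

At minimum this should confirm which architecture llama.cpp is going by when it reports "llama 34B".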

Similarly, I tried granite-4.0-h-tiny. It is not as slow as Apriel-1.5-15b, but considering it is A1B with about 7B total parameters, it still has the same token generation speed as Qwen3 A3B (30B total). Granite tiny has 1B active parameters, whereas Qwen3 A3B has three times as many active parameters, i.e. 3B.
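My expectation was based on simple memory-bandwidth reasoning: at batch size 1, token generation is roughly limited by how many bytes of weights have to be read per generated token, so fewer active parameters should mean more tokens per second. Here is a rough back-of-the-envelope sketch of that expectation (the bandwidth figure is a made-up placeholder for this iGPU, the active-parameter counts are my assumptions, and it assumes only the active parameters are read per token, which may not hold for these architectures):

```python
# Back-of-the-envelope: tokens/s ≈ effective_bandwidth / bytes_of_weights_read_per_token.
# Q8_0 stores weights at roughly 8.5 bits per parameter (~1.06 bytes/param).
BYTES_PER_PARAM_Q8_0 = 8.5 / 8
EFFECTIVE_BANDWIDTH_GBS = 30  # placeholder for this iGPU's usable memory bandwidth (assumption)

# Active parameters read per generated token (my assumptions, not measured).
models = {
    "Qwen3-30B-A3B (3B active)":      3.0e9,
    "granite-4.0-h-tiny (1B active)": 1.0e9,
    "Apriel-1.5-15b if ~1.5B active": 1.5e9,
    "Apriel-1.5-15b if dense (15B)":  15.0e9,
}

for name, active_params in models.items():
    bytes_per_token = active_params * BYTES_PER_PARAM_Q8_0
    tps = EFFECTIVE_BANDWIDTH_GBS * 1e9 / bytes_per_token
    print(f"{name:32s} ~{tps:6.1f} t/s (bandwidth-bound estimate)")
```

If Apriel really had only ~1.5B active parameters I would expect it to be the fastest of the three, which is why the measured 3.12 t/s surprised me.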

I want to understand why this is. Is this how it is supposed to be, or will the implementation of these newer models in llama.cpp improve over time?

Excuse my ignorance if something obvious is going on; if so, please explain why.

First Bad Commit

I am not sure if it is actually a bug. I want to inquire.

Relevant log output

$ llama-bench -m /home/tipu/AI/models/other/Qwen3-Coder-30B-A3B-Distill/Qwen3-30B-A3B-Instruct-Coder-480B-Distill-v2-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | --: | --- | ---: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 94.84 ± 0.53 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.79 ± 0.09 |

build: 3df2244 (6700)

$ llama-bench -m /home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | --: | --- | ---: |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 50.11 ± 0.15 |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 3.12 ± 0.00 |

build: 3df2244 (6700)

$ llama-bench -m /home/tipu/AI/models/unsloth/Granite_4_tiny/granite-4.0-h-tiny-UD-Q8_K_XL.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: | --: | --: | --- | ---: |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 210.72 ± 1.30 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.53 ± 0.02 |
