Description
Name and Version
bash llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 6700 (3df2244)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
VULKANINFO
Vulkan Instance Version: 1.3.275
========
GPU0:
apiVersion = 1.4.318
driverVersion = 25.2.3
vendorID = 0x1002
deviceID = 0x15e7
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon Graphics (RADV RENOIR)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 25.2.3 - kisak-mesa PPA
conformanceVersion = 1.4.0.0
deviceUUID = 00000000-0300-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
hostnamectl
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Architecture: x86-64
Hardware Vendor: GMKtec
Hardware Model: M5 PLUS
Firmware Version: M5 PLUS 1.03
Which llama.cpp modules do you know to be affected?
llama-server, llama-bench
Command line
Problem description & steps to reproduce
I have been using the Qwen3 A3B models (Coder, Instruct and Thinking) and I get 12 t/s token generation at Q8_0 quantization. I decided to give newer models a try that have A1B or A1.5B parameters, thinking token generation would be higher on them, but it is actually much lower. The llama-bench commands I ran are given below; the same is observed in llama-server.
For example, with Apriel-1.5-15b I thought I would get roughly double the token generation speed, but it is horribly slow.
Also, for some reason llama-bench detects it as "llama 34B Q8_0", and I only get 3.12 t/s generation compared to 12.79 t/s on Qwen3 A3B. Prompt processing is also significantly slower.
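For reference, here is a minimal sketch (assuming the gguf Python package from llama.cpp's gguf-py is installed) of how I could check what architecture metadata the file actually declares; the expert-count key names are guesses based on the "llama" label that llama-bench prints and may simply not be present:

```python
# Minimal sketch: dump a few GGUF metadata keys with the gguf Python package
# (pip install gguf). The key names below are guesses; missing keys are reported.
from gguf import GGUFReader, GGUFValueType

MODEL = "/home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf"

def field_value(field):
    # For scalar and string fields the payload is the part indexed by data[0].
    raw = field.parts[field.data[0]]
    if field.types and field.types[-1] == GGUFValueType.STRING:
        return bytes(raw).decode("utf-8")
    return raw[0]

reader = GGUFReader(MODEL)
for key in ("general.architecture", "general.name",
            "llama.expert_count", "llama.expert_used_count"):
    field = reader.fields.get(key)
    print(f"{key}: {field_value(field) if field is not None else '<not present>'}")
```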
Similarly, I tried granite-4.0-h-tiny. It is not as slow as Apriel-1.5-15b, but considering it is A1B with only 7B total parameters, it still has about the same token generation speed as Qwen3 A3B 30B. Granite Tiny has 1B active parameters, whereas Qwen3 A3B has three times as many, i.e. 3B.
I want to understand why this is. Is this how it is supposed to be, or will the implementation of these newer models in llama.cpp improve over time?
Excuse my ignorance if something obvious is going on; if so, please explain why.
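My rough back-of-envelope expectation (an assumption, not a measurement): if token generation is mostly memory-bandwidth bound, t/s should be roughly memory bandwidth divided by the bytes of active weights streamed per token. The bandwidth figure in the sketch below is a placeholder for this iGPU system, and since llama-bench reports 14.43 B params for Apriel with no expert count, I also include a hypothetical all-parameters-active line for comparison:

```python
# Back-of-envelope estimate, assuming token generation is memory-bandwidth bound.
# All numbers below are assumptions/placeholders, not measurements.

BYTES_PER_WEIGHT_Q8_0 = 1.0625   # Q8_0: 32 int8 weights + one fp16 scale per block

def est_tg_tps(active_params_billions: float, bandwidth_gb_s: float) -> float:
    """Estimated tokens/s if every token must stream the active weights from memory."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_WEIGHT_Q8_0
    return bandwidth_gb_s * 1e9 / bytes_per_token

ASSUMED_BW_GB_S = 45.0           # hypothetical bandwidth for this DDR4/DDR5 iGPU box

for name, active_b in [
    ("Qwen3 30B A3B (3B active)",               3.0),
    ("granite-4.0-h-tiny (1B active)",          1.0),
    ("Apriel-1.5-15b if all ~15B were active", 15.0),
]:
    print(f"{name:42s} ~{est_tg_tps(active_b, ASSUMED_BW_GB_S):5.1f} t/s")
```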
First Bad Commit
I am not sure whether this is actually a bug; I mainly want to inquire.
Relevant log output
bash llama-bench -m /home/tipu/AI/models/other/Qwen3-Coder-30B-A3B-Distill/Qwen3-30B-A3B-Instruct-Coder-480B-Distill-v2-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 94.84 ± 0.53 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.79 ± 0.09 |
build: 3df2244 (6700)
tipu-dev-machine ~/Applications/llamaserver 10:12:30
bash llama-bench -m /home/tipu/AI/models/unsloth/Apriel/Apriel-1.5-15b-Thinker-Q8_0.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 50.11 ± 0.15 |
| llama 34B Q8_0 | 14.28 GiB | 14.43 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 3.12 ± 0.00 |
build: 3df2244 (6700)
tipu-dev-machine ~/Applications/llamaserver 10:17:17
bash llama-bench -m /home/tipu/AI/models/unsloth/Granite_4_tiny/granite-4.0-h-tiny-UD-Q8_K_XL.gguf --ubatch-size 4096 --batch-size 512 --threads 4 --mmap 0
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | pp512 | 210.72 ± 1.30 |
| granitehybrid ?B Q8_0 | 7.73 GiB | 6.94 B | Vulkan | 99 | 4 | 512 | 4096 | 0 | tg128 | 12.53 ± 0.02 |