Misc. bug: Low performance against mainline #19

@Kanken6174

Description

Name and Version

kanken@loom:~$ /home/kanken/code/llama.cpp-gfx906/build/bin/llama-cli --version
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
version: 7973 (6091bf8)
built with Clang 18.0.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

Hey all, I've been testing this branch against mainline on my 3x MI50s, and the numbers I got are... strange.
mainline:

kanken@loom:~$ HIP_VISIBLE_DEVICES=0,1,2 \
numactl --interleave=all \
/home/kanken/code/llama.cpp/build/bin/llama-bench \
  -m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
  -ngl 999 \
  -ts 1/1/1 \
  -fa 1 \
  -b 1024 \
  -ub 8192 \
  -t 16 \
  -mmp 0 \
  -p 512,2048,8192,16384,32768,65536 \
  -n 128 \
  --progress
ggml_cuda_init: found 3 ROCm devices (Total VRAM: 98256 MiB):
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
  Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa | ts           | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |           pp512 |        420.11 ± 2.25 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |          pp2048 |        570.90 ± 2.68 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
llama-bench: benchmark 3/7: prompt run 1/5
llama-bench: benchmark 3/7: prompt run 2/5
llama-bench: benchmark 3/7: prompt run 3/5
llama-bench: benchmark 3/7: prompt run 4/5
llama-bench: benchmark 3/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |          pp8192 |        594.73 ± 0.97 |
llama-bench: benchmark 4/7: starting
llama-bench: benchmark 4/7: warmup prompt run
llama-bench: benchmark 4/7: prompt run 1/5
llama-bench: benchmark 4/7: prompt run 2/5
llama-bench: benchmark 4/7: prompt run 3/5
llama-bench: benchmark 4/7: prompt run 4/5
llama-bench: benchmark 4/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |         pp16384 |        580.82 ± 0.75 |
llama-bench: benchmark 5/7: starting
llama-bench: benchmark 5/7: warmup prompt run

branch:

HIP_VISIBLE_DEVICES=0,1,2 numactl --interleave=all /home/kanken/code/llama.cpp-gfx906/build/bin/llama-bench   -m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf   -ngl 999   -ts 1/1/1   -fa 1   -b 1024   -ub 8192   -t 16   -mmp 0   -p 512,2048,8192,16384,32768,65536   -n 128   --progress
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa | ts           |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |           pp512 |        249.01 ± 2.17 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |          pp2048 |        278.43 ± 0.73 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
^C

This branch's prompt-processing (prefill) performance is worse by nearly 2x. Is this just because the branch lacks some upstream optimizations for this model, or is there something wrong with my config here?
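For reference, the gap can be quantified directly from the t/s columns in the two benchmark tables above (a quick sketch; the figures are copied verbatim from the runs in this report):

```python
# t/s figures from the two llama-bench runs above (mainline vs. this branch)
mainline = {"pp512": 420.11, "pp2048": 570.90}
branch = {"pp512": 249.01, "pp2048": 278.43}

for test in mainline:
    ratio = mainline[test] / branch[test]
    print(f"{test}: mainline is {ratio:.2f}x faster")
# pp512:  1.69x
# pp2048: 2.05x
```

So the slowdown grows with prompt length, reaching roughly 2x at pp2048.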

First Bad Commit

No response

Relevant log output

No response
