Eval bug: Huge TG performance regression with b6969 (ROCm)

### Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
version: 6969 (aa374175c)
built with AOMP_STANDALONE_22.0_roc7-1 clang version 22.0.0_AOMP_STANDALONE_22.0_roc7-1 (https://github.com/ROCm/llvm-project 5e5ac6bb724fe52fe05a96cd4fee1aea0142d40c) for x86_64-unknown-linux-gnu

### Operating systems

Linux

### GGML backends

HIP

### Hardware

Epyc 7B13 + 3X Radeon Pro VII

### Models

Qwen3-VL-30B-A3B-Instruct Q8_0
Qwen3-VL-32B-Instruct Q8_0
Qwen3-VL-235B-A22B-Instruct Q8_0
GLM-4.6 Q5_K_M


### Problem description & steps to reproduce

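Token generation (TG) speed is much lower with `b6969` than with `b6968`, and VRAM usage is higher.
On Qwen3-VL-32B, `tg128` drops from 15.32 to 8.36 t/s without `-sm row` and from 22.19 to 7.44 t/s with `-sm row`, while `pp512` is essentially unchanged.
VRAM usage rises by about 11 percentage points on each GPU (80% / 75% / 77% on `b6968` vs. 91% / 86% / 88% on `b6969`).
Every model listed above is affected, to varying degrees.
`-fa on` / `-fa off` doesn't change the result.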





**b6968 VRAM usage**

```
============================================= ROCm System Management Interface =============================================
======================================================= Concise Info =======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                         
============================================================================================================================
0       3     0x66a1,   28047  66.0°C  128.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  80%    99%
1       1     0x66a1,   40382  54.0°C  132.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  75%    99%
2       2     0x66a1,   52861  67.0°C  133.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  77%    99%
============================================================================================================================
=================================================== End of ROCm SMI Log ====================================================
```




**b6969 VRAM usage**
```
============================================= ROCm System Management Interface =============================================
======================================================= Concise Info =======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                         
============================================================================================================================
0       3     0x66a1,   28047  63.0°C  139.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  91%    100%
1       1     0x66a1,   40382  52.0°C  121.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  86%    100%
2       2     0x66a1,   52861  62.0°C  129.0W    N/A, N/A, 0         1654Mhz  1000Mhz  30.59%  manual  140.0W  88%    100%
============================================================================================================================
=================================================== End of ROCm SMI Log ====================================================
```

### First Bad Commit

b6969

### Relevant log output

```shell
b6968 llama-bench

llama-bench -m /home/user/text-generation-webui/models/Qwen3-vl-32b/Qwen3-VL-32B-Instruct-UD-Q8_K_XL.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |           pp512 |        110.06 ± 0.04 |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |           tg128 |         15.32 ± 0.01 |

build: 5b180c3d6 (6968)

llama-bench -m /home/user/text-generation-webui/models/Qwen3-vl-32b/Qwen3-VL-32B-Instruct-UD-Q8_K_XL.gguf -sm row
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |   row |           pp512 |        186.17 ± 0.58 |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |   row |           tg128 |         22.19 ± 0.02 |

build: 5b180c3d6 (6968)

b6969 llama-bench

llama-bench -m /home/user/text-generation-webui/models/Qwen3-vl-32b/Qwen3-VL-32B-Instruct-UD-Q8_K_XL.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |           pp512 |        110.57 ± 0.05 |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |           tg128 |          8.36 ± 0.01 |

build: aa374175c (6969)

llama-bench -m /home/user/text-generation-webui/models/Qwen3-vl-32b/Qwen3-VL-32B-Instruct-UD-Q8_K_XL.gguf -sm row
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Pro VII, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |    sm |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | --------------: | -------------------: |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |   row |           pp512 |        184.50 ± 1.41 |
| qwen3vl 32B Q8_0               |  36.76 GiB |    32.76 B | ROCm       |  99 |   row |           tg128 |          7.44 ± 0.00 |

build: aa374175c (6969)
```
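
To put the regression in relative terms, the `tg128` numbers from the tables above can be compared directly. This is only a small sketch of the arithmetic: the values are copied from the log output, and the `layer` label for the runs without `-sm row` is my assumption (it should be the default split mode, but it is not printed in those tables).

```python
# tg128 throughput in tokens/s, copied from the llama-bench tables in this report.
# Keys are (build, split mode). "layer" is an assumed label for the runs
# that were started without -sm row.
tg128 = {
    ("b6968", "layer"): 15.32,
    ("b6968", "row"): 22.19,
    ("b6969", "layer"): 8.36,
    ("b6969", "row"): 7.44,
}


def slowdown_percent(before: float, after: float) -> float:
    """Relative throughput loss going from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


for mode in ("layer", "row"):
    before = tg128[("b6968", mode)]
    after = tg128[("b6969", mode)]
    print(
        f"{mode:5s} tg128: {before:.2f} -> {after:.2f} t/s "
        f"({slowdown_percent(before, after):.1f}% slower)"
    )
```

For this model that works out to roughly 45% slower without `-sm row` and roughly 66% slower with it, so on `b6969` row split ends up slower than the default split instead of faster.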
