Misc. bug: Low performance against mainline #19

@Kanken6174

Description

Name and Version

kanken@loom:~$ /home/kanken/code/llama.cpp-gfx906/build/bin/llama-cli --version
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
version: 7973 (6091bf8)
built with Clang 18.0.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

Hey all, I've been testing this branch against mainline on my 3x MI50s, and the numbers I got are... strange.
mainline:

kanken@loom:~$ HIP_VISIBLE_DEVICES=0,1,2 \
numactl --interleave=all \
/home/kanken/code/llama.cpp/build/bin/llama-bench \
  -m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
  -ngl 999 \
  -ts 1/1/1 \
  -fa 1 \
  -b 1024 \
  -ub 8192 \
  -t 16 \
  -mmp 0 \
  -p 512,2048,8192,16384,32768,65536 \
  -n 128 \
  --progress
ggml_cuda_init: found 3 ROCm devices (Total VRAM: 98256 MiB):
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
  Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa | ts           | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |           pp512 |        420.11 ± 2.25 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |          pp2048 |        570.90 ± 2.68 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
llama-bench: benchmark 3/7: prompt run 1/5
llama-bench: benchmark 3/7: prompt run 2/5
llama-bench: benchmark 3/7: prompt run 3/5
llama-bench: benchmark 3/7: prompt run 4/5
llama-bench: benchmark 3/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |          pp8192 |        594.73 ± 0.97 |
llama-bench: benchmark 4/7: starting
llama-bench: benchmark 4/7: warmup prompt run
llama-bench: benchmark 4/7: prompt run 1/5
llama-bench: benchmark 4/7: prompt run 2/5
llama-bench: benchmark 4/7: prompt run 3/5
llama-bench: benchmark 4/7: prompt run 4/5
llama-bench: benchmark 4/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |    0 |         pp16384 |        580.82 ± 0.75 |
llama-bench: benchmark 5/7: starting
llama-bench: benchmark 5/7: warmup prompt run

branch:

HIP_VISIBLE_DEVICES=0,1,2 numactl --interleave=all /home/kanken/code/llama.cpp-gfx906/build/bin/llama-bench   -m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf   -ngl 999   -ts 1/1/1   -fa 1   -b 1024   -ub 8192   -t 16   -mmp 0   -p 512,2048,8192,16384,32768,65536   -n 128   --progress
ggml_cuda_init: found 3 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa | ts           |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |           pp512 |        249.01 ± 2.17 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0         |  78.98 GiB |    79.67 B | ROCm       | 999 |      16 |    1024 |     8192 |  1 | 1.00/1.00/1.00 |          pp2048 |        278.43 ± 0.73 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
^C

This branch's prompt-processing (prefill) performance is worse by nearly 2x. Is this just because the branch lacks some upstream optimizations for this model, or is there something wrong with my config here?
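For reference, the gap can be quantified directly from the t/s columns in the two benchmark tables above (a quick sketch; the figures are copied verbatim from the runs in this report):

```python
# t/s figures from the two llama-bench runs above (mainline vs. this branch)
mainline = {"pp512": 420.11, "pp2048": 570.90}
branch = {"pp512": 249.01, "pp2048": 278.43}

for test in mainline:
    ratio = mainline[test] / branch[test]
    print(f"{test}: mainline is {ratio:.2f}x faster")
# pp512:  1.69x
# pp2048: 2.05x
```

So the slowdown grows with prompt length, reaching roughly 2x at pp2048.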

First Bad Commit

No response

Relevant log output

No response
