Description
Name and Version
kanken@loom:~$ /home/kanken/code/llama.cpp-gfx906/build/bin/llama-cli --version
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
version: 7973 (6091bf8)
built with Clang 18.0.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
Hey all, I've been testing this out against mainline on my 3x MI50s, and the numbers I got are... strange.
mainline:
kanken@loom:~$ HIP_VISIBLE_DEVICES=0,1,2 \
numactl --interleave=all \
/home/kanken/code/llama.cpp/build/bin/llama-bench \
-m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
-ngl 999 \
-ts 1/1/1 \
-fa 1 \
-b 1024 \
-ub 8192 \
-t 16 \
-mmp 0 \
-p 512,2048,8192,16384,32768,65536 \
-n 128 \
--progress
ggml_cuda_init: found 3 ROCm devices (Total VRAM: 98256 MiB):
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | ts | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | 0 | pp512 | 420.11 ± 2.25 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | 0 | pp2048 | 570.90 ± 2.68 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
llama-bench: benchmark 3/7: prompt run 1/5
llama-bench: benchmark 3/7: prompt run 2/5
llama-bench: benchmark 3/7: prompt run 3/5
llama-bench: benchmark 3/7: prompt run 4/5
llama-bench: benchmark 3/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | 0 | pp8192 | 594.73 ± 0.97 |
llama-bench: benchmark 4/7: starting
llama-bench: benchmark 4/7: warmup prompt run
llama-bench: benchmark 4/7: prompt run 1/5
llama-bench: benchmark 4/7: prompt run 2/5
llama-bench: benchmark 4/7: prompt run 3/5
llama-bench: benchmark 4/7: prompt run 4/5
llama-bench: benchmark 4/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | 0 | pp16384 | 580.82 ± 0.75 |
llama-bench: benchmark 5/7: starting
llama-bench: benchmark 5/7: warmup prompt run

branch:
HIP_VISIBLE_DEVICES=0,1,2 \
numactl --interleave=all \
/home/kanken/code/llama.cpp-gfx906/build/bin/llama-bench \
-m /mnt/models/Storage/unsloth/Qwen3-Coder-Next-GGUF/Q8_0/Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf \
-ngl 999 \
-ts 1/1/1 \
-fa 1 \
-b 1024 \
-ub 8192 \
-t 16 \
-mmp 0 \
-p 512,2048,8192,16384,32768,65536 \
-n 128 \
--progress
ggml_cuda_init: found 3 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | ts | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | ------------ | --------------: | -------------------: |
llama-bench: benchmark 1/7: starting
llama-bench: benchmark 1/7: warmup prompt run
llama-bench: benchmark 1/7: prompt run 1/5
llama-bench: benchmark 1/7: prompt run 2/5
llama-bench: benchmark 1/7: prompt run 3/5
llama-bench: benchmark 1/7: prompt run 4/5
llama-bench: benchmark 1/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | pp512 | 249.01 ± 2.17 |
llama-bench: benchmark 2/7: starting
llama-bench: benchmark 2/7: warmup prompt run
llama-bench: benchmark 2/7: prompt run 1/5
llama-bench: benchmark 2/7: prompt run 2/5
llama-bench: benchmark 2/7: prompt run 3/5
llama-bench: benchmark 2/7: prompt run 4/5
llama-bench: benchmark 2/7: prompt run 5/5
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | ROCm | 999 | 16 | 1024 | 8192 | 1 | 1.00/1.00/1.00 | pp2048 | 278.43 ± 0.73 |
llama-bench: benchmark 3/7: starting
llama-bench: benchmark 3/7: warmup prompt run
^C
This branch's prefill performance is worse by nearly 2x. Is this just because the branch lacks some upstream optimizations for this model, or is there something wrong with my config here?
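For reference, "nearly 2x" comes straight from the t/s columns in the two tables above; a quick sketch of the mainline-to-branch ratios (numbers copied from this issue's logs, nothing measured fresh):

```shell
# Prompt-processing throughput ratio, mainline vs. this branch,
# using the t/s values reported by llama-bench above.
awk 'BEGIN {
    printf "pp512:  %.2fx\n", 420.11 / 249.01   # 420.11 t/s vs 249.01 t/s
    printf "pp2048: %.2fx\n", 570.90 / 278.43   # 570.90 t/s vs 278.43 t/s
}'
```

So the gap is about 1.7x at pp512 and right at 2x at pp2048, and it widens with batch size, which is why it looks like a prefill-path regression rather than run-to-run noise.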
First Bad Commit
No response
Relevant log output
No response