Name and Version
[docker@7158e8afaf9c ~]$ llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 6527 (7f76692)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
No response
Command line
llama-batched-bench -m .cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 --no-mmap -c 0 -ntg 128 -npp 512 -npl 1,2,3,4,5,6,7,8
Problem description & steps to reproduce
When benching dense models through llama-batched-bench, the Vulkan backend shows near-linear scaling across all batch sizes, e.g. Qwen3-8B Q8_0 (a quick efficiency calculation follows the table):
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.758 | 675.37 | 4.953 | 25.84 | 5.711 | 112.06 |
| 512 | 128 | 2 | 1280 | 1.382 | 740.84 | 5.058 | 50.61 | 6.440 | 198.75 |
| 512 | 128 | 3 | 1920 | 2.282 | 673.16 | 5.257 | 73.04 | 7.539 | 254.67 |
| 512 | 128 | 4 | 2560 | 2.913 | 702.98 | 5.441 | 94.09 | 8.355 | 306.41 |
| 512 | 128 | 5 | 3200 | 3.684 | 694.80 | 5.593 | 114.43 | 9.277 | 344.93 |
| 512 | 128 | 6 | 3840 | 4.408 | 696.92 | 5.841 | 131.47 | 10.249 | 374.66 |
| 512 | 128 | 7 | 4480 | 5.227 | 685.71 | 6.002 | 149.29 | 11.228 | 398.99 |
| 512 | 128 | 8 | 5120 | 5.935 | 690.16 | 6.202 | 165.11 | 12.137 | 421.85 |
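For concreteness, here is the text-generation scaling efficiency implied by the table above, i.e. S_TG(B) / (B · S_TG(1)), where 1.0 would be perfectly linear:

```python
# TG throughput (S_TG t/s) for the dense Qwen3-8B Q8_0 run, copied from the table.
s_tg = [25.84, 50.61, 73.04, 94.09, 114.43, 131.47, 149.29, 165.11]

# Efficiency relative to perfectly linear scaling from the B=1 baseline.
for b, s in enumerate(s_tg, start=1):
    print(f"B={b}: {s / (b * s_tg[0]):.2f}")
# -> 1.00, 0.98, 0.94, 0.91, 0.89, 0.85, 0.83, 0.80
# Per-sequence throughput degrades gracefully while aggregate throughput
# rises monotonically, which is the behaviour I would also expect from MoE.
```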
But when running the same benchmark with a MoE model (gpt-oss-120b in this case), there is negative scaling at batch sizes 2 and 3. I know MoE models scale worse, since not every sequence activates the same experts (so there is less weight sharing between sequences), but I would expect some positive improvement as batch size increases, not the outright regression shown below (a rough model of the expert-overlap effect follows the table):
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 1.281 | 399.82 | 2.531 | 50.58 | 3.811 | 167.92 |
| 512 | 128 | 2 | 1280 | 2.527 | 405.27 | 7.296 | 35.09 | 9.823 | 130.31 |
| 512 | 128 | 3 | 1920 | 3.879 | 395.98 | 8.605 | 44.62 | 12.484 | 153.79 |
| 512 | 128 | 4 | 2560 | 4.960 | 412.93 | 9.623 | 53.21 | 14.582 | 175.55 |
| 512 | 128 | 5 | 3200 | 6.187 | 413.78 | 10.704 | 59.79 | 16.891 | 189.45 |
| 512 | 128 | 6 | 3840 | 7.419 | 414.05 | 11.554 | 66.47 | 18.974 | 202.39 |
| 512 | 128 | 7 | 4480 | 8.851 | 404.92 | 12.547 | 71.41 | 21.398 | 209.36 |
| 512 | 128 | 8 | 5120 | 9.971 | 410.79 | 13.604 | 75.27 | 23.575 | 217.18 |
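The same calculation on the MoE table makes the anomaly explicit, and a rough Monte Carlo sketch (assuming uniform top-4 routing over 128 experts, which is my understanding of gpt-oss-120b's configuration; treat the numbers as illustrative) shows why I would still expect positive scaling:

```python
import random

# Aggregate TG throughput relative to B=1 for the gpt-oss-120b run above.
s_tg = [50.58, 35.09, 44.62, 53.21, 59.79, 66.47, 71.41, 75.27]
for b, s in enumerate(s_tg, start=1):
    print(f"B={b}: {s / s_tg[0]:.2f}x")
# -> 1.00x, 0.69x, 0.88x, ... : B=2 and B=3 are absolute regressions.

# Expected number of distinct experts touched per MoE layer when B tokens
# each pick top-k of n_expert experts (uniform routing assumed purely for
# illustration; real routing is learned and less uniform).
def expected_distinct(n_expert=128, k=4, batch=1, trials=10_000):
    total = 0
    for _ in range(trials):
        active = set()
        for _ in range(batch):
            active.update(random.sample(range(n_expert), k))
        total += len(active)
    return total / trials

for b in range(1, 9):
    print(f"B={b}: ~{expected_distinct(batch=b):.1f} distinct experts")
# -> ~4.0, ~7.9, ~11.6, ... : sublinear growth, so the expert weights read
# per generated token shrink as B grows and throughput should only improve.
```

Under that model each additional sequence adds fewer than four experts' worth of new weight traffic per layer, so dropping below the single-sequence throughput looks like a backend issue rather than an inherent MoE cost.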
First Bad Commit
No response