Misc. bug: Vulkan backend shows negative scaling at low batch sizes with MOE models #16134

@Mushoz

Description

Name and Version

[docker@7158e8afaf9c ~]$ llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 6527 (7f76692)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

llama-batched-bench -m .cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 --no-mmap -c 0 -ntg 128 -npp 512 -npl 1,2,3,4,5,6,7,8
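
(This sweeps parallel batch sizes 1 through 8, with each sequence processing a 512-token prompt and generating 128 tokens, matching the PP/TG/B columns in the tables below.)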

Problem description & steps to reproduce

When benchmarking dense models with llama-batched-bench, the Vulkan backend shows good scaling across all batch sizes. E.g., Qwen3-8b q8_0:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    0.758 |   675.37 |    4.953 |    25.84 |    5.711 |   112.06 |
|   512 |    128 |    2 |   1280 |    1.382 |   740.84 |    5.058 |    50.61 |    6.440 |   198.75 |
|   512 |    128 |    3 |   1920 |    2.282 |   673.16 |    5.257 |    73.04 |    7.539 |   254.67 |
|   512 |    128 |    4 |   2560 |    2.913 |   702.98 |    5.441 |    94.09 |    8.355 |   306.41 |
|   512 |    128 |    5 |   3200 |    3.684 |   694.80 |    5.593 |   114.43 |    9.277 |   344.93 |
|   512 |    128 |    6 |   3840 |    4.408 |   696.92 |    5.841 |   131.47 |   10.249 |   374.66 |
|   512 |    128 |    7 |   4480 |    5.227 |   685.71 |    6.002 |   149.29 |   11.228 |   398.99 |
|   512 |    128 |    8 |   5120 |    5.935 |   690.16 |    6.202 |   165.11 |   12.137 |   421.85 |
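
For the dense model, token-generation throughput scales almost linearly with batch size: 50.61 / 25.84 ≈ 1.96x at B=2, up to 165.11 / 25.84 ≈ 6.39x at B=8.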

But when trying the same with a MOE model (gpt-oss-120b in this case), there is negative scaling at batch sizes 2 and 3. I know MOE models scale worse, since not every sequence activates the same experts (so there is less weight sharing between sequences), but I would still expect some positive improvement as batch size increases, not the current negative scaling:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.281 |   399.82 |    2.531 |    50.58 |    3.811 |   167.92 |
|   512 |    128 |    2 |   1280 |    2.527 |   405.27 |    7.296 |    35.09 |    9.823 |   130.31 |
|   512 |    128 |    3 |   1920 |    3.879 |   395.98 |    8.605 |    44.62 |   12.484 |   153.79 |
|   512 |    128 |    4 |   2560 |    4.960 |   412.93 |    9.623 |    53.21 |   14.582 |   175.55 |
|   512 |    128 |    5 |   3200 |    6.187 |   413.78 |   10.704 |    59.79 |   16.891 |   189.45 |
|   512 |    128 |    6 |   3840 |    7.419 |   414.05 |   11.554 |    66.47 |   18.974 |   202.39 |
|   512 |    128 |    7 |   4480 |    8.851 |   404.92 |   12.547 |    71.41 |   21.398 |   209.36 |
|   512 |    128 |    8 |   5120 |    9.971 |   410.79 |   13.604 |    75.27 |   23.575 |   217.18 |
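
Here B=2 runs at 35.09 / 50.58 ≈ 0.69x of the single-sequence throughput and B=3 at ≈ 0.88x; the B=1 rate is only exceeded from B=4 onwards. A small Python snippet (S_TG numbers copied from the two tables above) that computes these scaling factors:

```python
# Per-batch TG scaling relative to the B=1 baseline, using the S_TG
# columns from the two llama-batched-bench tables above.
dense_s_tg = [25.84, 50.61, 73.04, 94.09, 114.43, 131.47, 149.29, 165.11]  # Qwen3-8b q8_0
moe_s_tg   = [50.58, 35.09, 44.62, 53.21, 59.79, 66.47, 71.41, 75.27]      # gpt-oss-120b

for name, s_tg in (("dense", dense_s_tg), ("moe", moe_s_tg)):
    base = s_tg[0]
    print(name, [round(x / base, 2) for x in s_tg])

# dense [1.0, 1.96, 2.83, 3.64, 4.43, 5.09, 5.78, 6.39]
# moe   [1.0, 0.69, 0.88, 1.05, 1.18, 1.31, 1.41, 1.49]
```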

First Bad Commit

No response

Relevant log output
