Conversation

jacekpoplawski (Contributor) commented Sep 12, 2025

Follow-up to #15077

This changeset addresses three items (I can split it into atomic commits if needed):

  • Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.

  • Fix the table by trimming tensor_buft_overrides output in markdown_printer.

  • Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.
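
A minimal sketch of what the third item's helper could look like (names are illustrative, not necessarily the ones used in this PR):

```cpp
#include <string>

// Pattern matching the expert FFN tensors of a single MoE layer.
static std::string moe_expert_pattern(int layer) {
    return "blk\\." + std::to_string(layer) + "\\.ffn_(up|down|gate)_exps";
}

// --n-cpu-moe N: keep the expert tensors of the first N layers on the CPU,
// expressed as ';'-separated "pattern=CPU" buffer-type overrides.
static std::string n_cpu_moe_overrides(int n_layers) {
    std::string out;
    for (int i = 0; i < n_layers; ++i) {
        if (!out.empty()) {
            out += ";";
        }
        out += moe_expert_pattern(i) + "=CPU";
    }
    return out;
}
```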

tested on Windows:

PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           pp512 |        378.46 ± 1.43 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           tg128 |         11.09 ± 0.00 |

build: 08a216474 (6457)
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |                                       ot |            test |                  t/s |     
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------------------------------------: | --------------: | -------------------: |     
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           pp512 |      1253.62 ± 20.28 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           tg128 |         54.04 ± 0.27 |

build: 08a216474 (6457)

pwilkin (Collaborator) commented Sep 12, 2025

Nice, I had this on my TODO list as well :)

pwilkin self-assigned this Sep 12, 2025

pwilkin left a comment

Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.


if (field == "tensor_buft_overrides") {
    if (value.size() > width)
        value.erase(width);
Collaborator

Can we have the typical behavior of "strip last 3 characters + add '...'" instead to make it clear to the user we're truncating?
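
Something along these lines, perhaps (an illustrative sketch of the suggested behavior, not the final code):

```cpp
// Truncate visibly: keep width-3 characters and append "..." so the
// reader can tell the override string was cut.
if (field == "tensor_buft_overrides" && value.size() > width) {
    value.erase(width - 3);
    value += "...";
}
```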

Contributor Author

I was thinking about several ways to handle it:

  • simple cut
  • “nice” cut (with “…”)
  • using a special string like “-” instead of the value
  • renaming the “ot” column to “ncmoe”
  • removing the “ot” column

I went with the simplest option just to start the discussion.
I think the last option (skipping that column) is the best.
However, since someone added “-ot” to llama-bench, I don’t want to introduce a regression.
Still, if you look at the table without the cut, it only works correctly for short strings.

Is this column useful? The way I implemented --n-cpu-moe, it can have only one value, so each row will have the same value in that column.

Collaborator

Okay, final version: I think we should unify how parameters to llama-bench behave:

  • make --n-cpu-moe a vector-type parameter, similar to all the other switches (like -fa, -p, -n and the like), so that it's suitable for comparison
  • keep -ot for now, but:
  • move all the parameters that don't take multiple options, i.e. don't take part in the comparison, above the table. Should look something like this:
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
model: qwen3moe 30B.A3B Q3_K - Medium
model size: 13.70 GiB
model parameters: 30.53 B
override_tensors: blk\.0\.ffn_(up|down|gate)_exps=CPU;blk... (full expression here if actually ot was set)

| backend | ngl | test  |             t/s |
| ------- | --: | ----- | --------------: |
| CUDA    |  99 | pp512 | 1253.62 ± 20.28 |
| CUDA    |  99 | tg128 |    54.04 ± 0.27 |

This way, all parameters that are actually part of the benchmark are in the table, while the table isn't cluttered with repeated values that don't actually change during the test.
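
For the vector-type switch, the idea is roughly this (illustrative sketch; the actual llama-bench argument handling may differ):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse "--n-cpu-moe 0,6,12,18" into one value per benchmark configuration,
// the same way -p, -n or -fa accept comma-separated lists.
static std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}
```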

(if you don't want to implement all of this, just do the --n-cpu-moe as a vector part and move override_tensors above and we can add another PR for the rest)

@ggerganov @slaren @CISC what do you think? does this sound like a reasonable idea?

Member

n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

jacekpoplawski (Contributor Author), Sep 13, 2025

> n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

I agree that this would be the ideal behavior, and I will try to implement it that way.
n_cpu_moe will probably be part of struct test (not just an alias for ot in the argument parser), so the change will be bigger.
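
Roughly this kind of shape (a sketch with guessed names; the real struct in llama-bench differs in the details):

```cpp
// Illustrative only: n_cpu_moe stored per run and reported as its own column.
struct test_config {
    int n_gpu_layers = 99;
    int n_cpu_moe    = 0;   // new field, one value per benchmark run
    // ... other per-run parameters ...
};

// At load time the value is expanded into CPU buffer-type overrides for the
// first n_cpu_moe layers, while the markdown printer shows just the integer
// under an "n_cpu_moe" column instead of the full -ot override string.
```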

Contributor Author

PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 0,6,12,18
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |  n_cpu_moe |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           pp512 |        433.44 ± 2.62 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           tg128 |         10.85 ± 0.00 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           pp512 |        388.82 ± 8.21 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           tg128 |         13.56 ± 0.01 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           pp512 |      1239.89 ± 11.49 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           tg128 |         51.03 ± 1.88 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           pp512 |       652.27 ± 20.82 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           tg128 |         33.97 ± 1.17 |

build: c0bb1ae67 (6457)

}

-    int width = get_field_width(field);
+    unsigned int width = get_field_width(field);

Member

What's the reason to change this to unsigned?

Contributor Author

I tried to fix the CI (but it’s still failing on mac for an unrelated reason)

Member

The CI issue is not related. Please revert this change if it is not necessary.

Contributor Author

done

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.

Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.

slaren (Member) commented Sep 16, 2025

> Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.

Maybe we can add --cpu-moe 0|1 later as an alias for --n-cpu-moe 0|999 or similar, but for now this should be good.
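
The mapping itself could be as simple as this (illustrative only, not part of this PR):

```cpp
// --cpu-moe <0|1> as sugar over the existing --n-cpu-moe code path:
// 1 means "keep all MoE expert tensors on the CPU", 0 means no override.
int cpu_moe_flag = 1;                       // hypothetical value of --cpu-moe
int n_cpu_moe    = cpu_moe_flag ? 999 : 0;  // reuse the --n-cpu-moe handling
```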

slaren merged commit 8ff2060 into ggml-org:master Sep 16, 2025
45 checks passed
angt pushed a commit to angt/llama.cpp that referenced this pull request Sep 17, 2025
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.

jacekpoplawski (Contributor Author)

looks like it's actually useful :)
https://www.reddit.com/r/LocalLLaMA/comments/1nt2c38/llamacpp_moe_models_find_best_ncpumoe_value/
