Conversation

jacekpoplawski (Contributor) commented Sep 12, 2025

Follow-up to #15077

This changeset addresses three items (I can split it into atomic commits if needed):

  • Support --n-cpu-moe in llama-bench the same way it is supported by llama-server.

  • Fix the table by trimming tensor_buft_overrides output in markdown_printer.

  • Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.
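
A minimal sketch of what the third item's helper could look like (names are illustrative, not necessarily the ones used in this PR):

```cpp
#include <string>

// Pattern matching the expert FFN tensors of a single MoE layer.
static std::string moe_expert_pattern(int layer) {
    return "blk\\." + std::to_string(layer) + "\\.ffn_(up|down|gate)_exps";
}

// --n-cpu-moe N: keep the expert tensors of the first N layers on the CPU,
// expressed as ';'-separated "pattern=CPU" buffer-type overrides.
static std::string n_cpu_moe_overrides(int n_layers) {
    std::string out;
    for (int i = 0; i < n_layers; ++i) {
        if (!out.empty()) {
            out += ";";
        }
        out += moe_expert_pattern(i) + "=CPU";
    }
    return out;
}
```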

tested on Windows:

PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           pp512 |        378.46 ± 1.43 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |           tg128 |         11.09 ± 0.00 |

build: 08a216474 (6457)
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |                                       ot |            test |                  t/s |     
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------------------------------------: | --------------: | -------------------: |     
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           pp512 |      1253.62 ± 20.28 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 | blk\.0\.ffn_(up|down|gate)_exps=CPU;blk\ |           tg128 |         54.04 ± 0.27 |

build: 08a216474 (6457)

pwilkin (Collaborator) commented Sep 12, 2025

Nice, I had this on my TODO list as well :)

pwilkin self-assigned this Sep 12, 2025

pwilkin left a comment

Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.


if (field == "tensor_buft_overrides") {
    if (value.size() > width)
        value.erase(width);
Collaborator

Can we have the typical behavior of "strip last 3 characters + add '...'" instead to make it clear to the user we're truncating?
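
Something along these lines, perhaps (an illustrative sketch of the suggested behavior, not the final code):

```cpp
// Truncate visibly: keep width-3 characters and append "..." so the
// reader can tell the override string was cut.
if (field == "tensor_buft_overrides" && value.size() > width) {
    value.erase(width - 3);
    value += "...";
}
```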

Contributor Author

I was thinking about several ways to handle it:

  • simple cut
  • “nice” cut (with “…”)
  • using a special string like “-” instead of the value
  • renaming the “ot” column to “ncmoe”
  • removing the “ot” column

I went with the simplest option just to start the discussion.
I think the last option (skipping that column) is the best.
However, since someone added “-ot” to llama-bench, I don’t want to introduce a regression.
Still, if you look at the table without the cut, it only works correctly for short strings.

Is this column useful? The way I implemented --n-cpu-moe, it can have only one value, so each row will have the same value in that column.

Collaborator

Okay, final version: I think we should unify how parameters to llama-bench behave:

  • make --n-cpu-moe a vector-type parameter, similar to all the other switches (like -fa, -p, -n and the like), so that it's suitable for comparison
  • keep -ot for now, but:
  • move all the parameters that don't take multiple options, i.e. don't take part in the comparison, above the table. Should look something like this:
PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
model: qwen3moe 30B.A3B Q3_K - Medium
model size: 13.70 GiB
model parameters: 30.53 B
override_tensors: blk\.0\.ffn_(up|down|gate)_exps=CPU;blk... (full expression here if actually ot was set)

| backend | ngl | test  |             t/s |
| ------- | --: | ----- | --------------: |
| CUDA    |  99 | pp512 | 1253.62 ± 20.28 |
| CUDA    |  99 | tg128 |    54.04 ± 0.27 |

This way, all parameters that are actually part of the benchmark are in the table, while the table isn't cluttered with repeated values that don't actually change during the test.
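
For the vector-type switch, the idea is roughly this (illustrative sketch; the actual llama-bench argument handling may differ):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse "--n-cpu-moe 0,6,12,18" into one value per benchmark configuration,
// the same way -p, -n or -fa accept comma-separated lists.
static std::vector<int> parse_int_list(const std::string & arg) {
    std::vector<int> values;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}
```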

(if you don't want to implement all of this, just do the --n-cpu-moe as a vector part and move override_tensors above and we can add another PR for the rest)

@ggerganov @slaren @CISC what do you think? does this sound like a reasonable idea?

Member

n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

jacekpoplawski (Contributor Author), Sep 13, 2025

> n-cpu-moe should be a different parameter and shown as its own column in the markdown output, not just merged with the rest of the overrides.

I agree that this would be the ideal behavior, and I will try to implement it that way.
n_cpu_moe will probably be part of struct test (not just an alias for ot in the argument parser), so the change will be bigger.
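
Roughly this kind of shape (a sketch with guessed names; the real struct in llama-bench differs in the details):

```cpp
// Illustrative only: n_cpu_moe stored per run and reported as its own column.
struct test_config {
    int n_gpu_layers = 99;
    int n_cpu_moe    = 0;   // new field, one value per benchmark run
    // ... other per-run parameters ...
};

// At load time the value is expanded into CPU buffer-type overrides for the
// first n_cpu_moe layers, while the markdown printer shows just the integer
// under an "n_cpu_moe" column instead of the full -ot override string.
```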

Contributor Author

PS C:\Users\jacek\git\llama.cpp> .\build_2025.09.12\bin\Release\llama-bench.exe -m J:\llm\models\Qwen3-30B-A3B-Instruct-2507-Q3_K_M.gguf --n-cpu-moe 0,6,12,18
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl |  n_cpu_moe |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           pp512 |        433.44 ± 2.62 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          0 |           tg128 |         10.85 ± 0.00 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           pp512 |        388.82 ± 8.21 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |          6 |           tg128 |         13.56 ± 0.01 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           pp512 |      1239.89 ± 11.49 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         12 |           tg128 |         51.03 ± 1.88 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           pp512 |       652.27 ± 20.82 |
| qwen3moe 30B.A3B Q3_K - Medium |  13.70 GiB |    30.53 B | CUDA       |  99 |         18 |           tg128 |         33.97 ± 1.17 |

build: c0bb1ae67 (6457)

}

-    int width = get_field_width(field);
+    unsigned int width = get_field_width(field);

Member

What's the reason to change this to unsigned?

Contributor Author

I tried to fix the CI (but it’s still failing on mac for an unrelated reason)

Member

The CI issue is not related. Please revert this change if it is not necessary.

Contributor Author

done

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.

Refactor duplicated ffn_(up|down|gate)_exps regex into helper functions.

slaren (Member) commented Sep 16, 2025

> Looks good, but would be nice if this had --cpu-moe support as well so that we can get it done in one go.

Maybe we can add --cpu-moe 0|1 later as an alias for --n-cpu-moe 0|999 or similar, but for now this should be good.
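
The mapping itself could be as simple as this (illustrative only, not part of this PR):

```cpp
// --cpu-moe <0|1> as sugar over the existing --n-cpu-moe code path:
// 1 means "keep all MoE expert tensors on the CPU", 0 means no override.
int cpu_moe_flag = 1;                       // hypothetical value of --cpu-moe
int n_cpu_moe    = cpu_moe_flag ? 999 : 0;  // reuse the --n-cpu-moe handling
```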

slaren merged commit 8ff2060 into ggml-org:master Sep 16, 2025
45 checks passed
angt pushed a commit to angt/llama.cpp that referenced this pull request Sep 17, 2025
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.

jacekpoplawski (Contributor Author)

looks like it's actually useful :)
https://www.reddit.com/r/LocalLLaMA/comments/1nt2c38/llamacpp_moe_models_find_best_ncpumoe_value/
