
Conversation

@lksj92hs
Contributor

Currently, --n-cpu-moe takes a single parameter N, the number of MoE layers to offload to the CPU, which is interpreted as the range 0 to N-1. This works well enough when using a single GPU, but there are some major complications when multiple GPUs are used. For example, when running a model like unsloth/Qwen3-30B-A3B-Instruct-2507-Q2_K on two NVIDIA 3090s:

  • Layers 0 to 47 are placed on GPU0 and layers 48-94 are placed on GPU1. However, with --n-cpu-moe 44, all the offloaded MoE layers come from the first GPU, which leaves its VRAM underutilized, while the second GPU does not have enough VRAM and we get an out-of-memory error.
  • Currently the only way to fix this and maximize the VRAM utilization of both GPUs is to force a different layer placement with the -ts argument. We have to start e.g. llama-server multiple times with different -ts values to see which combination best utilizes the VRAM of both GPUs. In this case the best option is -ts 5,2, which places layers 0-65 on GPU0 and layers 65-94 on GPU1.

There are two problems with this:

  1. Finding the best --n-cpu-moe and --ts values can be quite time-consuming, especially with large models which load relatively slowly.
  2. Even when we find the best combination and both GPUs have their VRAM usage reasonably optimized, all the offloaded MoE layers are still on the first GPU, and the second GPU holds significantly fewer layers (but with all of their tensors), which in my tests leads to less than optimal performance (see below for the results).

The proposed change allows specifying layer ranges in --n-cpu-moe like this:

  • --n-cpu-moe N offloads MoE layers 0 to N-1 to the CPU (the same as the current behavior)
  • --n-cpu-moe N1-N2,N3-N4 offloads MoE layers N1 to N2 and N3 to N4 to the CPU, which allows distributing the layers more evenly between the GPUs
  • --n-cpu-moe N1-N2,N3 - if a comma is present and a single value follows it, that value is treated as an individual layer index, so in this case MoE layers N1 to N2 and layer N3 are offloaded to the CPU

With these changes, we can use --n-cpu-moe 0-22,48-68, which still offloads 44 MoE layers to the CPU but without the non-intuitive -ts 5,2. The result is that layers 0-47 (with 23 MoE layers offloaded) are on GPU0 and layers 48-94 (with 21 MoE layers offloaded) are on GPU1, which is a much more even distribution than before.
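For illustration, here is a minimal, self-contained sketch of how such a value could be parsed into a set of MoE layer indices. This is not the code from this PR; the function name parse_n_cpu_moe is only illustrative:

```cpp
// Sketch of the proposed --n-cpu-moe semantics (not the PR's actual code):
//   "44"         -> layers 0..43 (current behavior)
//   "0-22,48-68" -> layers 0..22 and 48..68
//   "0-22,30"    -> layers 0..22 and layer 30
#include <cstdio>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

static std::set<int> parse_n_cpu_moe(const std::string & arg) {
    std::set<int> layers;
    // No comma and no dash: a single number N means the range 0..N-1.
    if (arg.find(',') == std::string::npos && arg.find('-') == std::string::npos) {
        int n = std::stoi(arg);
        for (int i = 0; i < n; i++) layers.insert(i);
        return layers;
    }
    std::stringstream ss(arg);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        size_t dash = tok.find('-');
        if (dash == std::string::npos) {
            layers.insert(std::stoi(tok));           // single layer index
        } else {
            int lo = std::stoi(tok.substr(0, dash)); // inclusive range lo-hi
            int hi = std::stoi(tok.substr(dash + 1));
            if (lo > hi) throw std::invalid_argument("bad range: " + tok);
            for (int i = lo; i <= hi; i++) layers.insert(i);
        }
    }
    return layers;
}

int main() {
    for (int l : parse_n_cpu_moe("0-22,48-68")) printf("%d ", l); // 44 layers in total
    printf("\n");
}
```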

Here are the test results on the following machine:

Motherboard: Gigabyte Technology Co., Ltd. X299X AORUS MASTER
BIOS: American Megatrends Inc. F3m from 12/06/2021
2x NVIDIA GeForce RTX 3090 [CUDA] PCIe 3.0 x16 + NVLink PL 350.00 W 38C Driver 550.144.03
Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz (28 thr)
252 GB RAM 3000 MT/s (8 DIMMs, 4 channels) (2% used)

Results with --n-cpu-moe 44 --ts 5,2:

pp512: 3262.99ms 156.9 t/s, tg128: 9.77 t/s, avg: 7.82 t/s
pp1024: 6638.63ms 154.2 t/s, tg128: 9.75 t/s, avg: 6.48 t/s
pp2048: 12987.12ms 157.7 t/s, tg128: 9.72 t/s, avg: 4.89 t/s
pp3968: 26079.23ms 152.2 t/s, tg128: 9.70 t/s, avg: 3.26 t/s

Results with --n-cpu-moe 0-22,48-68:

pp512: 3302.07ms 155.1 t/s, tg128: 13.82 t/s, avg: 10.19 t/s
pp1024: 6778.08ms 151.1 t/s, tg128: 13.80 t/s, avg: 7.97 t/s
pp2048: 13328.80ms 153.7 t/s, tg128: 13.84 t/s, avg: 5.67 t/s
pp3968: 26616.76ms 149.1 t/s, tg128: 13.63 t/s, avg: 3.55 t/s

@jacekpoplawski
Contributor

Will this new syntax work with #15952? I mean, how could we pass multiple setups this way (comma)?

@lksj92hs
Contributor Author

I think these changes are largely compatible, except for the ability to use the new syntax in llama-bench: as it is, llama-bench will only support the current format, which is fine by me but a little restrictive.

Alternatives:

  • Add a new parameter named e.g. --cpu-moe-layers instead of reusing --n-cpu-moe. I don't like this one because it requires many more changes in the code and creates confusion when both arguments are specified.
  • Use the new syntax only for llama-cli and llama-server, leaving the old syntax for llama-bench: the easiest solution.
  • Use the new syntax in llama-bench too. This requires some changes in llama-bench: add --n-cpu-moe support #15952, and the commas will be ambiguous, but that can be solved either by disabling the ability to specify multiple instances of --n-cpu-moe in llama-bench, or by adding an alternative delimiter (e.g. :) in this PR, so that something like --n-cpu-moe 0-23:48-68 can be specified in llama-bench while --n-cpu-moe 0-23,48-68 (and --n-cpu-moe 0-23:48-68) remain usable in the other CLI tools (see the sketch after this list).
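For illustration, a hypothetical sketch of that alternative-delimiter handling (the names are made up for this example, not code from this PR): splitting the value on either ',' or ':' yields the same layer groups in both forms, so the colon form stays unambiguous inside llama-bench's comma-separated parameter lists.

```cpp
// Hypothetical sketch of accepting ':' as an alternative group delimiter.
#include <cstdio>
#include <string>
#include <vector>

static std::vector<std::string> split_groups(const std::string & arg) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : arg) {
        if (c == ',' || c == ':') {   // both delimiters separate layer groups
            if (!cur.empty()) out.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) out.push_back(cur);
    return out;
}

int main() {
    // "0-23:48-68" and "0-23,48-68" produce the same groups.
    for (const auto & g : split_groups("0-23:48-68")) printf("[%s] ", g.c_str());
    printf("\n");
}
```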

Resolved merge conflicts
@lksj92hs force-pushed the cli-n-cpu-moe-allow-layer-groups branch from 5e2e03a to f11f6d6 on September 17, 2025 02:48
@lksj92hs
Contributor Author

lksj92hs commented Sep 17, 2025

OK, the current version uses the new syntax (which is fully backwards compatible) in llama-cli and llama-server, leaving the old syntax for llama-bench. The merge conflict with the accepted PR #15952 is resolved.

@lksj92hs changed the title from "cli: allow layer groups in --n-cpu-moe" to "llama-server: allow layer groups in --n-cpu-moe" on Sep 17, 2025
@jacekpoplawski
Contributor

I think your change will be very useful. I’m forced to use -ts myself (I use 3 GPUs), so that will solve my pain.

However, the idea behind the llama-bench change was to make it work just like llama-server, so now there will be a difference again.

I was thinking about it... maybe we could use a formula instead of raw ranges? Something like “offload N layers per GPU”? But that assumes each layer is the same size and each GPU has the same amount of VRAM.

Or maybe we could just use a different delimiter than a comma from the start.
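For illustration only, here is a hypothetical sketch of the "offload N MoE layers per GPU" idea mentioned above. All names are made up for this example; it assumes every layer in each GPU's range is an MoE layer and, as noted, ignores per-layer size and per-GPU VRAM differences:

```cpp
// Hypothetical "offload N MoE layers per GPU": given each GPU's contiguous
// layer range, mark the first N layers of each range for CPU offload.
#include <cstdio>
#include <set>
#include <vector>

struct gpu_range { int first; int last; }; // inclusive layer range on one GPU

static std::set<int> offload_per_gpu(const std::vector<gpu_range> & gpus, int n_per_gpu) {
    std::set<int> cpu_layers;
    for (const auto & g : gpus) {
        for (int i = 0; i < n_per_gpu && g.first + i <= g.last; i++) {
            cpu_layers.insert(g.first + i);
        }
    }
    return cpu_layers;
}

int main() {
    // Example: layers 0-47 on GPU0 and 48-94 on GPU1, offload 22 per GPU.
    std::vector<gpu_range> gpus = { {0, 47}, {48, 94} };
    for (int l : offload_per_gpu(gpus, 22)) printf("%d ", l);
    printf("\n");
}
```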

@lksj92hs closed this on Sep 23, 2025
@lksj92hs deleted the cli-n-cpu-moe-allow-layer-groups branch on September 23, 2025 05:09