
Conversation

@slaren (Member) commented Aug 4, 2025

Following @jacekpoplawski's suggestion in #14992, this adds options to keep the MoE weights of the first N layers in the CPU. You can use:

  • --cpu-moe to keep all MoE weights in the CPU
  • --n-cpu-moe N to keep the MoE weights of the first N layers in the CPU
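
For example, a minimal sketch with both options (the model path and the layer count below are placeholders, not taken from this PR):

  llama-server -m ./model.gguf --n-gpu-layers 99 --cpu-moe
  llama-server -m ./model.gguf --n-gpu-layers 99 --n-cpu-moe 10

The first command keeps the expert weights of every MoE layer in the CPU while the rest of the model is offloaded to the GPU; the second keeps only the expert weights of the first 10 layers in the CPU, trading some CPU RAM use for freed-up VRAM.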

The goal is to avoid having to write complex regular expressions when trying to optimize the number of MoE layers to keep in the CPU.

These options work by adding the necessary tensor overrides. If you use --override-tensor before these options, your overrides will take priority.
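
As a rough illustration of the overrides involved (the exact pattern below is my own approximation, not quoted from this PR), --n-cpu-moe 3 should behave much like an --override-tensor rule that sends the expert tensors of layers 0, 1 and 2 to the CPU buffer:

  llama-server -m ./model.gguf --n-gpu-layers 99 --override-tensor "blk\.(0|1|2)\.ffn_(up|down|gate)_exps.*=CPU"

--cpu-moe does the same for the expert tensors of every layer; hand-maintaining such patterns is exactly the busywork these options remove.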

slaren added 2 commits August 4, 2025 23:41
  • Keeps the MoE weights of the first N layers in the CPU
  • adding a destructor to common_params would cause issues when the object is copied
@slaren merged commit ec428b0 into master Aug 4, 2025
45 of 47 checks passed
@slaren deleted the sl/ncmoe branch August 4, 2025 23:05
@jacekpoplawski (Contributor) commented:

Thank you :)

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025
* llama : add --n-cpu-moe option

Keeps the MoE weights of the first N layers in the CPU
@SlavikCA commented Aug 5, 2025

Should these options also be added to this page: https://github.com/ggml-org/llama.cpp/tree/master/tools/server ?

thad0ctor added a commit to thad0ctor/llama-server-launcher that referenced this pull request Aug 6, 2025
--cpu-moe to keep all MoE weights in the CPU
--n-cpu-moe N to keep the MoE weights of the first N layers in the CPU

ggml-org/llama.cpp#15077
@g0t4 commented Aug 7, 2025

thank you! just got 108T/s with gpt-oss:120b on my dual 5090s with --n-cpu-moe 3... so awesome I haven't had time to see if I should tweak it further :)

llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja --flash-attn --n-gpu-layers 99 --reasoning-format none --n-cpu-moe 3
