llama-server: allow layer groups in --n-cpu-moe #15975
                
Status: Closed
Currently, `--n-cpu-moe` takes a single parameter N: the number of MOE layers to offload to the CPU. This is interpreted as the range [0, N-1]. This works well enough when using a single GPU, but there are some major complications when multiple GPUs are used. For example, when running a model like unsloth/Qwen3-30B-A3B-Instruct-2507-Q2_K on two Nvidia 3090s:

- With `--n-cpu-moe 44`, all of the offloaded MOE layers come from the first GPU, which leaves its VRAM underutilized, while the VRAM of the second GPU is not enough and we get an out-of-memory error.
- The alternative is to adjust the layer split with the `-ts` argument. This means starting e.g. `llama-server` multiple times with different `-ts` values to find the combination that best utilizes the VRAM of both GPUs. In this case the best option is `-ts 5,2`, which places layers 0-65 on GPU0 and layers 66-94 on GPU1.

There are two problems with this:
- the required `-ts` values are non-intuitive
- finding the right combination of `--n-cpu-moe` and `-ts` values can be quite time-consuming, especially with large models, which load relatively slowly

The proposed change allows specifying layer ranges in `--n-cpu-moe` like this:
- `--n-cpu-moe N` offloads MOE layers 0 to N-1 to the CPU (the same as the current behavior)
- `--n-cpu-moe N1-N2,N3-N4` offloads MOE layers N1 to N2 and N3 to N4 to the CPU, which allows distributing the offloaded layers more evenly between the GPUs
- `--n-cpu-moe N1-N2,N3`: if a comma-separated entry is a single value, it is treated as an individual layer index to offload, so in this case MOE layers N1 to N2 and layer N3 are offloaded to the CPU (a parsing sketch follows this list)
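To make the accepted grammar concrete, here is a minimal, hypothetical C++ sketch of parsing such a range list into a set of layer indices. It illustrates the semantics described above; the helper name `parse_cpu_moe_ranges` is made up, and this is not the actual implementation from this PR:

```cpp
// Hypothetical sketch: parse a --n-cpu-moe argument such as "44",
// "0-22,48-68", or "0-22,48" into the set of MOE layer indices to
// offload to the CPU. Illustrative only; not the code from this PR.
#include <cstdio>
#include <cstdlib>
#include <set>
#include <sstream>
#include <string>

static std::set<int> parse_cpu_moe_ranges(const std::string & arg) {
    std::set<int> layers;

    // Plain "N" keeps the old meaning: offload layers 0..N-1.
    if (arg.find(',') == std::string::npos && arg.find('-') == std::string::npos) {
        const int n = std::atoi(arg.c_str());
        for (int i = 0; i < n; ++i) {
            layers.insert(i);
        }
        return layers;
    }

    // Otherwise, each comma-separated entry is either "N1-N2" (a range)
    // or a single value, treated as an individual layer index.
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const size_t dash = item.find('-');
        if (dash == std::string::npos) {
            layers.insert(std::atoi(item.c_str()));
        } else {
            const int lo = std::atoi(item.substr(0, dash).c_str());
            const int hi = std::atoi(item.substr(dash + 1).c_str());
            for (int i = lo; i <= hi; ++i) {
                layers.insert(i);
            }
        }
    }
    return layers;
}

int main() {
    // "0-22,48-68" yields 44 indices: 0..22 (23 layers) and 48..68 (21 layers).
    const auto layers = parse_cpu_moe_ranges("0-22,48-68");
    printf("%zu layers offloaded\n", layers.size());
}
```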
With these changes, we can use `--n-cpu-moe 0-22,48-68`, which still offloads 44 MOE layers to the CPU but without the non-intuitive `-ts 5,2`. The result is that layers 0-47 (with 23 MOE layers offloaded) are on GPU0 and layers 48-94 (with 21 MOE layers offloaded) are on GPU1, which is a much more even distribution than before (a small check of these counts follows).

Here are the test results on the following machine:
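As a sanity check on those counts, here is a small self-contained sketch. It assumes the split quoted above (layers 0-47 on GPU0, layers 48-94 on GPU1) and simply counts how many offloaded MOE layers land on each GPU:

```cpp
// Sanity check for the quoted distribution: with layers 0-47 on GPU0 and
// 48-94 on GPU1, --n-cpu-moe 0-22,48-68 offloads 23 MOE layers from GPU0
// and 21 from GPU1 (44 total). All bounds are taken from the PR text.
#include <cstdio>

int main() {
    const int gpu_first[2] = {0, 48};             // first layer on each GPU
    const int gpu_last[2]  = {47, 94};            // last layer on each GPU
    const int ranges[2][2] = {{0, 22}, {48, 68}}; // --n-cpu-moe 0-22,48-68

    for (int g = 0; g < 2; ++g) {
        int offloaded = 0;
        for (const auto & r : ranges) {
            for (int l = r[0]; l <= r[1]; ++l) {
                if (l >= gpu_first[g] && l <= gpu_last[g]) {
                    ++offloaded;
                }
            }
        }
        printf("GPU%d: %d MOE layers offloaded\n", g, offloaded);
    }
    // Output: GPU0: 23, GPU1: 21, an even spread of the 44 offloaded layers.
}
```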
Results with `--n-cpu-moe 44 -ts 5,2`:

Results with `--n-cpu-moe 0-22,48-68`: