
Conversation

@ikawrakow
Owner

This PR adds grouped expert routing as used by the BailingMoeV2 arch (Ling/Ring models).

The implementation is CPU-only, so it is disabled by default for now. It can be enabled via `-ger` or `--grouped-expert-routing`.

Quick testing with Ling-mini-2.0 with full GPU offload shows only a 20-30% performance degradation when using grouped expert routing (which runs on the CPU). For larger models and/or hybrid GPU/CPU inference the impact will be even smaller, so it is feasible to try this option even before CUDA support is added.

The implementation in this PR is based on my interpretation of the original Python implementation, which clearly differs from @CISC's interpretation in the llama.cpp BailingMoeV2 PR.

The following table shows perplexities computed for Wikitext2 and the plain text version of the Pride and Prejudice novel from Project Gutenberg (column P&P in the table).

| routing            | Wiki2   | P&P     |
|--------------------|---------|---------|
| standard           | 13.4016 | 31.7274 |
| grouped (this PR)  | 13.5730 | 31.7236 |
| grouped (@CISC PR) | 34.9221 | 48.6346 |

Based on this, my guess is that the implementation in this PR is more likely to be correct than the implementation in @CISC's llama.cpp PR.

In terms of CPU-only performance, grouped expert routing in this PR is about the same as, or even very slightly faster than, standard top_k expert routing.

@CISC
Contributor

CISC commented Oct 16, 2025

So, was it my interpretation or implementation that was wrong?

@ikawrakow
Owner Author

> So, was it my interpretation or implementation that was wrong?

Hard to say.

My implementation does this (in plain English); see the sketch after the list:

  • For each token, group the `n_expert` probabilities into `n_group` groups of `n_expert/n_group` elements each
  • Compute a score for each group as the sum of its top-2 elements (i.e., the two highest-probability experts in the group)
  • Select the `n_used_groups` groups with the highest group score
  • Apply normal top_k expert selection using only the experts in the selected groups. This can be done either by setting the score of experts not in the selected groups to -INFINITY and then applying normal top_k expert selection (as in the Python implementation), or by directly fusing the top_k expert selection with the group selection, taking advantage of the fact that we already know which experts participate in the top_k selection (the approach used in this PR).
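For illustration, here is a minimal standalone C++ sketch of the routing described above for a single token. The function and variable names (`grouped_top_k`, `probs`, etc.) are hypothetical and do not correspond to the actual code in this PR; it only demonstrates the group scoring, group selection, and fused top_k steps, assuming valid arguments and groups of at least two experts.

```cpp
// Hypothetical sketch of grouped expert routing for a single token.
// Not the actual ik_llama.cpp code; names and shapes are illustrative.
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// probs: n_expert routing probabilities for one token.
// Returns the indices of the top_k experts selected from the
// n_used_groups highest-scoring groups.
std::vector<int> grouped_top_k(const std::vector<float> & probs,
                               int n_group, int n_used_groups, int top_k) {
    const int n_expert   = (int) probs.size();
    const int group_size = n_expert / n_group;

    // Score each group by the sum of its two highest probabilities.
    std::vector<std::pair<float, int>> group_scores(n_group);
    for (int g = 0; g < n_group; ++g) {
        std::vector<float> grp(probs.begin() +  g      * group_size,
                               probs.begin() + (g + 1) * group_size);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        group_scores[g] = { grp[0] + grp[1], g };
    }

    // Keep the n_used_groups groups with the highest score.
    std::partial_sort(group_scores.begin(), group_scores.begin() + n_used_groups,
                      group_scores.end(), std::greater<>());

    // Fused top_k: only experts from the selected groups are candidates,
    // so no -INFINITY masking of the other experts is needed.
    std::vector<std::pair<float, int>> candidates;
    for (int i = 0; i < n_used_groups; ++i) {
        const int g = group_scores[i].second;
        for (int j = 0; j < group_size; ++j) {
            const int e = g * group_size + j;
            candidates.push_back({ probs[e], e });
        }
    }
    std::partial_sort(candidates.begin(), candidates.begin() + top_k,
                      candidates.end(), std::greater<>());

    std::vector<int> selected(top_k);
    for (int i = 0; i < top_k; ++i) {
        selected[i] = candidates[i].second;
    }
    return selected;
}
```

The masking variant from the Python reference would instead set the probabilities of experts outside the selected groups to -INFINITY and run a plain top_k over all `n_expert` entries; both variants select the same experts.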

In your PR you group all tokens into groups. There are matrix transpositions and whatnot. I can't tell if this is so because your interpretation of the Python implementation is different from mine, or if it is simply an incorrect implementation of the above algorithm using the limited capabilities of the underlying ggml toolkit.

@CISC
Contributor

CISC commented Oct 16, 2025

> In your PR you group all tokens into groups. There are matrix transpositions and whatnot. I can't tell if this is so because your interpretation of the Python implementation is different from mine, or if it is simply an incorrect implementation of the above algorithm using the limited capabilities of the underlying ggml toolkit.

Probably a little bit of both. :) So, AFAICT our interpretations are basically the same, it's just that I got things mixed up on the first step (I struggled a bit with this because it wasn't possible to use top_k the same way as in the Python code).

Ah well, thanks for the reality check, I'll look into fixing the implementation when I have time.

@ikawrakow
Owner Author

@ubergarm In case you are still into quant cooking, the Ling-1T/Ring-1T models are an opportunity to publish GGUFs before they become available for mainline. The PR there is still WIP, while I think the models should be functional in ik_llama.cpp after merging this PR. Just make sure to add `-ger` when computing the imatrix.
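For example, an imatrix run might look like the following; the binary path, model, and file names are placeholders, the only point here is passing `-ger`:

```console
# hypothetical invocation; model and calibration file names are placeholders
./bin/llama-imatrix -m Ling-1T-Q8_0.gguf -f calibration.txt -o imatrix.dat -ger
```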

ikawrakow merged commit dbfd151 into main on Oct 16, 2025
@ubergarm
Contributor

ubergarm commented Oct 16, 2025

> In case you are still into quant cooking, the Ling-1T/Ring-1T models are an opportunity to publish GGUFs

Thanks! I almost started downloading it a couple of days ago, but am having hiccups with Hugging Face changing their public repo size allowances recently. I just subscribed to the $9/mo PRO tier and hope that allows me to continue uploading quants, as I'm over 26TB already...

Working through it now and comparing Ling-1T with Ling-flash-2.0:

```console
Ling-flash-2.0$ grep moe_shared_expert_intermediate_size *.json
config.json:    "moe_shared_expert_intermediate_size": 1024

Ling-1T$ grep moe_shared_expert_intermediate_size *.json
# nothing
```

Created a small patch PR here: #837 and will update if anything else comes up along the way. Hopefully I'll be able to upload the imatrix .dat file and quants to Hugging Face if all goes well. 🤞

@CISC
Contributor

CISC commented Oct 16, 2025

@ikawrakow I think I got it right now: ggml-org/llama.cpp@bc6c48a

@ikawrakow
Owner Author

> @ikawrakow I think I got it right now: ggml-org/llama.cpp@bc6c48a

For some reason, when I try to use the same implementation in ik_llama.cpp, I get an assert in the `ggml_set_rows` op, so I ended up making a dedicated CUDA implementation (PR #838).

@CISC
Contributor

CISC commented Oct 17, 2025

> @ikawrakow I think I got it right now: ggml-org/llama.cpp@bc6c48a
>
> For some reason, when I try to use the same implementation in ik_llama.cpp, I get an assert in the `ggml_set_rows` op, so I ended up making a dedicated CUDA implementation (PR #838).

Weird, but awesome with the CUDA implementation. I think it's possible to optimize away the masking (and thus the set rows) though; I'll give it a go...

@CISC
Contributor

CISC commented Oct 17, 2025

> I think it's possible to optimize away the masking (and thus the set rows) though; I'll give it a go...

Guess not, doing so skews the ids for mulmat ops later.
