
Conversation

@CISC (Collaborator) commented Sep 18, 2025

Adds support for Ling(-flash) 2.0 and Ring 2.0, base and instruction tuned models (sans MTP, but layer included).

Includes expert group selection implemented as a ggml_custom function; not sure if there's a better way, or whether it makes sense to implement some sort of masking op?

Requires #16060 to work on CUDA; there also seems to be some issue with CPU, will check later. Either way, I will reimplement expert group selection with set_rows once possible.
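For reference, a minimal NumPy sketch (not the ggml_custom code from this PR) of what the expert group selection step computes; the per-group scoring formula (sum of the top-2 expert scores per group) is an assumption borrowed from DeepSeek-style grouped routers, and the masking idea mentioned above corresponds to setting the scores of non-selected groups to -inf before the usual top-k:

# Hedged sketch of grouped expert selection in NumPy; not the ggml_custom
# implementation from this PR, and the per-group scoring is an assumption.
import numpy as np

def group_topk(scores, n_groups, topk_groups, topk_experts):
    n_tokens, n_experts = scores.shape
    grouped = scores.reshape(n_tokens, n_groups, n_experts // n_groups)

    # score each group by the sum of its two best experts (assumed formula)
    group_scores = np.sort(grouped, axis=-1)[..., -2:].sum(axis=-1)

    # keep the best groups, mask the rest to -inf, then do the usual top-k
    keep = np.argsort(-group_scores, axis=-1)[:, :topk_groups]
    mask = np.full_like(group_scores, -np.inf)
    np.put_along_axis(mask, keep, 0.0, axis=-1)
    masked = (grouped + mask[:, :, None]).reshape(n_tokens, n_experts)

    return np.argsort(-masked, axis=-1)[:, :topk_experts]

# toy example: 4 tokens, 16 experts in 4 groups, keep 2 groups, select 4 experts
print(group_topk(np.random.rand(4, 16), n_groups=4, topk_groups=2, topk_experts=4))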

@CISC requested a review from ggerganov September 18, 2025 02:34
github-actions bot added the python (python script changes) label Sep 18, 2025
@CISC linked an issue Sep 18, 2025 that may be closed by this pull request
@CISC marked this pull request as draft September 18, 2025 10:53
@yqy3214 commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).

In the Hugging Face transformers library, the implementation is as follows:

def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

This implementation splits the features into two contiguous halves and then rotates them.

However, the implementation in llama.cpp is equivalent to this:

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

This version interleaves the features, selecting even and odd indices separately before rotating.

Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation, similar to the approach in this commit: 09e3df4

@CISC (Collaborator, Author) commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).
[...]
Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation

Nope, this is standard NeoX RoPE, llama.cpp supports that internally.
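For anyone following along, a small self-contained demo (not from the PR) of the two pairing conventions side by side; the half-split variant is what llama.cpp calls NeoX-style RoPE and is selected via the rope type, so no weight permutation is needed:

# Illustrative only: the two rotate_half conventions pair different features.
import torch

def rotate_half_split(x):
    # HF / NeoX style: pair feature i with feature i + d/2
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def rotate_half_interleaved(x):
    # interleaved style: pair feature 2i with feature 2i+1
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

x = torch.arange(8, dtype=torch.float32)
print(rotate_half_split(x))        # tensor([-4., -5., -6., -7.,  0.,  1.,  2.,  3.])
print(rotate_half_interleaved(x))  # tensor([-1.,  0., -3.,  2., -5.,  4., -7.,  6.])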

@yqy3214 commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).
[...]
Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation

Nope, this is standard NeoX RoPE, llama.cpp supports that internally.

Got it, thanks for pointing that out!

@im0qianqian

Hello, could we flip this PR to “Open” and put it up for review?

We're looking forward to being able to use it directly from our laptop as soon as possible. It's a lightweight model that also offers sufficiently high inference speed, and we believe it has the potential to be many people's next-generation AI assistant.

@CISC (Collaborator, Author) commented Sep 22, 2025

Hello, could we flip this PR to “Open” and put it up for review?

I will after I've reimplemented the expert group selection, pending the merge of #16159

@im0qianqian

Hi, I've observed some unexpected behaviors in the current PR during testing, particularly with long-context processing and multi-turn dialogues.
We cross-checked PR #16028 against Hugging Face's forward pass for consistency, and then compared eval-callback results between PR #16028 and this PR. The outputs were nearly identical before applying RoPE, but diverged significantly afterward.
There may be something non-ideal in the implementation causing this; I have gone through the code carefully but have not found the error yet.


Screenshots (omitted): llama-cli output with the same input for this PR (#16063) vs #16028, and eval-callback comparisons before and after RoPE.


Reproduce

Step 1: Download Ling mini 2.0 model

https://huggingface.co/inclusionAI/Ling-mini-2.0/tree/main

Step 2: Convert HF to GGUF

python convert_hf_to_gguf.py --outtype bf16 ./Ling-mini-2.0

Step 3: eval callback

./build-arm64-apple-clang-debug/bin/llama-eval-callback --log-prefix -v -m ./Ling-mini-2.0-BF16.gguf -p "你好,你是谁?" --log-file ling-mini-bf16.txt

Detailed eval-callback logs

@CISC (Collaborator, Author) commented Sep 23, 2025

@im0qianqian Thanks for testing, but first make sure you are using the same chat template; yours does not match the Jinja one, so use --jinja for both.

As for the eval callback, it would be more interesting to compare against HF to see in which direction each deviates.
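For example, a minimal sketch (not part of the PR; the model id, dtype, and prompt are assumed from the reproduce steps above) of dumping per-layer hidden states from HF transformers so the values can be lined up against the llama-eval-callback output:

# Hedged sketch: dump per-layer hidden states from the HF reference model so the
# values can be compared against llama-eval-callback tensors layer by layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"  # assumed, as in the reproduce steps above
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)

inputs = tok("你好，你是谁？", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, hidden_states[i] the output of block i
for i, h in enumerate(out.hidden_states):
    print(i, tuple(h.shape), h[0, -1, :4].float().tolist())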

@im0qianqian commented Sep 23, 2025

@im0qianqian Thanks for testing, but first make sure you are using the same chat template, yours does not match the jinja one, so use --jinja for both.

As for the eval callback it would be more interesting to compare against HF to see which direction each may deviate.

Sorry, my baseline eval-callback log was flawed.
RoPE has been validated as correct; the issue lies in the group top-k router. When I reverted to the plain top-k router, the model output was normal.

> Hello, do you like watching TV or playing games?
I don't have personal preferences or the ability to watch TV or play games, as I'm an AI designed to assist with information and tasks. However, I can certainly help you explore your interests or provide recommendations based on what you enjoy! If you're looking for suggestions on TV shows, games, or anything else, feel free to ask. What do you enjoy most? 😊

> Please follow me and say: I'm a big fan of both TV and games! Watching TV lets me dive into different worlds, meeting new friends, and experiencing exciting adventures.
I'm a big fan of both TV and games! Watching TV lets me dive into different worlds, meeting new friends, and experiencing exciting adventures. If you have any specific recommendations or topics you'd like to explore further, just let me know! 😊

> Where is the sun?
The sun is a star at the center of our solar system. It is about 93 million miles (150 million kilometers) away from Earth. If you're asking about its current position in the sky, that depends on your location and the time of day. Would you like to know more about the sun's position at a specific time and place?


This is the new eval callback log: ling-mini-bf16-topk-router.txt

@CISC (Collaborator, Author) commented Sep 23, 2025

There is probably some issue with it right now, yes; let's retest once I reimplement it with set_rows.

No need to disable ffn_moe_weights_sum_biased, that part is correct at least:
https://huggingface.co/inclusionAI/Ling-flash-base-2.0/blob/5044774b682c9bdc49c9f9beabe3c956d0f92d07/modeling_bailing_moe_v2.py#L345

@warshanks commented Sep 25, 2025

Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published a GGUF ahead of official support. It would be nice to get official support merged!

https://huggingface.co/inclusionAI/Ling-mini-2.0-GGUF
https://huggingface.co/inclusionAI/Ring-mini-2.0-GGUF

@CISC (Collaborator, Author) commented Sep 25, 2025

Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published GGUF ahead of official support. It would be nice to get official support merged!

Sigh, it's preferable not to release GGUFs before a PR is merged; there may be breaking changes during review...

@jacekpoplawski (Contributor)

Sigh, it's preferable not to release GGUFs before a PR is merged, there may be breaking changes during review...

they have their own fork for inference

@wangsff commented Sep 26, 2025

Hi, when is this PR expected to be merged into master?
I'm looking forward to trying out Ling on my laptop.

@im0qianqian

Hello, perhaps we have a better solution. The Ling mini 2.0 is a small-sized model with strong capabilities and exceptionally fast inference performance. Currently, many developers in the community hope to use this model on platforms like llama.cpp / ollama / lm studio.

Therefore, we could prioritize merging a version without group routing support. This way, at least everyone can use the model normally. Once we resolve the issue in set_rows, we can then submit a PR supporting group routing to unlock the model’s more outstanding performance.

In my view, group-wise routing is primarily a pre-training strategy that, combined with optimizations like DeepEP under 256 experts, achieves better training efficiency. For inference, the model can still perform excellently even without group routing.

@CISC (Collaborator, Author) commented Sep 26, 2025

Therefore, we could prioritize merging a version without group routing support. This way, at least everyone can use the model normally. Once we resolve the issue in set_rows, we can then submit a PR supporting group routing to unlock the model’s more outstanding performance.

I will try to resolve this today/this weekend, but as you say there really is no issue in releasing a version without it (as long as the metadata is there to support it in the future).

@JohannesGaessler (Collaborator)

I didn't write the kernel and have never touched it, but that was my interpretation of the code as well.

@slaren (Member) commented Sep 28, 2025

For now, modifying supports_op to check the limits should be enough to avoid the crash; an optimization can come later. Btw, the CPU implementation is also very bad for big tensors (it's a bubble sort).

if we had something like argmax2 that would be equivalent, but this works fine until then
@CISC requested a review from slaren as a code owner September 29, 2025 09:13
@engrtipusultan

@CISC is this still a work in progress, or is it halted for some reason?

@CISC (Collaborator, Author) commented Oct 6, 2025

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

@oovloveme

Can it support the new Ling-1T?

@CISC (Collaborator, Author) commented Oct 9, 2025

Can it support the new Ling-1T?

Yes.

@wangsff commented Oct 10, 2025

Looking forward to this PR being merged into master ASAP!

@qingy1337

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

Does this mean the GGUF will perform slightly worse than it theoretically should?

@CISC (Collaborator, Author) commented Oct 11, 2025

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

Does this mean the GGUF will perform slightly worse than it theoretically should?

Probably subjective at best; it just means that it will not always select the same groups due to different scoring.

@wang824892540

soo long

@CISC linked an issue Oct 14, 2025 that may be closed by this pull request
@wateryuen

Is there any way to get the reviewers to notice this?

@ikawrakow (Contributor)

This comment answers this question:

CUDA build

./bin/llama-bench -m Ling-mini-2.0-Q4_K_M.gguf -p 2048 -n 0 -t 1 -ngl 100 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
/home/iwan/other/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
[New LWP 4091594]
[New LWP 4091598]
[New LWP 4091599]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007753f58ea42f in __GI___wait4 (pid=4091602, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007753f58ea42f in __GI___wait4 (pid=4091602, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007753f5f7286b in ggml_print_backtrace () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#2  0x00007753f5f72a02 in ggml_abort () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#3  0x00007753f4127ab6 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-cuda.so
#4  0x00007753f4136155 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-cuda.so
#5  0x00007753f5f8e367 in ggml_backend_sched_graph_compute_async () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#6  0x00007753f609fa71 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#7  0x00007753f609fe0d in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#8  0x00007753f60a66a7 in llama_context::decode(llama_batch const&) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#9  0x00007753f60a7540 in llama_decode () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#10 0x0000587f8a484b5a in test_prompt(llama_context*, int, int, int) ()
#11 0x0000587f8a47f114 in main ()
[Inferior 1 (process 4091593) detached]
Aborted (core dumped)

CPU-only build

./bin/llama-bench -m ../../hf/bailingmoe-v2/Ling-mini-2.0-Q4_K_M.gguf -p 2048 -n 0 -t 16 -fa 1
| model                          |       size |     params | backend    | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
iwan@tdcu-7950X:~/other/llama.cpp/nbuild$ Unable to attach: program terminated with signal SIGABRT, Aborted.
No stack.
The program is not being run.
ptrace: No such process.
No stack.
The program is not being run.

The model is from here

@CISC (Collaborator, Author) commented Oct 14, 2025

This comment answers this question

Thanks, reproduced, looking into it...

@CISC (Collaborator, Author) commented Oct 14, 2025

@ikawrakow Fixed, thanks again for the report.

@engrtipusultan

@ikawrakow Fixed, thanks again for the report.

Does this implementation cover linear attention ones as well?

https://huggingface.co/inclusionAI/Ring-mini-linear-2.0

@CISC (Collaborator, Author) commented Oct 15, 2025

@ikawrakow Fixed, thanks again for the report.

Does this implementation cover linear attention ones as well?

https://huggingface.co/inclusionAI/Ring-mini-linear-2.0

No, it's a different architecture that requires Lightning Attention to be implemented.

@wangsff commented Oct 16, 2025

Could we contact the reviewers to get this PR merged?


Labels

python (python script changes)

Development

Successfully merging this pull request may close these issues:

Feature Request: Ring-1T and Ling-1T support
Feature Request: BailingMoeV2 Support (Ling Lite 2.0)