
Conversation

@CISC (Collaborator) commented Sep 18, 2025

Adds support for Ling(-flash) 2.0 and Ring 2.0, base and instruction tuned models (sans MTP, but layer included).

Includes expert group selection implemented as a ggml_custom function; not sure if there's a better way, or whether it makes sense to implement some sort of masking op?

Requires #16060 to work on CUDA; there also seems to be some issue with CPU, will check later. Either way, I will reimplement expert group selection with set_rows once possible.
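For reference, a minimal NumPy sketch (not the ggml_custom code from this PR) of what the expert group selection step computes; the per-group scoring formula (sum of the top-2 expert scores per group) is an assumption borrowed from DeepSeek-style grouped routers, and the masking idea mentioned above corresponds to setting the scores of non-selected groups to -inf before the usual top-k:

# Hedged sketch of grouped expert selection in NumPy; not the ggml_custom
# implementation from this PR, and the per-group scoring is an assumption.
import numpy as np

def group_topk(scores, n_groups, topk_groups, topk_experts):
    n_tokens, n_experts = scores.shape
    grouped = scores.reshape(n_tokens, n_groups, n_experts // n_groups)

    # score each group by the sum of its two best experts (assumed formula)
    group_scores = np.sort(grouped, axis=-1)[..., -2:].sum(axis=-1)

    # keep the best groups, mask the rest to -inf, then do the usual top-k
    keep = np.argsort(-group_scores, axis=-1)[:, :topk_groups]
    mask = np.full_like(group_scores, -np.inf)
    np.put_along_axis(mask, keep, 0.0, axis=-1)
    masked = (grouped + mask[:, :, None]).reshape(n_tokens, n_experts)

    return np.argsort(-masked, axis=-1)[:, :topk_experts]

# toy example: 4 tokens, 16 experts in 4 groups, keep 2 groups, select 4 experts
print(group_topk(np.random.rand(4, 16), n_groups=4, topk_groups=2, topk_experts=4))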

@CISC requested a review from ggerganov September 18, 2025 02:34
github-actions bot added the python (python script changes) label Sep 18, 2025
@CISC linked an issue Sep 18, 2025 that may be closed by this pull request
@CISC marked this pull request as draft September 18, 2025 10:53
@yqy3214 commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).

In the Hugging Face transformers library, the implementation is as follows:

def rotate_half(x):
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

This implementation splits the features into two contiguous halves and then rotates them.

However, the implementation in llama.cpp is equivalent to this:

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

This version interleaves the features, selecting even and odd indices separately before rotating.

Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation, similar to the approach in this commit: 09e3df4

@CISC (Collaborator, Author) commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).
[...]
Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation

Nope, this is standard NeoX RoPE, llama.cpp supports that internally.
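For anyone following along, a small self-contained demo (not from the PR) of the two pairing conventions side by side; the half-split variant is what llama.cpp calls NeoX-style RoPE and is selected via the rope type, so no weight permutation is needed:

# Illustrative only: the two rotate_half conventions pair different features.
import torch

def rotate_half_split(x):
    # HF / NeoX style: pair feature i with feature i + d/2
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def rotate_half_interleaved(x):
    # interleaved style: pair feature 2i with feature 2i+1
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

x = torch.arange(8, dtype=torch.float32)
print(rotate_half_split(x))        # tensor([-4., -5., -6., -7.,  0.,  1.,  2.,  3.])
print(rotate_half_interleaved(x))  # tensor([-1.,  0., -3.,  2., -5.,  4., -7.,  6.])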

@yqy3214 commented Sep 18, 2025

Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE).
[...]
Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation

Nope, this is standard NeoX RoPE, llama.cpp supports that internally.

Got it, thanks for pointing that out!

@im0qianqian

Hello, could we flip this PR to “Open” and put it up for review?

We're looking forward to being able to use it directly from our laptop as soon as possible. It's a lightweight model that also offers sufficiently high inference speed, and we believe it has the potential to be many people's next-generation AI assistant.

@CISC (Collaborator, Author) commented Sep 22, 2025

Hello, could we flip this PR to “Open” and put it up for review?

I will after I've reimplemented the expert group selection, pending the merge of #16159

@im0qianqian

Hi, I've observed some unexpected behaviors in the current PR during testing, particularly with long-context processing and multi-turn dialogues.
We cross-checked PR #16028 against Hugging Face's forward pass for consistency, and then compared eval-callback results between PR #16028 and this PR. The outputs were nearly identical before applying RoPE, but diverged significantly afterward.
There may be something non-ideal in the implementation causing this; I have gone through the code carefully but have not found the error yet.


Screenshots (omitted): llama-cli output with the same input for this PR (#16063) vs #16028, and eval-callback comparisons before and after RoPE.


Reproduce

Step 1: Download Ling mini 2.0 model

https://huggingface.co/inclusionAI/Ling-mini-2.0/tree/main

Step 2: Convert HF to GGUF

python convert_hf_to_gguf.py --outtype bf16 ./Ling-mini-2.0

Step 3: eval callback

./build-arm64-apple-clang-debug/bin/llama-eval-callback --log-prefix -v -m ./Ling-mini-2.0-BF16.gguf -p "你好,你是谁?" --log-file ling-mini-bf16.txt

Detailed eval-callback logs

@CISC (Collaborator, Author) commented Sep 23, 2025

@im0qianqian Thanks for testing, but first make sure you are using the same chat template; yours does not match the Jinja one, so use --jinja for both.

As for the eval callback, it would be more interesting to compare against HF to see in which direction each deviates.
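For example, a minimal sketch (not part of the PR; the model id, dtype, and prompt are assumed from the reproduce steps above) of dumping per-layer hidden states from HF transformers so the values can be lined up against the llama-eval-callback output:

# Hedged sketch: dump per-layer hidden states from the HF reference model so the
# values can be compared against llama-eval-callback tensors layer by layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"  # assumed, as in the reproduce steps above
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             trust_remote_code=True)

inputs = tok("你好，你是谁？", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, hidden_states[i] the output of block i
for i, h in enumerate(out.hidden_states):
    print(i, tuple(h.shape), h[0, -1, :4].float().tolist())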

@im0qianqian commented Sep 23, 2025

@im0qianqian Thanks for testing, but first make sure you are using the same chat template, yours does not match the jinja one, so use --jinja for both.

As for the eval callback it would be more interesting to compare against HF to see which direction each may deviate.

Sorry, my baseline eval-callback log was flawed.
RoPE has been validated as correct; the issue lies in the group top-k router. When I reverted to the plain top-k router, the model output was normal.

> Hello, do you like watching TV or playing games?
I don't have personal preferences or the ability to watch TV or play games, as I'm an AI designed to assist with information and tasks. However, I can certainly help you explore your interests or provide recommendations based on what you enjoy! If you're looking for suggestions on TV shows, games, or anything else, feel free to ask. What do you enjoy most? 😊

> Please follow me and say: I'm a big fan of both TV and games! Watching TV lets me dive into different worlds, meeting new friends, and experiencing exciting adventures.
I'm a big fan of both TV and games! Watching TV lets me dive into different worlds, meeting new friends, and experiencing exciting adventures. If you have any specific recommendations or topics you'd like to explore further, just let me know! 😊

> Where is the sun?
The sun is a star at the center of our solar system. It is about 93 million miles (150 million kilometers) away from Earth. If you're asking about its current position in the sky, that depends on your location and the time of day. Would you like to know more about the sun's position at a specific time and place?


This is the new eval callback log: ling-mini-bf16-topk-router.txt

@CISC (Collaborator, Author) commented Sep 23, 2025

There is probably some issue with it right now, yes; let's retest once I reimplement it with set_rows.

No need to disable ffn_moe_weights_sum_biased, that part is correct at least:
https://huggingface.co/inclusionAI/Ling-flash-base-2.0/blob/5044774b682c9bdc49c9f9beabe3c956d0f92d07/modeling_bailing_moe_v2.py#L345

@warshanks commented Sep 25, 2025

Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published a GGUF ahead of official support. It would be nice to get official support merged!

https://huggingface.co/inclusionAI/Ling-mini-2.0-GGUF
https://huggingface.co/inclusionAI/Ring-mini-2.0-GGUF

@CISC (Collaborator, Author) commented Sep 25, 2025

Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published GGUF ahead of official support. It would be nice to get official support merged!

Sigh, it's preferable not to release GGUFs before a PR is merged; there may be breaking changes during review...

@jacekpoplawski (Contributor)

Sigh, it's preferable not to release GGUFs before a PR is merged, there may be breaking changes during review...

they have their own fork for inference

@wangsff commented Sep 26, 2025

Hi, when is this PR expected to be merged into master?
I'm looking forward to trying out Ling on my laptop.

@im0qianqian

Hello, perhaps we have a better solution. The Ling mini 2.0 is a small-sized model with strong capabilities and exceptionally fast inference performance. Currently, many developers in the community hope to use this model on platforms like llama.cpp / ollama / lm studio.

Therefore, we could prioritize merging a version without group routing support. This way, at least everyone can use the model normally. Once we resolve the issue in set_rows, we can then submit a PR supporting group routing to unlock the model’s more outstanding performance.

In my view, group-wise routing is primarily a pre-training strategy that, combined with optimizations like DeepEP under 256 experts, achieves better training efficiency. For inference, the model can still perform excellently even without group routing.

@CISC (Collaborator, Author) commented Sep 26, 2025

Therefore, we could prioritize merging a version without group routing support. This way, at least everyone can use the model normally. Once we resolve the issue in set_rows, we can then submit a PR supporting group routing to unlock the model’s more outstanding performance.

I will try to resolve this today/this weekend, but as you say there really is no issue in releasing a version without it (as long as the metadata is there to support it in the future).

@JohannesGaessler (Collaborator)

I didn't write the kernel and have never touched it, but that was my interpretation of the code as well.

@slaren (Member) commented Sep 28, 2025

For now, modifying supports_op to check the limits should be enough to avoid the crash; an optimization can come later. Btw, the CPU implementation is also very bad for big tensors (it's a bubble sort).

if we had something like argmax2 that would be equivalent, but this works fine until then
@CISC requested a review from slaren as a code owner September 29, 2025 09:13
@engrtipusultan

@CISC is this still a work in progress, or is it halted for some reason?

@CISC (Collaborator, Author) commented Oct 6, 2025

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

@oovloveme

Can it support the new Ling-1T?

@CISC (Collaborator, Author) commented Oct 9, 2025

Can it support the new Ling-1T?

Yes.

@wangsff commented Oct 10, 2025

Looking forward to this PR being merged into master ASAP!

@qingy1337

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

Does this mean the GGUF will perform slightly worse than it theoretically should?

@CISC (Collaborator, Author) commented Oct 11, 2025

@CISC is this still work in progress or is it halted for some reason.

The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review.

Does this mean the GGUF will perform slightly worse than it theoretically should?

Probably subjective at best; it just means that it will not always select the same groups due to different scoring.

@wang824892540

soo long

@CISC linked an issue Oct 14, 2025 that may be closed by this pull request
@wateryuen

Is there any way to get the reviewers to notice this?

@ikawrakow (Contributor)

This comment answers this question:

CUDA build

./bin/llama-bench -m Ling-mini-2.0-Q4_K_M.gguf -p 2048 -n 0 -t 1 -ngl 100 -fa 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
/home/iwan/other/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:88: CUDA error
[New LWP 4091594]
[New LWP 4091598]
[New LWP 4091599]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007753f58ea42f in __GI___wait4 (pid=4091602, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007753f58ea42f in __GI___wait4 (pid=4091602, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007753f5f7286b in ggml_print_backtrace () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#2  0x00007753f5f72a02 in ggml_abort () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#3  0x00007753f4127ab6 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-cuda.so
#4  0x00007753f4136155 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-cuda.so
#5  0x00007753f5f8e367 in ggml_backend_sched_graph_compute_async () from /home/iwan/other/llama.cpp/ncuda/bin/libggml-base.so
#6  0x00007753f609fa71 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#7  0x00007753f609fe0d in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#8  0x00007753f60a66a7 in llama_context::decode(llama_batch const&) () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#9  0x00007753f60a7540 in llama_decode () from /home/iwan/other/llama.cpp/ncuda/bin/libllama.so
#10 0x0000587f8a484b5a in test_prompt(llama_context*, int, int, int) ()
#11 0x0000587f8a47f114 in main ()
[Inferior 1 (process 4091593) detached]
Aborted (core dumped)

CPU-only build

./bin/llama-bench -m ../../hf/bailingmoe-v2/Ling-mini-2.0-Q4_K_M.gguf -p 2048 -n 0 -t 16 -fa 1
| model                          |       size |     params | backend    | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------: | -------------------: |
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
/home/iwan/other/llama.cpp/ggml/src/ggml-cpu/ops.cpp:4663: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
iwan@tdcu-7950X:~/other/llama.cpp/nbuild$ Unable to attach: program terminated with signal SIGABRT, Aborted.
No stack.
The program is not being run.
ptrace: No such process.
No stack.
The program is not being run.

The model is from here

@CISC (Collaborator, Author) commented Oct 14, 2025

This comment answers this question

Thanks, reproduced, looking into it...

@CISC (Collaborator, Author) commented Oct 14, 2025

@ikawrakow Fixed, thanks again for the report.

@engrtipusultan

@ikawrakow Fixed, thanks again for the report.

Does this implementation cover linear attention ones as well?

https://huggingface.co/inclusionAI/Ring-mini-linear-2.0

@CISC (Collaborator, Author) commented Oct 15, 2025

@ikawrakow Fixed, thanks again for the report.

Does this implementation cover linear attention ones as well?

https://huggingface.co/inclusionAI/Ring-mini-linear-2.0

No, it's a different architecture that requires Lightning Attention to be implemented.

@wangsff commented Oct 16, 2025

Could we contact the reviewers to get this PR merged?


Labels

python (python script changes)

Development

Successfully merging this pull request may close these issues:

Feature Request: Ring-1T and Ling-1T support
Feature Request: BailingMoeV2 Support (Ling Lite 2.0)