model : add BailingMoeV2 support #16063
Hello, I think it's necessary to split and permute the Q/K weights because of the difference in the implementation of Rotary Position Embedding (RoPE). In the Hugging Face transformers library, the implementation is as follows:
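A minimal sketch of that half-rotation form (the `rotate_half` style used throughout transformers' modeling code), simplified to leave out position-id gathering and head reshaping:

```python
import torch

def rotate_half(x):
    # Split the head dimension into two contiguous halves and rotate them.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```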
This implementation splits the features into two contiguous halves and then rotates them. However, the implementation in llama.cpp is equivalent to this:
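A sketch of that interleaved (pairwise) form, where each even/odd feature pair is rotated together; `rotate_every_two` is the name this helper usually goes by in GPT-J-style code:

```python
import torch

def rotate_every_two(x):
    # Pair up even and odd feature indices and rotate each pair.
    x1 = x[..., 0::2]
    x2 = x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rotary_pos_emb_interleaved(q, k, cos, sin):
    # cos/sin are repeated per pair here, rather than concatenated per half.
    q_embed = (q * cos) + (rotate_every_two(q) * sin)
    k_embed = (k * cos) + (rotate_every_two(k) * sin)
    return q_embed, k_embed
```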
This version interleaves the features, selecting even and odd indices separately before rotating. Therefore, to ensure compatibility, the model parameters for Q and K need to be split and permuted to match the llama.cpp RoPE implementation, similar to the approach in this commit: 09e3df4 |
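For illustration, the split-and-permute being suggested is along the lines of the `permute` helper used for Llama-family Q/K weights in `convert_hf_to_gguf.py`; the name `permute_qk` below is just a placeholder:

```python
import numpy as np

def permute_qk(weights: np.ndarray, n_head: int) -> np.ndarray:
    # Reorder each head's rows from [first half | second half] layout into
    # interleaved (even, odd) layout so the pairing matches interleaved RoPE.
    head_rows = weights.shape[0] // n_head
    return (weights.reshape(n_head, 2, head_rows // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```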
Nope, this is standard NeoX RoPE. |
Got it, thanks for pointing that out! |
Hello, could we flip this PR to “Open” and put it up for review? We're looking forward to being able to use it directly from our laptop as soon as possible. It's a lightweight model that also offers sufficiently high inference speed, and we believe it has the potential to be many people's next-generation AI assistant. |
I will after I've reimplemented the expert group selection, pending the merge of #16159 |
Hi, I've observed some unexpected behaviors in the current PR during testing, particularly with long-context processing and multi-turn dialogues.
Screenshots (omitted): llama-cli comparison with the same input; eval callback comparison before RoPE.
Reproduce:
Step 1: Download the Ling mini 2.0 model: https://huggingface.co/inclusionAI/Ling-mini-2.0/tree/main
Step 2: Convert HF to GGUF: `python convert_hf_to_gguf.py --outtype bf16 ./Ling-mini-2.0`
Step 3: Run the eval callback: `./build-arm64-apple-clang-debug/bin/llama-eval-callback --log-prefix -v -m ./Ling-mini-2.0-BF16.gguf -p "你好,你是谁?" --log-file ling-mini-bf16.txt`
Detailed eval callback logs |
@im0qianqian Thanks for testing, but first make sure you are using the same chat template, yours does not match the jinja one, so use `--jinja`. As for the eval callback it would be more interesting to compare against HF to see which direction each may deviate. |
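For the HF side of that comparison, something along these lines could produce reference activations to check the eval-callback dump against (the repo id comes from the link above; `trust_remote_code` is assumed to be needed for the custom architecture):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

messages = [{"role": "user", "content": "你好,你是谁?"}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)
# out.hidden_states holds the per-layer activations to line up against the llama-eval-callback output.
```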
Sorry — my baseline eval-callback log was flawed.
This is the new eval callback log: ling-mini-bf16-topk-router.txt |
There is probably some issue with it right now, yes, let's retest once I reimplement it with `set_rows`. No need to disable the expert group selection, though. |
Just checking in on this, are we waiting on another PR to merge? I noticed Inclusion themselves published a GGUF ahead of official support. It would be nice to get official support merged! https://huggingface.co/inclusionAI/Ling-mini-2.0-GGUF |
Sigh, it's preferable not to release GGUFs before a PR is merged, there may be breaking changes during review... |
They have their own fork for inference. |
Hi, when is this PR expected to be merged into master? |
Hello, perhaps we have a better solution. Ling mini 2.0 is a small model with strong capabilities and exceptionally fast inference. Many developers in the community currently hope to use this model on platforms like llama.cpp / ollama / LM Studio, so we could prioritize merging a version without group routing support; that way everyone can at least use the model normally, and the feature can be added back once the remaining issue with the group selection is resolved. In my view, group-wise routing is primarily a pre-training strategy that, combined with optimizations like DeepEP under 256 experts, achieves better training efficiency. For inference, the model can still perform excellently even without group routing. |
I will try to resolve this today/this weekend, but as you say there really is no issue in releasing a version without it (as long as the metadata is there to support it in the future). |
I didn't write the kernel and have never touched it but that was my interpretation of the code as well. |
For now, modifying the |
If we had something like argmax2 that would be equivalent, but this works fine until then.
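For context, the group-limited routing under discussion scores each expert group by its two best experts (hence the argmax2 remark) and masks everything outside the winning groups before the final top-k. A rough sketch in the style of the DeepSeek-V3-like reference code, with hypothetical parameter names:

```python
import torch

def group_limited_topk(scores: torch.Tensor, n_group: int, topk_group: int, top_k: int):
    # scores: (n_tokens, n_experts) router scores
    n_tokens, n_experts = scores.shape
    # Score each group by the sum of its two best experts ("argmax2").
    group_scores = scores.view(n_tokens, n_group, -1).topk(2, dim=-1).values.sum(dim=-1)
    # Keep only the topk_group best groups for every token.
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(-1, group_idx, 1.0)
    expert_mask = group_mask.repeat_interleave(n_experts // n_group, dim=-1)
    # Mask experts outside the selected groups, then do the usual top-k.
    masked_scores = scores.masked_fill(expert_mask == 0, float("-inf"))
    return masked_scores.topk(top_k, dim=-1)
```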
@CISC Is this still a work in progress, or is it halted for some reason? |
The expert group selection is not the same, but works well enough for now (can be improved later), so just waiting for review. |
Can it support the new Ling-1T? |
Yes. |
Looking forward to this PR being merged into master ASAP! |
Does this mean the GGUF will perform slightly worse than it theoretically should? |
Probably subjective at best, it just means that it will not always select the same groups due to different scoring. |
soo long |
Is there any way to get the reviewers to notice this? |
This comment answers this question.
CUDA build:
CPU-only build:
The model is from here. |
Thanks, reproduced, looking into it... |
@ikawrakow Fixed, thanks again for the report. |
Does this implementation cover linear attention ones as well? |
No, it's a different architecture that requires Lightning Attention to be implemented. |
Could we contact the reviewers to get this PR merged? |
Adds support for Ling(-flash) 2.0 and Ring 2.0, base and instruction tuned models (sans MTP, but layer included).
Includes expert group selection implemented as a `ggml_custom` function, not sure if there's a better way, or if it makes sense to implement some sort of masking op?
Requires #16060 to work on CUDA, also seems to be some issue with CPU, will check later. Either way, will reimplement expert group selection with `set_rows` once possible.