Grouped expert routing (CPU only) #836
Conversation
So, was it my interpretation or implementation that was wrong?
Hard to say. My implementation does this (in plain English): …

In your PR you group all tokens into groups. There are matrix transpositions and what not. I can't tell if this is so because your interpretation of the Python implementation is different from mine, or if it is simply an incorrect implementation of the above algorithm using the limited capabilities of the underlying …
Probably a little bit of both. :) So, AFAICT our interpretations are basically the same, it's just that I got things mixed up on the first step (I struggled a bit with this because it wasn't possible to use …).

Ah well, thanks for the reality check, I'll look into fixing the implementation when I have time.
@ubergarm In case you are still into quant cooking, the Ling-1T/Ring-1T models are an opportunity to publish GGUFs before they become available for mainline. The PR there is still WIP, while I think the models should be functional in …
Thanks! I almost started downloading it a couple days ago, but am having hiccups with huggingface changing their public repo size allowances recently. I just subscribed at $9/mo to …

Working through it now and comparing …

Created a small patch PR here: #837, and will update if anything else comes up along the way. Hopefully I'll be able to upload imatrix dat and quants to huggingface if all goes well. 🤞
@ikawrakow I think I got it right now: ggml-org/llama.cpp@bc6c48a

For some reason when I try to use the same implementation in …
Weird, though awesome with the CUDA implementation. I think it's possible to optimize away the masking (and thus set rows) though, I'll give it a go... |
Guess not, doing so skews the ids for mulmat ops later. |
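A minimal torch-style illustration of that point (hypothetical scores and group mask, not code from either PR): top-k over a compacted score tensor returns indices into the compacted layout, which no longer match the expert ids that the later mul-mat-id ops expect, whereas top-k over the masked full tensor keeps the real expert ids.

```python
import torch

# 6 experts in 3 groups of 2; suppose groups 1 and 2 were selected.
scores = torch.tensor([[0.1, 0.9, 0.2, 0.8, 0.3, 0.7]])
keep   = torch.tensor([[False, False, True, True, True, True]])

# With masking: ids index the full expert dimension, as the expert matmuls expect.
masked = scores.masked_fill(~keep, float("-inf"))
print(masked.topk(2).indices)       # tensor([[3, 5]]) -> real expert ids

# Without masking (top-k over only the kept experts): ids are skewed.
compact = scores[keep].view(1, -1)  # experts 2..5 only
print(compact.topk(2).indices)      # tensor([[1, 3]]) -> wrong expert ids
```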
This PR adds grouped expert routing as used by the BailingMoeV2 arch (Ling/Ring models).
It is CPU only, so for now it is disabled by default. It is enabled via `-ger` or `--grouped-expert-routing`.

Quick testing with Ling-mini-2.0 with full GPU offload shows only a 20-30% performance degradation when using grouped expert routing (which runs on the CPU). For larger models and/or hybrid GPU/CPU inference the impact will be even smaller, so it is possible to try this option even before CUDA support is added.
The implementation in this PR is based on my interpretation of the original Python implementation, which clearly differs from @CISC's interpretation in the llama.cpp BailingMoeV2 PR.
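For reference, a minimal sketch of how such group-limited routing is commonly expressed in Python (my paraphrase of the usual DeepSeek-V3-style formulation; the "sum of the top-2 experts per group" group score, the function name, and the tensor shapes are assumptions, not code from this PR or from the model repo):

```python
import torch

def grouped_expert_routing(scores, n_group, topk_group, top_k):
    """scores: [n_tokens, n_experts] routing probabilities for one batch."""
    n_tokens, n_experts = scores.shape
    experts_per_group = n_experts // n_group

    # Score each group by the sum of its 2 best experts.
    group_scores = scores.view(n_tokens, n_group, experts_per_group)
    group_rank = group_scores.topk(2, dim=-1).values.sum(dim=-1)   # [n_tokens, n_group]

    # Keep only the best `topk_group` groups per token.
    kept = torch.zeros(n_tokens, n_group, device=scores.device)
    kept.scatter_(1, group_rank.topk(topk_group, dim=-1).indices, 1.0)
    expert_mask = kept.repeat_interleave(experts_per_group, dim=1).bool()

    # Mask out experts from discarded groups, then do ordinary top-k.
    masked = scores.masked_fill(~expert_mask, float("-inf"))
    weights, ids = masked.topk(top_k, dim=-1)  # ids still index the full expert dimension
    return weights, ids
```

Whether this matches the reference implementation in every detail is exactly what the discussion above is about; the sketch is only meant to make the comparison below easier to follow.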
The following table shows perplexities computed for Wikitext2 and the plain text version of the Pride and Prejudice novel from Project Gutenberg (column P&P in the table).
Based on this, my guess is that the implementation in this PR is more likely to be correct than the implementation in @CISC's llama.cpp PR.
In terms of CPU-only performance, grouped expert routing in this PR is about the same as, or even very slightly better than, standard top_k expert routing.