Grouped expert routing (CUDA) #838
Conversation
Thanks for putting together the CUDA implementation. I've seen at least a couple of positive reports.
Had a recent report on HF suggesting a possible issue: https://huggingface.co/ubergarm/Ling-1T-GGUF/discussions/6. Hopefully they respond either here or there.
Well, they should create an issue with a detailed description. Grouped expert routing is working just fine for me, and I was even thinking of making it the default and changing the flag accordingly.
I have the same config and I'm getting a segfault with -ger enabled. It runs fine without -ger. Should I create an issue then?
Hi -
Just to clarify: it works when I remove -ngl 99 \ and I can run fully on the CPU, but I can't load onto the GPUs. version: 3919 (5ae87f6). Here is the log: xeon@xeon-System-Product-Name:~/ik_llama.cpp/models$ numactl -N 0 -m 0 ggml_backend_register: registered backend CPU
Test results for a 12-core CPU + 2x RTX 3090 with ubergarm/Ling-1T-GGUF smol-IQ4-KSS and -ger disabled: [EDIT]: GPU usage: 23.8 GB each.
Regarding the segfault. [EDIT]: so it's a null-pointer dereference. The following patch proves that something is wrong with top_k (that is, no crash occurs, but the data is garbage). The src is placed with a 0x18 offset, right? ggml/src/ggml-cuda/argsort.cu
This PR adds a CUDA implementation of grouped expert routing as used by the BailingMoeV2 architecture (Ling/Ring models).
I initially tried a single-kernel implementation as on the CPU (PR #836), but that wasn't working, so the op is composed of several separate kernel launches. Still, performance is not too bad, with just a few percent degradation compared to standard routing (see tables below), and definitely much better than having the grouped expert top_k op done on the CPU.
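For readers unfamiliar with the scheme, here is a rough, self-contained sketch of the group-masking step of grouped expert routing. This is not the code from this PR; the group-score definition (sum of the two largest expert scores per group), the launch geometry, and the function names are assumptions for illustration only.

```cuda
// Hypothetical sketch, NOT the kernels from this PR: mask out experts whose
// group is not among the top n_group_used groups, so that a subsequent
// ordinary top_k kernel only ever picks experts from the selected groups.
// Assumptions for illustration: experts are split into <= 32 equal groups,
// and a group's score is the sum of its two largest expert scores.
#include <cuda_runtime.h>
#include <cfloat>

// One block per token (one row of the routing-score matrix), one thread per group.
__global__ void mask_unused_groups(float * scores, int n_experts, int n_groups, int n_group_used) {
    const int row   = blockIdx.x;
    const int group = threadIdx.x;
    const int experts_per_group = n_experts / n_groups;

    float * row_scores = scores + (size_t)row * n_experts + (size_t)group * experts_per_group;

    // group score = sum of the two largest expert scores in this group (assumption)
    float best1 = -FLT_MAX, best2 = -FLT_MAX;
    for (int i = 0; i < experts_per_group; ++i) {
        const float s = row_scores[i];
        if (s > best1) { best2 = best1; best1 = s; }
        else if (s > best2) { best2 = s; }
    }
    __shared__ float group_scores[32];
    group_scores[group] = best1 + best2;
    __syncthreads();

    // rank this group; ties broken by group index so exactly n_group_used groups survive
    int n_better = 0;
    for (int g = 0; g < n_groups; ++g) {
        if (group_scores[g] > group_scores[group] ||
           (group_scores[g] == group_scores[group] && g < group)) {
            ++n_better;
        }
    }
    if (n_better >= n_group_used) {
        for (int i = 0; i < experts_per_group; ++i) {
            row_scores[i] = -FLT_MAX;   // masked experts can never win the final top_k
        }
    }
}

// Host-side launch for n_tokens rows; the final expert top_k runs as a separate kernel.
static void launch_mask_unused_groups(float * d_scores, int n_tokens, int n_experts,
                                      int n_groups, int n_group_used, cudaStream_t stream) {
    mask_unused_groups<<<n_tokens, n_groups, 0, stream>>>(d_scores, n_experts, n_groups, n_group_used);
}
```

After such a masking pass, an ordinary top_k/argsort kernel can select the final experts from the surviving scores, which is consistent with the multi-kernel structure described above.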
The mainline BailingMoeV2 PR has not been merged yet, so I will not give performance comparisons to llama.cpp.

Ling-mini-2.0-Q4_K_M, standard expert routing
Ling-mini-2.0-Q4_K_M, grouped expert routing