Enable CUDA graphs for MoE models + GPT-OSS support #689
Conversation
Model loads and runs (CPU only), but PPL is much too high (~1500 for 1st batch vs ~200 in mainline). Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
It was the SWA that was missing in the previous commit. There are issues with EOG tokens, so this still needs to be added.
Just a copy from mainline
Haven't turned it on yet, but observe slightly better PP and slightly worse TG performance with that.
Turning it off for now as performance becomes more variable, so perhaps I'm running into thermal throttling more often because of making the CPU work too hard.
Likely not all MLA variants are working. I no longer remember why I added the q8_0 cpy that transposes the tensor, but if it is really needed, it is now missing. Also missing is q6_0.
I'm starting to have doubts about whether @ikawrakow is even human or if he's AGI. This is really impressive work. I'll be testing the perfs, thank you so much!
Haha. If that were true, it would be ikawrakowAGI becoming the multi-trillion dollar company, not OpenAI or Anthropic or Google or ... So perhaps it is time for people to start investing in that 😆
Is there an effect for hybrid inference?
I'm facing this issue when running
Similar issue documented here: pytorch/pytorch#87794
@Thireus You need That being said, I thought that I had fixed it to work without
@ikawrakow - haha, yes much better with
@ubergarm Thanks for the benchmarks! Interesting that in your case the gain is quite a bit smaller compared to what I observe. Not really sure why. Model? GPU? Driver? (Although you are on 570, so a newer driver than the 565 I have installed on the system where I tested.)
Okay, seeing good uplift here with this PR running ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF "pure" Q4_0 with full GPU offload on my 3090TI FE 24GB VRAM GPU with newer CUDA drivers on Arch Linux. (This is a rare mainline-compatible quant I released mainly for testing Vulkan.) In fact, the CUDA backend is now faster at TG than Vulkan here on ik's fork again for this one.

👈 Details

./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa \
    -c 20480 \
    -ngl 99 \
    --threads 1 \
    -ub 4096 -b 4096 \
    --warmup-batch

Configurations benchmarked:

ik PR CUDA -ub 4096 -b 4096 -fa -fmoe
ik main CUDA -ub 4096 -b 4096 -fa -fmoe
ik main Vulkan NV_coopmat2 -ub 4096 -b 4096 -fa
mainline lcpp CUDA master@7a0de960 + ug/port-sweep-bench -ub 4096 -b 4096 -fa
mainline lcpp Vulkan NV_coopmat2 master@7a0de960 + ug/port-sweep-bench -ub 4096 -b 4096 -fa
Btw, there is this issue in mainline, which is quite interesting. Somebody has noticed that So, I think, one definitely needs
Some stats at large context length. It looks like CUDA graphs are bringing a +9% TG speed increase for ik_llama.cpp. The gap with llama.cpp is quite noticeable though.

llama.cpp (main) - Windows builds: b6168
Large Prompt - Round 1:
Large Prompt - Round 2:
ik_llama.cpp (main) - Windows builds: main-b4074-62ef02e
Large Prompt - Round 1:
Large Prompt - Round 2:
ik_llama.cpp (8a83e1f) - Windows builds: ik-try_cuda_graphs-b4107-7693263
Large Prompt - Round 1:
Large Prompt - Round 2:
Recipe used: GLM-4.5-Air.ROOT-4.7789bpw-4.6437ppl.63GB-GGUF_5GB-GPU_58GB-CPU.6d32a73_ed85f05.recipe
What? They did merge ggml-org/llama.cpp#11571 (a cleaner revision of what I ported, which was an earlier commit of it), and then updated it to be faster (ggml-org/llama.cpp#14753). At least that was the state of it I was aware of from the PRs.
@Thireus Not sure I understand your results. Initially, using CUDA graphs basically doubles TG performance in your setup, but then it becomes just 9% in the latest tests? Looking at the graphs you had posted here, it seems
@ikawrakow - The last results are at 130K context size, processing a 104K prompt. I'll run sweep-bench; hopefully that will make things clearer.
@ikawrakow - See results below. Around 35k context size is where CUDA graphs on ik_llama.cpp stop making a big difference. Also interesting to see that around 30k context size, llama.cpp PP speed becomes faster than ik_llama.cpp.
Seems when using hybrid inference, TG was about 10% slower than the last version on Ubuntu.

cmake -B ./build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120" -DGGML_SCHED_MAX_COPIES=1 -DGGML_BLAS=OFF

/home/ee/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server
If so, add
However, my graphics card is Blackwell (on CUDA 13.0), not Turing.
Then I need to update the description to not specifically mention Turing. Until now we had one user reporting performance degradation with graphs; now we have two.
@oovloveme Btw, I would try using
Wouldn't disabling CUDA graphs just give lower performance if they had CUDA graphs on before?
No, because CUDA graphs were disabled for MoE models before this PR, and for hybrid inference the impact is small to none, as you found out yourself.
@ikawrakow - You're right, something is up with the GLM-4.5 implementation. Here's Qwen3-235B-A22B-Thinking-2507 with more consistent results.

Recipe used:
This has now been fixed. See #700 (comment)
} else {
    // token is control, but not marked as EOG -> print a debug log
    if (id_to_token[t.second].attr & LLAMA_TOKEN_ATTR_CONTROL && special_eog_ids.count(t.second) == 0) {
        LLAMA_LOG_DEBUG("%s: control token: %6d '%s' is not marked as EOG\n",
Is this needed?
It adds 816 lines to my model load of DeepSeek (single line example: load: control token: 128713 '<|place▁holder▁no▁713|>' is not marked as EOG).
Sorry, LLAMA_LOG_DEBUG is supposed to be active only in debug builds. Forgot to fix it. But it would also be good to understand whether it is bad that these tokens are not marked as EOG and, if it is, why they are not marked as such.
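As a side note, a minimal sketch of what "active only in debug builds" would mean in practice, assuming the gating simply follows the standard CMake build type (an assumption on my part, not something this PR specifies):

```bash
# Assumption: LLAMA_LOG_DEBUG output shows up only when the project is configured as a Debug build.
cmake -B build-debug -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=ON
cmake --build build-debug -j

# A regular Release build would then keep the model load output free of these lines.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -j
```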
To be honest I don't fully understand what it is printing. It says on one of the 816 lines:
load: control token: 1 '<|end▁of▁sentence|>' is not marked as EOG
but then later on load:
load: printing all EOG tokens:
load: - 1 ('<|end▁of▁sentence|>')
and I have used this model file extensively; it does treat that as an EOG.
This PR enables CUDA graphs for MoE models (TG only, just like in mainline llama.cpp). The implementation is largely based on mainline, but given the massive divergence between the two code bases and the fact that CUDA graphs in mainline are the result of many PRs, I couldn't cherry-pick, so it is copy/adjust.

Unlike earlier CUDA graph incarnations that I have tried, this time I'm observing non-negligible TG performance gains on Linux. Given recent reports about mainline's TG performance being better than ik_llama.cpp on Windows, my guess is that on Windows the impact will be significantly higher. I have tested with 3 different MoE models that I can fully offload on my RTX-4080 GPU (GPT-OSS-20B, Qwen3-30B-A3B, DeepSeek-Lite-16B).

Worth noting (updated): mla = 0 and mla = 2 will no longer work for DeepSeek models, as some pieces are still missing when CUDA graphs are enabled. But given that mla = 3 is the recommended option and the fact that using mla = 0 or mla = 2 was discouraged quite some time ago, this should be OK. -mla 1,3 is required for DeepSeek models only when not using f16 KV cache.

Note: this PR has been branched off the unmerged PR #683, not the main branch, and hence it includes GPT-OSS support. This is also the reason the change is so large (it includes the +7083/-4096 changes from #683).
Important

For MoE models -fmoe is required, else graphs will be disabled.
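For illustration, a minimal sketch of a fully offloaded sweep-bench run with -fmoe added on top of the flags already used in this thread (the model path is a placeholder, not from this PR):

```bash
# Hypothetical invocation: fused MoE ops (-fmoe) enabled so the CUDA graphs from this PR are used for TG.
model=/path/to/some-moe-model.gguf   # placeholder path
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa -fmoe \
    -ngl 99 \
    -c 20480 \
    -ub 4096 -b 4096
```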
Important

There is a report of reduced performance with CUDA graphs here and here. If you observe lower performance after this PR has been merged, you can disable CUDA graphs by adding -DGGML_CUDA_USE_GRAPHS=OFF to the cmake build command (a build sketch follows below).

@Thireus I would appreciate testing this PR with your Windows setup with GLM-4.5-Air (and other MoE models).
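As a reference point, a minimal sketch of such a rebuild, assuming a CUDA build configuration along the lines of the one posted earlier in this thread (architectures and other options should follow your usual setup):

```bash
# Rebuild with CUDA graphs compiled out; all other flags mirror your normal CUDA build.
cmake -B ./build -DGGML_CUDA=ON -DGGML_CUDA_USE_GRAPHS=OFF
cmake --build ./build --config Release -j
```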
Here are some performance comparisons on RTX-4080.
DeepSeek-Lite-16B, Q4_0
Qwen3-30B-A3B, Q2_K_S
GPT-OSS-20B, MXFP4