DeepSeek CUDA Flash Attention #241
Conversation
No TG yet, but for PP I can run FA with fp16 cache and it gets the same answer.
I'm by no means a CUDA programming expert, so I thought it would be interesting to see if a CUDA beginner can compete with the experts.
Cooking! Seriously good work. I don't believe there's any package that has FA implemented like this yet.
This PR from mainline llama.cpp may help with implementing MLA FA: ggml-org/llama.cpp#12227
Ha, this is exactly what I wanted to avoid and have avoided in the CPU implementation (unnecessarily crunching numbers only to throw them away). The "head" dimensions with MLA are 576 (K) and 512 (V). What the PR does is use 576 for both K and V, and then cut away the last 64 elements in each row of the FA result. As the multiplication with V then produces 576 instead of 512 values per row, the extra 64 values are computed only to be thrown away.
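As a rough illustration of the overhead being discussed here, the sketch below counts the matrix-math FLOPs per head for the padded-V approach versus the true V head size. The numbers and the tiny program are mine, not from either PR.

```cpp
// Rough FLOP count for one attention head over a context of n_kv tokens,
// comparing V padded to the K head size (576) vs. the true V head size (512).
// Illustrative only; real kernels add softmax, masking, etc.
#include <cstdio>

int main() {
    const long long n_kv = 16384;   // tokens in the KV cache (example value)
    const long long Dk   = 576;     // K head size with MLA
    const long long Dv   = 512;     // V head size with MLA

    // Q*K^T part is the same in both cases: n_kv dot products of length Dk.
    const long long qk_flops = 2 * n_kv * Dk;

    // softmax(Q*K^T) * V: one weighted sum of length n_kv per output element.
    const long long pv_true   = 2 * n_kv * Dv;  // proper Dv = 512
    const long long pv_padded = 2 * n_kv * Dk;  // V padded to 576, last 64 outputs discarded

    const double waste = 100.0 * (pv_padded - pv_true) / (double)(qk_flops + pv_true);
    printf("wasted work per head: %.1f%% of the FA matrix math\n", waste);
    // ~6% of the combined QK + PV work, or 64/512 = 12.5% of the PV part alone.
    return 0;
}
```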
That makes sense. I did see that your current implementation is different from the approach this PR takes. Just said I'd reference it in case it would be useful!
I'd hold off and see what @JohannesGaessler says, as the CUDA version either doesn't like the "Multi-Query Attention" (MQA) (i.e., 1 K/V head for 128 Q heads) and/or the 576 head dimension, as FA is using huge amounts of compute compared to non-FA at the same context... The non-FA half of the PR might be useful for…
It's running absolutely horribly at long contexts on CUDA - way, way worse than these extra 64 values alone would cause.
I kept those on purpose. This allows batch-processing…
For the split buffers specifically, my long-term goal is to move the parallelization logic to the ggml graph level. I intend to do this when optimizing training performance (so probably at some point in the next 12 months). After that the code should become simpler and easier to work with.
But people want to run DeepSeek now and not in 12 months 😄
This looks like a good alternative for reducing memory use if ultimately a head size of 576 isn't feasible. I've currently just been dropping the batch size.
This leads to horrible performance for MoE models, especially MoE models such as DeepSeekV3/R1. Just think about it: with the default u_batch, the tokens get routed across 256 experts with only 8 active per token, so each expert only sees a handful of rows per matrix multiplication; reduce the batch size further and those multiplications become essentially GEMVs, which are memory-bound and slow.
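A back-of-the-envelope sketch of that argument, assuming mainline llama.cpp's default u_batch of 512 and DeepSeek-V3/R1's 256 routed experts with 8 active per token (numbers mine, for illustration only):

```cpp
// Average number of rows each routed expert sees per matrix multiplication,
// as a function of u_batch. Assumes 256 routed experts with 8 active per token
// (DeepSeek-V3/R1) and uniform routing; real routing is lumpier than this.
#include <cstdio>

int main() {
    const int n_expert      = 256;
    const int n_expert_used = 8;

    const int batches[] = {2048, 512, 64, 1};
    for (int u_batch : batches) {
        const double rows_per_expert = (double)u_batch * n_expert_used / n_expert;
        printf("u_batch = %4d -> ~%.2f rows per expert matmul\n", u_batch, rows_per_expert);
    }
    // u_batch = 2048 gives 64 rows per expert (a reasonable GEMM), 512 gives 16,
    // 64 gives 2, and a single token gives 0.03, i.e. TG-style GEMV work.
    return 0;
}
```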
For what it’s worth, this works incredibly well! Can see some generation stats here: #237
Yeah, it's not quite as bad for me, though, as I found that even with a reduced batch size the slowdown is tolerable at first. This means I only start to see horrible performance drops when I have to go down to a double-digit batch size. I still like your method better, though, and agree it is vastly preferable to dropping the batch size. One other thing I've noticed with large contexts and …
Yeah, I can see this being really useful and a good alternative to using FA if you are low on VRAM.
This PR makes the CUDA FA implementation work when the V head size is not the same as the K head size (e.g., DeepSeek-Lite/V3/R1).

For TG I had to set the FA precision to `F32`, else we get gibberish. Not sure if it is really a matter of insufficient precision, or if I have missed something in the `f16` vector kernel.

The PR implements FA just for standard attention. FA for MLA is left for a follow-up PR.
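For reference, a minimal sketch of forcing F32 precision on the FA op at graph-build time. It assumes the ggml API here matches mainline llama.cpp (`ggml_flash_attn_ext` and `ggml_flash_attn_ext_set_prec`); the actual call site in this PR may differ.

```cpp
// Sketch only: request F32 accumulation for the flash-attention op.
// Assumes the ggml API matches mainline llama.cpp; the real call site
// in this PR may look different.
#include "ggml.h"

static struct ggml_tensor * build_attn_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,      // queries,    head size Dk first (ggml layout)
        struct ggml_tensor  * k,      // keys,       head size Dk first
        struct ggml_tensor  * v,      // values,     head size Dv first (Dv may differ from Dk)
        struct ggml_tensor  * mask,   // attention mask
        float                 kq_scale) {
    struct ggml_tensor * cur = ggml_flash_attn_ext(ctx, q, k, v, mask, kq_scale, 0.0f, 0.0f);

    // TG produced gibberish with the default f16 accumulation, so force F32:
    ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);

    return cur;
}
```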
Here are the mandatory performance comparisons. Model is `IQ4_NL`-quantized DeepSeek-Lite, GPU is RTX-4080.

First, prompt processing as a function of prompt length. It is a MoE model where it is better to use larger `u_batch` sizes, so all calculations are for `u_batch = 2048`, except no-FA for `pp16384`, where I had to use `u_batch = 1024` to not run out of GPU memory.

Nice gains, increasing with prompt length.
Here is TG performance for 128 tokens as a function of tokens in the KV cache (preceding prompt length):
Here the gains are very modest and, somewhat surprisingly, do not increase with KV cache size. I suspect the FA TG kernel is sub-optimal. It was inherited from mainline `llama.cpp`, and all I did was adjust the kernel template parameter `D` (head size) to be either `Dk` (K head size) or `Dv` (V head size) depending on context. A better kernel for `Dk != Dv` is left for another day. For now we enjoy the benefit of a much reduced compute buffer size.

To limit the already excessive CUDA build time, I have only allowed the K- and V-cache to be both `fp16` or both `Q8_0`.
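To illustrate the `Dk`/`Dv` split described above, here is a simplified scalar reference of flash attention with separate K and V head sizes. It is a CPU sketch for clarity only, not the CUDA kernel from this PR; the function name, layout, and template parameters are mine.

```cpp
// Simplified single-query, single-head flash-attention reference with
// separate K and V head sizes (Dk != Dv), e.g. Dk = 576, Dv = 512 for MLA.
// Scalar CPU sketch for illustration only, not the CUDA kernel in this PR.
#include <cmath>
#include <vector>

template <int Dk, int Dv>
void flash_attn_ref(const float * q,    // [Dk]
                    const float * k,    // [n_kv, Dk], row-major
                    const float * v,    // [n_kv, Dv], row-major
                    float       * out,  // [Dv]
                    int           n_kv,
                    float         scale) {
    float M = -INFINITY;                 // running max of the logits
    float S = 0.0f;                      // running sum of exp(logit - M)
    std::vector<float> acc(Dv, 0.0f);    // running weighted sum of V rows

    for (int j = 0; j < n_kv; ++j) {
        // the dot product uses the K head size Dk
        float s = 0.0f;
        for (int d = 0; d < Dk; ++d) s += q[d] * k[(size_t)j*Dk + d];
        s *= scale;

        // online softmax rescaling
        const float M_new = std::fmax(M, s);
        const float r     = std::exp(M - M_new);  // rescale factor for old accumulators
        const float p     = std::exp(s - M_new);  // weight of the current V row

        // the accumulation uses the V head size Dv
        for (int d = 0; d < Dv; ++d) acc[d] = acc[d]*r + p*v[(size_t)j*Dv + d];

        S = S*r + p;
        M = M_new;
    }

    for (int d = 0; d < Dv; ++d) out[d] = acc[d] / S;
}

// e.g. flash_attn_ref<576, 512>(q, k, v, out, n_kv, scale);
```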