Vulkan: a fresh start #608
Conversation
Thanks again for fussing with the Vulkan stuff. Some good news on these GPUs:
- NVIDIA 3090TI
- AMD 7900 XTX
Thanks for testing. With ... for coopmat1. Without this fix we get an assert. We get the same assert in mainline too.
Last commit fixes the assert.
This PR fixed the issues I had with Vulkan flash attention. Thanks!
You may want to take a look at this commit. Without this change, mainline ...
Thanks, I made a different fix upstream (see ggml-org/llama.cpp#14683). I noticed FA is failing for the scalar/coopmat1 paths with this model, but working for coopmat2. Did you happen to have a fix for that?
Failing in what sense? I haven't tested scalar, but coopmat1 and coopmat2 seem to be working here.
I got nonsense output running llama-cli with DeepSeek and FA enabled, but the backend tests all pass.
I cannot say that I like the responses with coopmat1, but at least it is not gibberish. The above PPL test shows a 0.06 difference between coopmat1 and coopmat2, which is too large to be just numerical roundoff. So, I guess, something is not quite right. I did notice that the Vulkan FA does not work at all with ... Oh, to answer the question: no, I don't have a fix.


It looks like something in the Vulkan back-end got broken while porting from mainline and/or adding changes. As I wasn't able to see what could be wrong, I decided to start from scratch from mainline tag `b5891`, and then add the 3 `ik_llama.cpp` fused ops not present in mainline. This PR is the result. To minimize differences for easier comparisons in the future, the mainline pieces not needed in `ik_llama.cpp` are kept instead of being deleted. It does seem to work for me, but I would appreciate more comprehensive testing from @ubergarm, @firecoperana, and others.
Two, I think, interesting observations:
- The Vulkan FA implementation needs the attention precision to be set to `fp32`. There is a difference between mainline and `ik_llama.cpp` in that regard. Mainline now just sets the precision to `fp32`, while in `ik_llama.cpp` this is only done for a select set of models. This may have been the actual reason for observing NaNs and gibberish. As I'm not ready to throw in the towel as mainline did at some point, I have changed the attention implementation to set the precision to `fp32` if it is one of the models known to require it, or if the Vulkan backend is enabled (see the sketch after this list). This will have the negative effect of also affecting CUDA if someone decides to build with both CUDA and Vulkan enabled, so it would probably be better to move this into the Vulkan backend itself (but this is left for a future PR as needed).
- Earlier I did not see the expected PP performance difference between `mla = 1` and `mla = 3` (see Vulkan: flash attention for DeepSeek models #584). With this PR I do see, as expected, a significantly higher PP performance with `mla = 3` (e.g., for a context of 16k tokens on an RTX-4080 with coopmat2 enabled: 1470 t/s with `mla = 3` vs 1086 t/s with `mla = 1`).
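
Below is a minimal sketch of the precision logic described in the first observation, not the actual PR code. It assumes the ggml helpers `ggml_mul_mat_set_prec` and `GGML_PREC_F32`, and the `GGML_USE_VULKAN` build macro; `model_needs_f32_attn` is a hypothetical stand-in for the existing per-model check.

```cpp
#include "ggml.h"

// Sketch only: force fp32 precision for the K*Q attention product when the
// model is known to require it, or when the build includes the Vulkan backend.
static void set_attn_precision(struct ggml_tensor * kq, bool model_needs_f32_attn) {
    bool force_f32 = model_needs_f32_attn;   // hypothetical per-model check
#ifdef GGML_USE_VULKAN
    // Vulkan enabled at build time: always promote, since the Vulkan FA path
    // produces NaNs/gibberish with the default (lower) precision.
    force_f32 = true;
#endif
    if (force_f32) {
        // Mark the mul_mat result so backends accumulate in fp32.
        ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
    }
}
```

Because this check runs at graph-build time, it also affects CUDA in a combined CUDA+Vulkan build, which is why moving the decision into the Vulkan backend itself is suggested as a follow-up.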