DeepSeek: enable option to merge Q and K tensors #941
Conversation
Didn't see much of a boost or any negative side effects. Tested IQ2 V3.
Thanks for testing. Did you use `-mqkv`?
For short context and CPU only,
When running CPU-only
This was on my Zen4 CPU test; I will try some longer context lengths, like over 10K.
Yes, the change in performance should not depend on the KV cache type. But I'm surprised your Zen4 CPU has a lower performance for
9454P, |
Thanks for the tips. Tested this PR without `-mqkv`:
With `-mqkv`:
Yep, I put `-mqkv` on. On the other topic, with my Xeons: a lot of these tweaks have been minor on their own, and then I put them all on one day and gain a t/s or two. Individually they are often lost in the noise of the sweep bench. BTW, llama-bench is segfaulting with DeepSeek for some reason.
Yes, on the CPU it may not bring any benefits. It is mostly for inference with full GPU offload when the cost of kernel launch is not negligible compared to the kernel processing time (i.e., for not too large models). But at least it looks like I haven't broken the graph building, which is good news. |
Can you run and then say
Thanks! Just to make sure: is the crash with this PR? It crashes when building the graph. Not sure I understand why it works for @calvin2021y but crashes for you. And I understand even less why it crashes for you in llama-bench.
I noticed it after this PR, but I think it started a little earlier. I've been trying to build the chart from #910. I did it successfully for GLM but not for DeepSeek.
The DeepSeek self-attention mechanism is quite different from that of other models, so merging the "Q" and "K" model tensors is also much trickier than doing so for standard self-attention. But I was curious to see if it can be done, and this PR shows that it is possible.
For DeepSeek-Lite fully offloaded, this gives a 1.5-2% benefit in TG performance.
I cannot test with the larger siblings (R1/V3/Kimi2), so I'm not sure I haven't broken something: there is one additional matrix multiplication involved, and it is easy to make a mistake with the views into the result of the merged matrix multiplication.
As with other Q/K/V merges, enabling this will disable `mmap`. The option is disabled by default and is enabled with `-mqkv`.
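For readers wondering what "views into the result of the merged matrix multiplication" looks like in practice, below is a minimal ggml-style sketch, not the actual code from this PR. It glosses over the DeepSeek/MLA specifics and uses placeholder names (`build_merged_qk`, `wqk`, `cur`, `n_q_rows`, `n_k_rows`, `n_tokens`); the assumption is that the Q and K projection weights were concatenated along the output dimension at load time.

```c
#include "ggml.h"

// Hypothetical helper (placeholder names, not the PR's actual code):
// compute Q and K with a single matrix multiplication over a merged weight
// tensor and return them as views into the shared result buffer.
static void build_merged_qk(struct ggml_context * ctx,
                            struct ggml_tensor  * wqk,   // [n_embd, n_q_rows + n_k_rows]
                            struct ggml_tensor  * cur,   // [n_embd, n_tokens]
                            int64_t n_q_rows,
                            int64_t n_k_rows,
                            int64_t n_tokens,
                            struct ggml_tensor ** q_out,
                            struct ggml_tensor ** k_out) {
    // One matmul (one kernel launch) instead of separate Q and K matmuls.
    // Result shape: [n_q_rows + n_k_rows, n_tokens].
    struct ggml_tensor * qk = ggml_mul_mat(ctx, wqk, cur);

    // Q and K are views into the same buffer: no data is copied, only the
    // sizes, row stride and byte offset differ.
    *q_out = ggml_view_2d(ctx, qk, n_q_rows, n_tokens, qk->nb[1], 0);
    *k_out = ggml_view_2d(ctx, qk, n_k_rows, n_tokens, qk->nb[1],
                          n_q_rows * ggml_element_size(qk));
}
```

Getting the row counts and byte offsets of these views right is exactly where, as noted above, it is easy to make a mistake; the payoff with full GPU offload is one kernel launch per layer instead of two.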