Replies: 2 comments 12 replies
-
Since you are tagging me: I did look at the more general implementation for mapping MoE to regular matrix multiplications in the PR where I commented, but I did not look at any MoE-specific CUDA code for matrix-vector multiplication, nor was I aware that this repository had such an optimization. It's just the natural way of writing a fused kernel.
-
I read this, and the warning in the README.md about incompatible GGUFs is quite unfortunate. I don't mind spending the time to create my own quants for this fork in the pursuit of maximum performance; I am a total noob at creating quants, however. I am building an EPYC box with 768 GB of RAM and 96 GB of VRAM (2x48). Will I be able to use scripts to conveniently convert releases such as DeepSeek V3/R1 or the curious tngtech/DeepSeek-R1T-Chimera model from safetensors? (I sketched below the workflow I imagine.) Do you plan to support the incompatible mainline GGUF files? Can I assume that GGUFs created before mid-April or so will be compatible? (Downloading these larger models represents a considerable cost.) Thank you for creating this work and making it available. You are a true wizard.
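For context, this is roughly the conversion flow I have in mind, assuming this fork keeps mainline's `convert_hf_to_gguf.py` script and the `llama-quantize` tool (paths and the choice of Q4_K_M are just placeholders; corrections welcome):

```bash
# Sketch of the safetensors -> GGUF -> quantized GGUF flow (illustrative, not a verified recipe).

# 1. Convert the original safetensors release to a high-precision GGUF.
python convert_hf_to_gguf.py /models/DeepSeek-R1 \
    --outtype bf16 \
    --outfile /models/DeepSeek-R1-bf16.gguf

# 2. Quantize the converted GGUF to the desired quantization type.
./llama-quantize /models/DeepSeek-R1-bf16.gguf /models/DeepSeek-R1-Q4_K_M.gguf Q4_K_M
```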
-
Intro
After several attempts, they have added MLA for DeepSeek models in mainline `llama.cpp` via this PR, and I was curious to see how it performs. They have of course made it maximally painful - one needs to re-download and re-convert the model to be able to take advantage of the MLA feature. Fortunately for me, on my hardware I can only run DeepSeek-Lite, i.e., a 32 GB download, so not too bad (but in comparison, `ik_llama.cpp` allows usage of MLA with an original DeepSeek GGUF, as the tensors necessary for MLA get created on the fly). Anyway, I'm on a 300 Mb/s connection, so 15 minutes later I'm up and running.

What is the TL;DR? As the title already said - not all MLAs are born equal.
Setup

I'll be using a `Q4_0`-quantized DeepSeek-Lite model for all comparisons. `Q4_0` is the fastest quantization type in mainline due to the extraordinary amount of attention it receives. GPU performance measurements are done on an RTX-4080 GPU. CPU performance is measured on a Ryzen-7950X CPU (the RTX-4080 is in the Ryzen-7950X rig).
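All numbers below come from sweep-style runs that measure PP and TG at increasing KV cache fill levels. A rough sketch of the kind of command that produces this sort of data on the `ik_llama.cpp` side is shown here; the tool name and flags follow the fork's conventions at the time of writing and are meant as illustration rather than a copy-paste recipe:

```bash
# Sketch only: a sweep-style benchmark that reports PP/TG as a function of N_KV.
#   -c 32768   : grow the KV cache up to 32k tokens during the sweep
#   -ub 1024   : u-batch size of 1024 (used for the CUDA PP graph)
#   -ngl 100   : full GPU offload for the CUDA runs (drop it for CPU-only runs)
#   -mla 2 -fa : FlashMLA-2 on CUDA; the CPU runs use -mla 3 -fa (FlashMLA-3)
./llama-sweep-bench -m deepseek-lite-q4_0.gguf -c 32768 -ub 1024 -ngl 100 -mla 2 -fa
```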
CUDA performance

I was most curious about CUDA performance. Why? Because in this PR @JohannesGaessler has completely independently, without ever looking at `ik_llama.cpp`, discovered this optimization in `ik_llama.cpp`, so I wanted to know how the two implementations compare. Mainline does not support Flash Attention (FA) for DeepSeek on CUDA (due to the K- and V-head sizes being different). `ik_llama.cpp` uses FlashMLA-2.

This graph shows CUDA TG performance as a function of `N_KV`, the number of tokens in the KV cache. For `N_KV = 0`, mainline is now about 15% faster than `ik_llama.cpp`. This could be because @JohannesGaessler is a much better GPU programmer than I am and has achieved a more optimized implementation. However, looking at the comments and performance measurements in the PR, a more likely explanation is the enabling of CUDA graphs for TG with MoE models in this PR (CUDA graphs are disabled in `ik_llama.cpp` for MoE models). But as soon as there are some tokens in the KV cache (the normal use-case scenario), `ik_llama.cpp` becomes faster. The performance gap grows with increasing KV cache size and reaches 1.8X at 32k tokens.

The next graph compares CUDA PP performance as a function of `N_KV` for a `u_batch` size of 1024 tokens. The performance optimizations in `ik_llama.cpp` have not been independently discovered yet, so here the performance gap is 1.85X for small `N_KV`, increasing to 2.5X at 32k tokens.

llama.cpp CUDA performance data
ik_llama.cpp CUDA performance data
Perhaps also of interest is the extra VRAM required. For DeepSeek-Lite at 32k tokens, mainline needs a KV cache of 1836 MiB along with a CUDA compute buffer of 2280 MiB, for a total of 4116 MiB. In comparison, `ik_llama.cpp` uses 972 MiB of K-cache (there is no V-cache required, as it gets computed from the K-cache at the expense of some performance reduction) plus 936 MiB of CUDA compute buffer, for a total of 1908 MiB, i.e., 2.15X less.
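As a back-of-the-envelope check of the 972 MiB figure (assuming an f16 K-cache and DeepSeek-Lite's 27 layers with a kv_lora_rank of 512 plus a 64-dimensional RoPE part, i.e., 576 cached values per layer per token):

$$
576 \times 2\,\text{bytes} \times 27\,\text{layers} \times 32768\,\text{tokens} = 1{,}019{,}215{,}872\ \text{bytes} = 972\ \text{MiB}.
$$

The mainline figure of 1836 MiB is consistent with additionally storing a 512-element V view per layer per token, i.e., $(576+512) \times 2 \times 27 \times 32768$ bytes.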
CPU performance

Mainline does support FA on the CPU, but performance is quite bad, so I'm including mainline results with and without FA enabled. When FA is enabled, the KV cache is quantized with `Q8_0`. The `ik_llama.cpp` calculations are with FlashMLA-3, which is the best option for CPU inference.
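Concretely, the three CPU configurations correspond to option combinations roughly like the following (shown with `llama-cli` just to make the options explicit; flag names are the usual llama.cpp / ik_llama.cpp ones and may differ in other builds):

```bash
MODEL=deepseek-lite-q4_0.gguf

# 1) mainline llama.cpp, FA disabled (the default)
./llama-cli -m "$MODEL" -t 16 -n 128 -p "Hello"

# 2) mainline llama.cpp, FA enabled with the KV cache quantized to Q8_0
./llama-cli -m "$MODEL" -t 16 -n 128 -p "Hello" -fa -ctk q8_0 -ctv q8_0

# 3) ik_llama.cpp, FlashMLA-3 (the -mla option only exists in the fork)
./llama-cli -m "$MODEL" -t 16 -n 128 -p "Hello" -mla 3 -fa
```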
The following graph shows CPU TG performance as a function of `N_KV`. Here mainline with FA is faster by about 3% when the KV cache is empty. This is an artifact of the way FA is implemented: the minimum size of the u-batch created is 256 tokens, and when there is no actual context in the KV cache almost all of those tokens are masked away. Mainline's FA implementation checks for that and skips the `K*Q` dot product for such tokens. I have not bothered adding this optimization to `ik_llama.cpp`, as it is never useful in actual usage (when the KV cache is not empty). With any context, `ik_llama.cpp` is faster. The performance gap increases with the number of tokens in the KV cache and reaches 39% (no FA) or 70% (FA) at 16k tokens.

The next graph shows PP performance as a function of `N_KV`. Here the performance gap to mainline without FA is 2.87X for zero context, increasing to 4.5X at 16k tokens. When FA is enabled in mainline, it is 10X slower at 16k tokens.

llama.cpp CPU performance data (FA disabled)
llama.cpp CPU performance data (FA enabled)
ik_llama.cpp CPU performance data
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.739 | 693.23 | 3.836 | 33.37 |
| 512 | 128 | 512 | 0.769 | 665.76 | 3.931 | 32.56 |
| 512 | 128 | 1024 | 0.817 | 626.90 | 3.958 | 32.34 |
| 512 | 128 | 1536 | 0.869 | 589.09 | 3.991 | 32.07 |
| 512 | 128 | 2048 | 0.912 | 561.30 | 4.037 | 31.71 |
| 512 | 128 | 2560 | 0.967 | 529.68 | 4.087 | 31.32 |
| 512 | 128 | 3072 | 1.020 | 502.07 | 4.146 | 30.87 |
| 512 | 128 | 3584 | 1.087 | 470.96 | 4.182 | 30.61 |
| 512 | 128 | 4096 | 1.132 | 452.35 | 4.235 | 30.22 |
| 512 | 128 | 4608 | 1.189 | 430.73 | 4.290 | 29.84 |
| 512 | 128 | 5120 | 1.247 | 410.52 | 4.351 | 29.42 |
| 512 | 128 | 5632 | 1.304 | 392.59 | 4.426 | 28.92 |
| 512 | 128 | 6144 | 1.363 | 375.64 | 4.508 | 28.39 |
| 512 | 128 | 6656 | 1.420 | 360.52 | 4.584 | 27.92 |
| 512 | 128 | 7168 | 1.485 | 344.78 | 4.665 | 27.44 |
| 512 | 128 | 7680 | 1.542 | 332.04 | 4.751 | 26.94 |
| 512 | 128 | 8192 | 1.605 | 318.99 | 4.821 | 26.55 |
| 512 | 128 | 8704 | 1.669 | 306.76 | 4.736 | 27.02 |
| 512 | 128 | 9216 | 1.736 | 294.93 | 4.773 | 26.82 |
| 512 | 128 | 9728 | 1.802 | 284.05 | 4.832 | 26.49 |
| 512 | 128 | 10240 | 1.865 | 274.57 | 4.889 | 26.18 |
| 512 | 128 | 10752 | 1.927 | 265.65 | 4.949 | 25.87 |
| 512 | 128 | 11264 | 1.994 | 256.77 | 5.015 | 25.53 |
| 512 | 128 | 11776 | 2.063 | 248.24 | 5.074 | 25.23 |
| 512 | 128 | 12288 | 2.127 | 240.67 | 5.139 | 24.91 |
| 512 | 128 | 12800 | 2.194 | 233.39 | 5.207 | 24.58 |
| 512 | 128 | 13312 | 2.262 | 226.33 | 5.272 | 24.28 |
| 512 | 128 | 13824 | 2.326 | 220.10 | 5.342 | 23.96 |
| 512 | 128 | 14336 | 2.389 | 214.35 | 5.399 | 23.71 |
| 512 | 128 | 14848 | 2.456 | 208.43 | 5.461 | 23.44 |
| 512 | 128 | 15360 | 2.522 | 203.02 | 5.511 | 23.23 |
| 512 | 128 | 15872 | 2.590 | 197.72 | 5.573 | 22.97 |