Commit 30536ee
FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3
* Much better
The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite).
* Sadly, the previous commit was wrong
* Finalizing
* Also add these
* Minor
* Minor tweak
---------
Co-authored-by: Iwan Kawrakow <[email protected]>
1 parent 17c6fc6 · commit 30536ee
File tree (5 files changed: +1798 −45 lines)
- ggml/src
  - ggml-cuda