
Commit 1a1c991

[README] Add cuffpa-py library News🔥 (#214)
* Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md
1 parent 82f1d04 commit 1a1c991

File tree: 1 file changed (+27 −6 lines)

README.md

Lines changed: 27 additions & 6 deletions
@@ -12,18 +12,35 @@
 <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg >
 </div>
 
-<div id="contents"></div>
 
 
 📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥(Easy -> Hard++)](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/VLM/CV/CUDA/CuTe🔥](#my-blogs-part-1) blogs, [📖toy-hgemm⚡️⚡️](./kernels/hgemm) which can achieve `98%~100%` performance of **cuBLAS**, and [📖flash-attention-mma⚡️⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
 
-<div id="hgemm-sgemm"></div>
+## 📖 News 🔥🔥
+<div id="news"></div>
+
+- [2025-01-08]: [📚Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[cuffpa-py](https://github.com/DefTruth/cuffpa-py): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, ~1.5x🎉faster vs SDPA EA.
+- [2024-12-02]: HGEMM MMA kernels have been refactored into 🤖[hgemm-tensorcores-mma](https://github.com/DefTruth/hgemm-tensorcores-mma): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.
+
+## 📖 Contents👇👀
+
+<div id="contents"></div>
+
+- [📖 HGEMM Benchmark](#hgemm-mma-bench)
+- [📖 FA2-MMA Benchmark](#fa-mma-bench)
+- [📖 150+ CUDA Kernels](#cuda-kernel)
+- [📖 100+ Blogs(LLM/CUDA)](#my-blogs-part-1)
+
+## 📖 HGEMM-MMA Benchmark 🎉🎉
+
+<div id="hgemm-mma-bench"></div>
 
 <div align='center'>
 <img src='https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99' height="170px" width="270px">
 <img src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height="170px" width="270px">
 <img src='https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078' height="170px" width="270px">
 </div>
 
+
 Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) or [hgemm-tensorcores-mma⚡️⚡️](https://github.com/DefTruth/hgemm-tensorcores-mma) repo for more details.
 
 ![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)
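
For context on how such a comparison is typically measured, here is a minimal sketch of timing the cuBLAS FP16 Tensor Core baseline (not taken from `./kernels/hgemm`; the matrix sizes, iteration counts, and CUDA-event timing loop are illustrative assumptions):

```cuda
// Minimal sketch (not from this repo): timing the cuBLAS FP16 Tensor Core
// baseline that a custom HGEMM kernel is compared against. M/N/K, the
// warmup/iteration counts and the column-major layout are assumptions.
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
  const int M = 4096, N = 4096, K = 4096;
  half *A, *B, *C;
  cudaMalloc(&A, sizeof(half) * M * K);
  cudaMalloc(&B, sizeof(half) * K * N);
  cudaMalloc(&C, sizeof(half) * M * N);
  cudaMemset(A, 0, sizeof(half) * M * K);
  cudaMemset(B, 0, sizeof(half) * K * N);

  cublasHandle_t handle;
  cublasCreate(&handle);
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

  // cuBLAS is column-major: C(MxN) = alpha * A(MxK) * B(KxN) + beta * C.
  auto hgemm = [&]() {
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                 &beta,  C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
  };

  for (int i = 0; i < 10; ++i) hgemm();  // warmup

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int iters = 100;
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) hgemm();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double tflops = 2.0 * M * N * K * iters / (ms * 1e-3) / 1e12;
  printf("cuBLAS HGEMM: %.3f ms/iter, %.2f TFLOPS\n", ms / iters, tflops);

  cublasDestroy(handle);
  cudaFree(A); cudaFree(B); cudaFree(C);
  return 0;
}
```

A custom WMMA/MMA/CuTe kernel would be timed with the same event loop and its achieved TFLOPS compared against this baseline.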
@@ -40,6 +57,9 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
 |Collective Store (Shfl)|Row Major (NN)|Col Major (TN)| SGEMM FP32/TF32|
 |✔️|✔️|✔️|✔️|
 
+## 📖 FA2-MMA Benchmark 🎉🎉
+
+<div id="fa-mma-bench"></div>
 
 I have also implemented **FlashAttention-2** using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, **Fully Shared QKV SMEM**, **Prefetch Q s2r**, **Prefetch K/V g2s**, **QKV Fine-grained Tiling**, Collective Store, etc. Please refer to [flash-attention-mma⚡️⚡️](./kernels/flash-attn) for more details.
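
As a rough illustration of what "pure MMA PTX" means here, a single m16n8k16 Tensor Core MMA with f16 inputs and f32 accumulators looks like the sketch below (not copied from `kernels/flash-attn`; the wrapper name and fragment-register naming are assumptions):

```cuda
// Illustrative sketch of the raw "MMA PTX" building block (not copied from
// kernels/flash-attn): one m16n8k16 Tensor Core MMA, f16 inputs with f32
// accumulators, requires sm_80+. The A fragment occupies 4 packed-half
// 32-bit registers per thread, the B fragment 2, the C/D accumulators 4 f32.
#include <cstdint>
#include <cuda_fp16.h>

__device__ __forceinline__ void hmma_m16n8k16_f16f16f32(
    float &d0, float &d1, float &d2, float &d3,
    uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3,
    uint32_t b0, uint32_t b1,
    float c0, float c1, float c2, float c3) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
        "r"(b0), "r"(b1),
        "f"(c0), "f"(c1), "f"(c2), "f"(c3));
}
```

Kernels of this kind typically chain many such MMAs over the Q·Kᵀ and P·V tiles, staging the fragments through shared memory, which is what the Multi-Stages and Prefetch K/V g2s features listed above refer to.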

@@ -131,7 +151,7 @@ __global__ void // Q, K, V, O -> [B, H, N, D]
 flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);
 ```
 
-- 📚 Split Q + Fully QKV Fine-grained Tiling (**O(Brx16)~O(1) SRAM** vs FA2 **O(4xBrxd) SRAM**)
+- 📚 Split Q + Fully QKV Fine-grained Tiling (**O(2xBrx16)~O(1) SRAM** vs FA2 **O(4xBrxd) SRAM**)
 
 <div id="mma-tiling-qkv"></div>
 
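
To put the SRAM comparison in the bullet above in concrete terms, a back-of-the-envelope calculation (`Br = 64`, `d = 512`, and 2-byte half elements are assumed values, not constants from the kernels):

```cuda
// Back-of-the-envelope SMEM estimate for the comparison above (illustrative;
// Br = 64, d = 512 and 2-byte half elements are assumed, not constants taken
// from the kernels). FA2-style tiling keeps four [Br, d] tiles resident,
// while fully fine-grained QKV tiling keeps only two [Br, 16] MMA slices.
#include <cstdio>

int main() {
  constexpr long long Br = 64;    // assumed row-tile size
  constexpr long long d = 512;    // assumed head dim (the headdim > 256 case)
  constexpr long long elem = 2;   // bytes per half

  constexpr long long fa2_bytes    = 4 * Br * d * elem;   // O(4 x Br x d)
  constexpr long long tiling_bytes = 2 * Br * 16 * elem;  // O(2 x Br x 16)

  printf("FA2-style  SMEM per block: %lld KB\n", fa2_bytes / 1024);     // 256 KB
  printf("QKV-tiling SMEM per step : %lld KB\n", tiling_bytes / 1024);  // 4 KB
  return 0;
}
```

With per-block shared memory limits of roughly 100 KB (Ada) to 227 KB (Hopper), a 4×Br×d footprint stops fitting once headdim grows well past 256, which is the regime the fine-grained QKV tiling (and cuffpa-py) targets.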
@@ -142,7 +162,6 @@ flash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half*
 __global__ void // Q, K, V, O -> [B, H, N, D]
 flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
 ```
-
 ## ©️Citations🎉🎉
 
 ```BibTeX
@@ -538,7 +557,7 @@ GNU General Public License v3.0
 
 ## 🎉Contribute ([©️back👆🏻](#contents))
 
-<div id="Contribute"></div>
+<div id="contribute"></div>
 
 How to contribute? Star this repo or check [🌤🌤CONTRIBUTE🎉🎉](https://github.com/DefTruth/CUDA-Learn-Notes/issues/50).
 
@@ -552,7 +571,9 @@ How to contribute? Star this repo or check [🌤🌤CONTRIBUTE🎉🎉](https://
 </a>
 </div>
 
-## 📖 References ([©️back👆🏻](#contents))
+## 📖 References ([©️back👆🏻](#contents))
+
+<div id="ref"></div>
+
 - [flash-attention-minimal](https://github.com/tspeterkim/flash-attention-minimal)
 - [tiny-flash-attention](https://github.com/66RING/tiny-flash-attention)
 - [cute-gemm](https://github.com/reed-lau/cute-gemm)
