<!---
<img src='https://github.com/user-attachments/assets/9306862b-2a30-4a87-bb33-0fde9e9d7cea' width=250 >
<a href="#cuda-kernel">📚200+ CUDA Kernels</a> | <a href="#my-blogs-part-1"> 📚100+ LLM/CUDA Blogs</a> | <a href="#HGEMM-bench"> ⚡️HGEMM MMA</a> | <a href="#fa-mma-bench"> ⚡️FA-2 MMA </a> <p>
<img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' >

--->

## 📖 News 🔥🔥
<div id="news"></div>

- [2025-01-08]: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn.git): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-4090).

<div align='center'>
  <img src='https://github.com/user-attachments/assets/cba2edce-ac0d-412e-823c-7eea2cc63f83' height="170px" width="270px" >
  <img src='https://github.com/user-attachments/assets/447e2937-f7c8-47c8-8550-8c0c71b910e6' height="170px" width="270px" >
  <img src='https://github.com/user-attachments/assets/65a8d564-8fa7-4d66-86b9-e238feb86143' height="170px" width="270px" >
</div>

- [2024-12-02]: HGEMM MMA kernels have been refactored into 🤖[HGEMM](https://github.com/xlite-dev/HGEMM.git): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieving peak⚡️ performance.

<div align='center'>
  <img src='https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99' height="170px" width="270px" >
</div>

## 📖 Contents
<div id="contents"></div>
<!---
- [📚 CUDA/Tensor Cores](#HGEMM-bench)
- [📚 Tile Block(Br, Bc)](#HGEMM-bench)
- [📚 Tile MMAs/Warps](#HGEMM-bench)
- [📚 Pack LDST(128 bits)](#HGEMM-bench)
- [📚 Multi Stages(2~4)](#HGEMM-bench)
- [📚 Block/Warp Swizzle](#HGEMM-bench)
- [📚 SMEM Swizzle](#HGEMM-bench)
- [📚 Register Double Buffers](#HGEMM-bench)
- [📚 Collective Store(Shfl)](#HGEMM-bench)
- [📚 Layout NN/TN](#HGEMM-bench)
- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
  - [📚 Split KV (Basic, FA-1)](#mma-split-kv)
  - [📚 Split Q (Faster, FA-2)](#mma-split-q)
  - [📚 Split Q + Shared KV](#mma-share-kv)
  - [📚 Split Q + Shared QKV](#mma-share-qkv)
  - [📚 Split Q + QK Tiling](#mma-tiling-qk)
  - [📚 Split Q + QKV Tiling](#mma-tiling-qkv)
- [📖 How to Contribute? 👀👇](#contribute)
- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
  - [📚 Easy ⭐️](#cuda-kernel-easy-medium)
  - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)
  - [📚 CuTe Series: In-Depth Explanations and Practice](#other-blogs)
  - [📚 GPU Instruction Set Architecture in Depth](#other-blogs)
  - [📚 GPU Communication Architecture in Depth](#other-blogs)
-->

- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
- [📖 100+ High-Performance Computing Articles 💡💡](#my-blogs-part-1)
- [📖 How to Contribute 👀👇](#contribute)

## 📖 HGEMM Benchmark 🎉🎉

<div id="HGEMM-bench"></div>

Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` implemented in this repo (`blue`🔵) can achieve `98%~100%` of cuBLAS's (`orange`🟠) performance. Please check the [toy-hgemm library⚡️⚡️](./kernels/hgemm) or the [HGEMM⚡️⚡️](https://github.com/xlite-dev/HGEMM) repo for more details.

![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)

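For readers new to Tensor Core programming, here is the bare-bones `nvcuda::wmma` pattern that the optimized WMMA/MMA/CuTe kernels above build on. This is a minimal illustrative sketch, not one of the benchmarked kernels: the kernel name and launch configuration are ad hoc, it assumes row-major A/B with M, N, K being multiples of 16, uses one warp per 16x16 output tile, and omits the shared-memory staging, multi-stage pipelining, swizzling and register double buffering listed in the contents above.

```cuda
// Minimal one-warp-per-tile WMMA HGEMM sketch (FP16 inputs, FP32 accumulate).
// Illustrative only: no SMEM staging, no pipelining, no swizzle.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_hgemm_naive(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
  // One warp (32 threads) per block; each warp owns one 16x16 tile of C.
  const int warp_m = blockIdx.y;  // tile row index into C
  const int warp_n = blockIdx.x;  // tile col index into C

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);

  // March over K in 16-wide steps; each iteration issues one Tensor Core MMA.
  for (int k = 0; k < K; k += 16) {
    const half* a_tile = A + (warp_m * 16) * K + k;  // A is MxK, row-major
    const half* b_tile = B + k * N + warp_n * 16;    // B is KxN, row-major
    wmma::load_matrix_sync(a_frag, a_tile, K);       // leading dimension = K
    wmma::load_matrix_sync(b_frag, b_tile, N);       // leading dimension = N
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }

  float* c_tile = C + (warp_m * 16) * N + warp_n * 16;  // C is MxN, row-major
  wmma::store_matrix_sync(c_tile, c_frag, N, wmma::mem_row_major);
}

// Launch sketch (requires sm_70+):
//   dim3 grid(N / 16, M / 16);
//   wmma_hgemm_naive<<<grid, 32>>>(dA, dB, dC, M, N, K);
```

Closing the gap from this naive kernel to near-cuBLAS throughput is exactly what the techniques listed in the contents above (tile blocking, multi-stage SMEM pipelines, block/SMEM swizzle, 128-bit packed LDST, register double buffers, collective store) are for.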
I have also implemented **FlashAttention-2** using pure MMA PTX instructions, which supports features such as the following:

|📚Feature|📚Feature|📚Feature|📚Feature|
|:---:|:---:|:---:|:---:|
|✔️Tile Warps|✔️Multi Stages(1/2)|✔️Collective Store(Shfl)|✔️**Split KV/Q**|
|✔️**Shared QKV** SMEM|✔️**Prefetch Q** s2r|✔️**Prefetch KV** g2s|✔️**QKV Fine-grained Tiling**|

Currently, for small-scale attention `(B<=4, H<=48, SeqLen<=8192, D<=64)`, it can run faster than FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, the [📚 Split Q + Fully Shared QKV SMEM](#mma-share-qkv) method achieves **55 TFLOPS (D=64)**, almost **~1.5x**🎉 faster than FA2. On NVIDIA L20, the 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn) method achieves **104 TFLOPS (D=512)**, almost **~1.8x**🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, a performance gap remains. Stay tuned for updates~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32, 👇Benchmark)

|Algorithm|(B,H,N,D)|RTX 3080 Laptop|L20|RTX 4090|
|:---:|:---:|:---:|:---:|:---:|
|FlashAttention-2|(1,48,8192,64)|37 TFLOPS|109 TFLOPS|163 TFLOPS|
|share-qkv+stage2|(1,48,8192,64)|**48 TFLOPS**|107 TFLOPS|**224 TFLOPS**|
|SDPA(EFFICIENT ATTENTION)|(1,48,8192,512)|16 TFLOPS|58 TFLOPS|85 TFLOPS|
|🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn)|(1,48,8192,512)|**39 TFLOPS**|**104 TFLOPS**|**200 TFLOPS**|
|Precision Errors vs FA2/SDPA|/|max: < ~1e-3|min: ~0.0|mean: < ~1e-5|

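The last row above reports element-wise error against the FA2/SDPA reference outputs. A check of that kind reduces to comparing the two output tensors on the host; the sketch below is a hypothetical helper (names and types are illustrative, not the repo's actual test code) showing how the max/min/mean columns could be computed.

```cuda
// Hypothetical host-side error report between a custom kernel's output and a
// reference (e.g., FA2/SDPA), with both results already copied back to host
// memory and converted to float.
#include <cmath>
#include <cstddef>
#include <cstdio>

void report_errors(const float* out, const float* ref, size_t n) {
  double max_err = 0.0, min_err = 1e30, sum_err = 0.0;
  for (size_t i = 0; i < n; ++i) {
    const double e = std::fabs(double(out[i]) - double(ref[i]));
    max_err = std::fmax(max_err, e);
    min_err = std::fmin(min_err, e);
    sum_err += e;
  }
  // Matches the max / min / mean columns reported in the table above.
  std::printf("max: %.3e, min: %.3e, mean: %.3e\n", max_err, min_err, sum_err / n);
}
```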
The `Split KV` and `Split Q` implementations have been carried out in [flash-attention-mma⚡️⚡️](./kernels/flash-attn) for performance comparison. The `Split KV` method, which splits all of Q, K and V across MMA (Warps), is slower than the `Split Q` method, which splits only Q across MMA (Warps) while every warp keeps access to the full K/V tile.
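To make the difference concrete, the toy program below prints how the two schemes would divide one (Br x Bc) score tile among the warps of a thread block. The tile sizes and warp count are assumptions chosen for illustration, not the exact parameters of the kernels in this repo.

```cuda
// Illustrative warp-partitioning comparison for one Br x Bc attention tile.
// Split KV: warps split the KV (Bc) dimension, so each warp produces partial
//           online-softmax/output state that must later be merged across warps.
// Split Q : warps split the Q (Br) dimension and each warp sweeps the whole KV
//           tile, so warps own disjoint output rows.
#include <cstdio>

constexpr int kBr = 64;    // Q rows per block tile (assumed)
constexpr int kBc = 64;    // KV cols per block tile (assumed)
constexpr int kWarps = 4;  // MMA warps per block (assumed)

int main() {
  std::printf("Split KV: each warp sees all %d Q rows but only %d KV cols\n",
              kBr, kBc / kWarps);
  for (int w = 0; w < kWarps; ++w)
    std::printf("  warp %d -> Q rows [0, %d), KV cols [%d, %d)\n",
                w, kBr, w * (kBc / kWarps), (w + 1) * (kBc / kWarps));

  std::printf("Split Q : each warp owns %d Q rows and sweeps all %d KV cols\n",
              kBr / kWarps, kBc);
  for (int w = 0; w < kWarps; ++w)
    std::printf("  warp %d -> Q rows [%d, %d), KV cols [0, %d)\n",
                w, w * (kBr / kWarps), (w + 1) * (kBr / kWarps), kBc);
  return 0;
}
```

Because the `Split Q` warps own disjoint output rows, they never need to combine partial softmax/output state with each other, which is the usual explanation for why the FA-2-style `Split Q` scheme outperforms `Split KV` here.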
```cuda
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
```

💡NOTE: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn).

## ©️Citations🎉🎉

The kernels listed here will guide you through a step-by-step progression, ranging from easy to very challenging topics.

|📖 CUDA Kernel|📖 Elem DType|📖 Acc DType|📖 Docs|📖 Level|
|:---|:---|:---|:---|:---|
|✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|
|⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https://github.com/xlite-dev/ffpa-attn)|⭐️⭐️⭐️⭐️|

💡NOTE: 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉 faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-4090).
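To see why fine-grained tiling over the headdim matters for the O(1) SRAM claim, here is a rough shared-memory estimate under assumed tile shapes (Br = Bc = 64, fp16 elements, a per-step MMA-K width of 16; illustrative numbers only, not FFPA's exact parameters). A coarse FA-2-style scheme stages full Q/K/V tiles whose width is the headdim d, so its SMEM footprint grows linearly with d, whereas tiling over d keeps only fixed-width slices resident:

$$
\text{SMEM}_{\text{coarse}} \approx (B_r + 2B_c)\cdot d \cdot 2\,\text{B} = 192 \cdot 512 \cdot 2\,\text{B} = 192\,\text{KB} \quad (d = 512)
$$

$$
\text{SMEM}_{\text{fine-grained}} \approx (B_r + 2B_c)\cdot k \cdot 2\,\text{B} = 192 \cdot 16 \cdot 2\,\text{B} = 6\,\text{KB} \quad (\text{independent of } d)
$$

The first figure is more than the shared memory available per block on the GPUs benchmarked above, which is why coarse tiling cannot reach headdim 512, while the fine-grained footprint stays constant as d grows, matching the O(1) SRAM behavior described in the note.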

### 📚 Triton Kernel (OpenAI Triton) ([©️back👆🏻](#cuda-kernel))
