<!---
<img src='https://github.com/user-attachments/assets/9306862b-2a30-4a87-bb33-0fde9e9d7cea' width=250 >
- <a href="#cuda-kernel">📚200+ CUDA Kernels</a> | <a href="#my-blogs-part-1"> 📚100+ LLM/CUDA Blogs</a> | <a href="#hgemm-tensorcores-mma-bench"> ⚡️HGEMM MMA</a> | <a href="#fa-mma-bench"> ⚡️FA-2 MMA </a> <p>
+ <a href="#cuda-kernel">📚200+ CUDA Kernels</a> | <a href="#my-blogs-part-1"> 📚100+ LLM/CUDA Blogs</a> | <a href="#HGEMM-bench"> ⚡️HGEMM MMA</a> | <a href="#fa-mma-bench"> ⚡️FA-2 MMA </a> <p>
<img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' >

--->
## 📖 News 🔥🔥
<div id="news"></div>

- - [2025-01-08]: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn-mma](https://github.com/xlite-dev/ffpa-attn-mma.git): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
+ - [2025-01-08]: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn.git): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-4090).

<div align='center'>
<img src='https://github.com/user-attachments/assets/cba2edce-ac0d-412e-823c-7eea2cc63f83' height="170px" width="270px">
<img src='https://github.com/user-attachments/assets/447e2937-f7c8-47c8-8550-8c0c71b910e6' height="170px" width="270px">
<img src='https://github.com/user-attachments/assets/65a8d564-8fa7-4d66-86b9-e238feb86143' height="170px" width="270px">
</div>

- - [2024-12-02]: HGEMM MMA kernels have been refactored into 🤖[hgemm-tensorcores-mma](https://github.com/xlite-dev/hgemm-tensorcores-mma.git): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, and achieve peak⚡️ performance.
+ - [2024-12-02]: HGEMM MMA kernels have been refactored into 🤖[HGEMM](https://github.com/xlite-dev/HGEMM.git): ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, and achieve peak⚡️ performance.

<div align='center'>
<img src='https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99' height="170px" width="270px">
</div>

## 📖 Contents
<div id="contents"></div>
-
- - [📖 How to Contribute? 👀👇](#contribute)
- - [📖 HGEMM-MMA 🎉🎉](#hgemm-tensorcores-mma-bench)
- - [📚 CUDA/Tensor Cores](#hgemm-tensorcores-mma-bench)
- - [📚 Tile Block(Br, Bc)](#hgemm-tensorcores-mma-bench)
- - [📚 Tile MMAs/Warps](#hgemm-tensorcores-mma-bench)
- - [📚 Pack LDST(128 bits)](#hgemm-tensorcores-mma-bench)
- - [📚 Multi Stages(2~4)](#hgemm-tensorcores-mma-bench)
- - [📚 Block/Warp Swizzle](#hgemm-tensorcores-mma-bench)
- - [📚 SMEM Swizzle](#hgemm-tensorcores-mma-bench)
- - [📚 Register Double Buffers](#hgemm-tensorcores-mma-bench)
- - [📚 Collective Store(Shfl)](#hgemm-tensorcores-mma-bench)
- - [📚 Layout NN/TN](#hgemm-tensorcores-mma-bench)
+ <!---
+ - [📚 CUDA/Tensor Cores](#HGEMM-bench)
+ - [📚 Tile Block(Br, Bc)](#HGEMM-bench)
+ - [📚 Tile MMAs/Warps](#HGEMM-bench)
+ - [📚 Pack LDST(128 bits)](#HGEMM-bench)
+ - [📚 Multi Stages(2~4)](#HGEMM-bench)
+ - [📚 Block/Warp Swizzle](#HGEMM-bench)
+ - [📚 SMEM Swizzle](#HGEMM-bench)
+ - [📚 Register Double Buffers](#HGEMM-bench)
+ - [📚 Collective Store(Shfl)](#HGEMM-bench)
+ - [📚 Layout NN/TN](#HGEMM-bench)
- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
- [📚 Split KV (Basic, FA-1)](#mma-split-kv)
- [📚 Split Q (Faster, FA-2)](#mma-split-q)
- [📚 Split Q + Shared KV](#mma-share-kv)
- [📚 Split Q + Shared QKV](#mma-share-qkv)
- [📚 Split Q + QK Tiling](#mma-tiling-qk)
- [📚 Split Q + QKV Tiling](#mma-tiling-qkv)
+ - [📖 How to Contribute? 👀👇](#contribute)
+ - [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
+ - [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
- [📚 Easy ⭐️](#cuda-kernel-easy-medium)
- [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)
- [📚 CuTe Series: Detailed Explanations and Practice](#other-blogs)
- [📚 GPU Instruction Set Architecture (ISA) Deep Dive](#other-blogs)
- [📚 GPU Communication Architecture Deep Dive](#other-blogs)
+ -->
+
+ - [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
+ - [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
+ - [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
+ - [📖 100+ High-Performance Computing Articles 💡💡](#my-blogs-part-1)
+ - [📖 How to Contribute 👀👇](#contribute)

## 📖 HGEMM Benchmark 🎉🎉

- <div id="hgemm-tensorcores-mma-bench"></div>
+ <div id="HGEMM-bench"></div>

- Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` kernels in this repo (`blue`🔵) can achieve `98%~100%` of cuBLAS's (`orange`🟠) performance. Please check the [toy-hgemm library⚡️⚡️](./kernels/hgemm) or the [hgemm-tensorcores-mma⚡️⚡️](https://github.com/xlite-dev/hgemm-tensorcores-mma) repo for more details.
+ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA/MMA/CuTe)` kernels in this repo (`blue`🔵) can achieve `98%~100%` of cuBLAS's (`orange`🟠) performance. Please check the [toy-hgemm library⚡️⚡️](./kernels/hgemm) or the [HGEMM⚡️⚡️](https://github.com/xlite-dev/HGEMM) repo for more details.

![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)

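For context, the cuBLAS side of this comparison boils down to a single half-precision `cublasGemmEx` call with Tensor Core math enabled. The sketch below is illustrative only and is not the repo's actual benchmark harness; the column-major layout, problem sizes, and omitted error handling are assumptions.

```c++
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Hypothetical cuBLAS FP16 Tensor Core baseline: C (MxN) = A (MxK) * B (KxN),
// all matrices column-major as cuBLAS expects.
void hgemm_cublas_baseline(cublasHandle_t handle,
                           const __half* A, const __half* B, __half* C,
                           int M, int N, int K) {
  const __half alpha = __float2half(1.0f);
  const __half beta  = __float2half(0.0f);
  // CUBLAS_COMPUTE_16F + CUBLAS_GEMM_DEFAULT_TENSOR_OP lets cuBLAS pick its
  // default FP16 Tensor Core algorithm, which the repo's kernels are measured against.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               M, N, K,
               &alpha,
               A, CUDA_R_16F, M,   // lda = M
               B, CUDA_R_16F, K,   // ldb = K
               &beta,
               C, CUDA_R_16F, M,   // ldc = M
               CUBLAS_COMPUTE_16F,
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

In the plots above, the `orange`🟠 curves correspond to this default cuBLAS Tensor Core path and the `blue`🔵 curves to the repo's WMMA/MMA/CuTe kernels.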
@@ -118,7 +126,7 @@ I have also implemented **FlashAttention-2** using pure MMA PTX instructions, wh
| ✔️Tile Warps| ✔️Multi Stages(1/2)| ✔️Collective Store(Shfl)| ✔️**Split KV/Q**|
| ✔️**Shared QKV** SMEM| ✔️**Prefetch Q** s2r| ✔️**Prefetch KV** g2s| ✔️**QKV Fine-grained Tiling**|

- Currently, for small-scale attention `(B<=4, H<=48, SeqLen<=8192, D<=64)`, it can run faster than FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, the [📚 Split Q + Fully Shared QKV SMEM](#mma-share-qkv) method can achieve **55 TFLOPS (D=64)**, almost **~1.5x**🎉 faster than FA2. On NVIDIA L20, the 🤖[ffpa-attn-mma](https://github.com/xlite-dev/ffpa-attn-mma) method can achieve **104 TFLOPS (D=512)**, almost **~1.8x**🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, there remains a performance gap. Stay tuned for updates~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32, 👇Benchmark)
+ Currently, for small-scale attention `(B<=4, H<=48, SeqLen<=8192, D<=64)`, it can run faster than FA2/SDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, the [📚 Split Q + Fully Shared QKV SMEM](#mma-share-qkv) method can achieve **55 TFLOPS (D=64)**, almost **~1.5x**🎉 faster than FA2. On NVIDIA L20, the 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn) method can achieve **104 TFLOPS (D=512)**, almost **~1.8x**🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, there remains a performance gap. Stay tuned for updates~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32, 👇Benchmark)

| Algorithm| (B,H,N,D) | RTX 3080 Laptop | L20 | RTX 4090 |
| :---:| :---:| :---:| :---:| :---:|
@@ -127,7 +135,7 @@ Currently, for small-scale attention `(B<=4, H <=48, SeqLen <= 8192, D <= 64)` i
| FlashAttention-2| (1,48,8192,64)| 37 TFLOPS| 109 TFLOPS| 163 TFLOPS|
| share-qkv+stage2| (1,48,8192,64)| **48 TFLOPS**| 107 TFLOPS| **224 TFLOPS**|
| SDPA(EFFICIENT ATTENTION)| (1,48,8192,512)| 16 TFLOPS| 58 TFLOPS| 85 TFLOPS|
- | 🤖[ffpa-attn-mma](https://github.com/xlite-dev/ffpa-attn-mma)| (1,48,8192,512)| **39 TFLOPS**| **104 TFLOPS**| **200 TFLOPS**|
+ | 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn)| (1,48,8192,512)| **39 TFLOPS**| **104 TFLOPS**| **200 TFLOPS**|
| Precision Errors vs FA2/SDPA| / | max: < ~1e-3 | min: ~0.0 | mean: < ~1e-5 |

The `Split KV` and `Split Q` implementations have been carried out in [flash-attention-mma⚡️⚡️](./kernels/flash-attn) for performance comparison. The `Split KV` method, which splits all of Q, K and V across MMA (Warps), is slower than the `Split Q` method, which splits only Q across MMA (Warps) while keeping K and V accessible to all MMA (Warps).
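To make the distinction concrete, the following host-side sketch prints how each warp's work would be partitioned under the two schemes; the tile sizes and warp count are illustrative assumptions, not the actual kernel configuration. In `Split KV`, warps share one Q tile but each own a slice of the KV tiles, so their partial online-softmax/output state must be reduced across warps afterwards; in `Split Q`, each warp owns its own Q rows and sweeps the full KV sequence, so no cross-warp reduction is needed.

```c++
#include <cstdio>

// Illustrative parameters (assumptions, not the repo's actual config).
constexpr int kSeqLen   = 512;  // N
constexpr int kBr       = 64;   // Q tile rows per block
constexpr int kBc       = 64;   // KV tile rows per step
constexpr int kNumWarps = 4;    // warps per block

int main() {
  const int num_kv_tiles = kSeqLen / kBc;

  // Split KV (FA-1 style): one Q tile, each warp owns a slice of the KV tiles,
  // so partial (m, l, O) state must later be reduced across warps.
  printf("Split KV:\n");
  for (int w = 0; w < kNumWarps; ++w) {
    const int per_warp = num_kv_tiles / kNumWarps;
    printf("  warp %d -> Q rows [0, %d), KV tiles [%d, %d)\n",
           w, kBr, w * per_warp, (w + 1) * per_warp);
  }

  // Split Q (FA-2 style): each warp owns a slice of the Q rows and iterates over
  // ALL KV tiles, so the online-softmax state never crosses warp boundaries.
  printf("Split Q:\n");
  for (int w = 0; w < kNumWarps; ++w) {
    const int rows_per_warp = kBr / kNumWarps;
    printf("  warp %d -> Q rows [%d, %d), KV tiles [0, %d)\n",
           w, w * rows_per_warp, (w + 1) * rows_per_warp, num_kv_tiles);
  }
  return 0;
}
```

Running this simply prints the two work assignments; the point is that `Split Q` removes the cross-warp reduction that `Split KV` requires, which is why it is the faster scheme here.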
@@ -208,7 +216,7 @@ __global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
```

- 💡NOTE: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn-mma](https://github.com/xlite-dev/ffpa-attn-mma).
+ 💡NOTE: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn).
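As a rough illustration of how a Split Q kernel with this `Q, K, V, O -> [B, H, N, D]` layout is usually driven from the host, here is a hypothetical launch sketch: one thread block per (Q tile, head, batch) triple. The tile size, warp count, shared-memory budget, and the remaining kernel arguments are assumptions; the repo's real launcher may differ.

```c++
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical Split Q launch: grid.x walks the Q tiles, grid.y the heads,
// grid.z the batch; each block holds kNumWarps warps cooperating on one Q tile.
void launch_flash_attn_split_q(half* Q, half* K, half* V, half* O,
                               int B, int H, int N, int D,
                               cudaStream_t stream) {
  constexpr int kBr = 64;       // Q rows per block (assumption)
  constexpr int kNumWarps = 4;  // warps per block (assumption)

  dim3 grid((N + kBr - 1) / kBr, H, B);
  dim3 block(32 * kNumWarps);
  size_t smem_bytes = 0;  // real kernels reserve SMEM here for the K/V (and possibly Q) tiles

  // The trailing kernel arguments are elided ("...") in the README signature above,
  // so the actual call is only sketched as a comment:
  // flash_attn_mma_stages_split_q_tiling_qkv_kernel<<<grid, block, smem_bytes, stream>>>(
  //     Q, K, V, O, ...);
}
```

The (Q-tile, head, batch) grid mapping mirrors the Split Q idea above: each block owns one Q tile and streams through every KV tile for that head.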

## ©️Citations🎉🎉

@@ -453,17 +461,17 @@ The kernels listed here will guide you through a step-by-step progression, rangi

| 📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
| :---| :---| :---| :---| :---|
- | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
- | ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/xlite-dev/ffpa-attn-mma/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn-mma)| ⭐️⭐️⭐️⭐️|
-
- 💡NOTE: 🤖[ffpa-attn-mma](https://github.com/xlite-dev/ffpa-attn-mma): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn-mma?tab=readme-ov-file#L1-bench-4090).
+ | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L1.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L1.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L2.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L2.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F16_L3.cu)| f16| f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)| f16| f32| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+ | ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https://github.com/xlite-dev/ffpa-attn/blob/main/csrc/cuffpa/ffpa_attn_F16F16F32_L3.cu)| f16| QK f32, PV f16| [link](https://github.com/xlite-dev/ffpa-attn)| ⭐️⭐️⭐️⭐️|
+
+ 💡NOTE: 🤖[ffpa-attn](https://github.com/xlite-dev/ffpa-attn): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 ~1.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https://github.com/xlite-dev/ffpa-attn?tab=readme-ov-file#L1-bench-4090).

### 📚 Triton Kernel (OpenAI Triton) ([©️back👆🏻](#cuda-kernel))