|
21 | 21 | </div>
22 | 22 | </div>
23 | 23 |
|
24 | | -📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX.
| 24 | +📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` of **cuBLAS** TFLOPS, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. ♥️ Please consider leaving a ⭐️ Star to support this project ~ ♥️
| 25 | + |
25 | 26 | <div align="center"> |
26 | 27 | <p align="center"> |
27 | 28 | <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a> |
|
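For readers new to the "pure MMA PTX" approach mentioned in the description above, the sketch below shows what a single warp-level Tensor Core MMA looks like when issued as raw PTX instead of going through WMMA or CUTLASS. It is a minimal illustration only, not LeetCUDA's exact helper: the function name `hmma_m16n8k16_f32` and the fragment-array layout are assumptions made for this example; see the kernels under `./kernels/hgemm` and `./kernels/flash-attn` for the real implementations.

```cuda
#include <cstdint>

// Minimal sketch (illustrative, not the repo's code): one warp-synchronous
// m16n8k16 Tensor Core MMA on sm_80+, D = A * B + C with f16 inputs and
// f32 accumulators, issued via inline PTX.
// RA: 8 f16 A-fragment values packed into 4 x 32-bit registers per lane.
// RB: 4 f16 B-fragment values packed into 2 x 32-bit registers per lane.
// RC/RD: the 4 f32 accumulator values owned by this lane.
__device__ __forceinline__ void hmma_m16n8k16_f32(float *RD,
                                                  const uint32_t *RA,
                                                  const uint32_t *RB,
                                                  const float *RC) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
      : "=f"(RD[0]), "=f"(RD[1]), "=f"(RD[2]), "=f"(RD[3])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]),
        "f"(RC[0]), "f"(RC[1]), "f"(RC[2]), "f"(RC[3]));
}
```

In the real kernels the A/B fragments are typically staged through shared memory and loaded with `ldmatrix.sync.aligned`, so that each lane already holds its registers in the layout this instruction expects.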
50 | 51 | ## 📖 Contents |
51 | 52 | <div id="contents"></div> |
52 | 53 | <!--- |
| 54 | +- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
| 55 | +- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
| 56 | +- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
| 57 | +- [📖 100+ HPC Tech Articles 💡💡](#my-blogs-part-1)
| 58 | +- [📖 How to Contribute 👀👇](#contribute) |
| 59 | +---> |
| 60 | + |
| 61 | +- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
53 | 62 | - [📚 CUDA/Tensor Cores](#HGEMM-bench) |
54 | 63 | - [📚 Tile Block(Br, Bc)](#HGEMM-bench) |
55 | 64 | - [📚 Tile MMAs/Warps](#HGEMM-bench) |
|
67 | 76 | - [📚 Split Q + Shared QKV](#mma-share-qkv) |
68 | 77 | - [📚 Split Q + QK Tiling](#mma-tiling-qk) |
69 | 78 | - [📚 Split Q + QKV Tiling](#mma-tiling-qkv) |
70 | | -- [📖 How to Contribute? 👀👇](#contribute) |
71 | | -- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
72 | | -- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
73 | 79 | - [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
74 | 80 | - [📚 Easy ⭐️](#cuda-kernel-easy-medium) |
75 | 81 | - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium) |
|
87 | 93 | - [📚 CuTe Series: Detailed Explanations & Practice](#other-blogs)
88 | 94 | - [📚 GPU Instruction Set Architecture (ISA) Deep Dive](#other-blogs)
89 | 95 | - [📚 GPU Communication Architecture Deep Dive](#other-blogs)
90 | | ---> |
91 | | - |
92 | | -- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
93 | | -- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
94 | | -- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
95 | | -- [📖 100+ HPC Tech Articles 💡💡](#my-blogs-part-1)
96 | 96 | - [📖 How to Contribute 👀👇](#contribute) |
97 | 97 |
|
| 98 | + |
98 | 99 | ## 📖 HGEMM Benchmark 🎉🎉 |
99 | 100 |
|
100 | 101 | <div id="HGEMM-bench"></div> |
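A note on how the `98%~100%` of cuBLAS figure referenced above is typically obtained: the hand-written HGEMM kernel and a cuBLAS baseline are run repeatedly on the same M/N/K shapes, timed with CUDA events, and converted to TFLOPS as `2*M*N*K / seconds`. The sketch below is a generic cuBLAS FP16 baseline under those assumptions, not the repo's own benchmark script; the function name `bench_cublas_hgemm` and its parameters are made up for illustration.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Rough sketch of a cuBLAS FP16 baseline for a TFLOPS comparison:
// run C = A * B with Tensor Cores, time it with CUDA events, and
// report 2*M*N*K*repeat / elapsed_seconds.
float bench_cublas_hgemm(int M, int N, int K, const half *A, const half *B,
                         half *C, int repeat = 100) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < repeat; ++i) {
    // Column-major GEMM; f16 inputs/outputs with f16 accumulation,
    // matching the f16/f16 rows in the kernel tables of this README.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha,
                 A, CUDA_R_16F, M, B, CUDA_R_16F, K, &beta,
                 C, CUDA_R_16F, M, CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double tflops = 2.0 * M * N * K * repeat / (ms * 1e-3) / 1e12;
  printf("cuBLAS HGEMM: %.2f TFLOPS\n", tflops);

  cublasDestroy(handle);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return static_cast<float>(tflops);
}
```

The custom kernel is then timed the same way on identical shapes, and the ratio of the two TFLOPS numbers gives the percentage reported in the benchmark tables.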
@@ -490,6 +491,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi |
490 | 491 | |📖 CUTLASS/CuTe Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level | |
491 | 492 | |:---|:---|:---|:---|:---| |
492 | 493 | | ✔️ [mat_transpose_cute](./kernels/mat-transpose/mat_transpose_cute.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️| |
| 494 | +| ✔️ [flash_attn_cute(naive)](./kernels/flash-attn/flash_attn_cute.cu)|f16|f32|[link](./kernels/flash-attn/)|⭐️⭐️⭐️| |
493 | 495 | | ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| |
494 | 496 |
|
495 | 497 | ## 📖 100+ HPC & Distributed Systems Tech Blogs
|