|
 </div>
 </div>

-📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX.
+📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch bindings, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm), which achieves `98%~100%` of **cuBLAS** TFLOPS, and [📖flash-attn-mma⚡️](./kernels/flash-attn), built on Tensor Cores with pure MMA PTX. ♥️ Please consider leaving a ⭐️ Star to support this project ~ ♥️
+
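The "pure MMA PTX" mentioned above means the Tensor Cores are programmed directly through the PTX `mma.sync` instruction rather than through WMMA or CUTLASS. As a rough orientation, here is a minimal, hypothetical sketch (not code from this repo; the file and kernel names are made up): one warp issues a single `m16n8k16` f16 MMA with f32 accumulators and writes back the 16x8 tile. Real kernels such as HGEMM and flash-attn-mma additionally stage tiles through shared memory with `ldmatrix`, pipeline global loads, and loop over K.

```cuda
// mma_demo.cu -- hypothetical warm-up, not part of this repo.
// One warp, one mma.sync.m16n8k16: D = A * B + C with f16 inputs, f32 accumulators.
#include <cstdio>
#include <cstdint>
#include <cuda_fp16.h>

__global__ void mma_m16n8k16_demo(float* C) {
  // Register fragments as required by mma.m16n8k16: A -> 4 x b32 (8 half),
  // B -> 2 x b32 (4 half), C/D -> 4 x f32 per thread.
  uint32_t RA[4], RB[2];
  float RC[4] = {0.f, 0.f, 0.f, 0.f};

  // Fill A and B with 1.0h so every output element equals K = 16,
  // independent of the per-thread fragment layout.
  half2 ones = __floats2half2_rn(1.0f, 1.0f);
  uint32_t ones_u32 = *reinterpret_cast<uint32_t*>(&ones);
  for (int i = 0; i < 4; ++i) RA[i] = ones_u32;
  for (int i = 0; i < 2; ++i) RB[i] = ones_u32;

  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
      : "+f"(RC[0]), "+f"(RC[1]), "+f"(RC[2]), "+f"(RC[3])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]));

  // Each of the 32 lanes owns 4 elements of the 16x8 output tile.
  int lane = threadIdx.x % 32;
  for (int i = 0; i < 4; ++i) C[lane * 4 + i] = RC[i];
}

int main() {
  float *dC = nullptr, hC[16 * 8];
  cudaMalloc(&dC, sizeof(hC));
  mma_m16n8k16_demo<<<1, 32>>>(dC);  // exactly one warp
  cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
  printf("C[0] = %.1f (expected 16.0)\n", hC[0]);
  cudaFree(dC);
  return 0;
}
```

Build with something like `nvcc -arch=sm_80 mma_demo.cu` (the f16 `m16n8k16` shape needs Ampere or newer); since both inputs are all ones, every element of the output tile should read 16.0.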
 <div align="center">
 <p align="center">
 <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a>
|
 ## 📖 Contents
 <div id="contents"></div>
 <!---
+- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
+- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
+- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
+- [📖 100+ High-Performance Computing Articles 💡💡](#my-blogs-part-1)
+- [📖 How to Contribute 👀👇](#contribute)
+--->
+
+- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
 - [📚 CUDA/Tensor Cores](#HGEMM-bench)
 - [📚 Tile Block(Br, Bc)](#HGEMM-bench)
 - [📚 Tile MMAs/Warps](#HGEMM-bench)
|
 - [📚 Split Q + Shared QKV](#mma-share-qkv)
 - [📚 Split Q + QK Tiling](#mma-tiling-qk)
 - [📚 Split Q + QKV Tiling](#mma-tiling-qkv)
-- [📖 How to Contribute? 👀👇](#contribute)
-- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
-- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
 - [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
 - [📚 Easy ⭐️](#cuda-kernel-easy-medium)
 - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)
|
 - [📚 CuTe Series: Detailed Explanations and Practice](#other-blogs)
 - [📚 GPU Instruction Set Architecture (ISA) Deep Dive](#other-blogs)
 - [📚 GPU Communication Architecture Deep Dive](#other-blogs)
----->
-
-- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
-- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
-- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
-- [📖 100+ High-Performance Computing Articles 💡💡](#my-blogs-part-1)
 - [📖 How to Contribute 👀👇](#contribute)

+
 ## 📖 HGEMM Benchmark 🎉🎉

 <div id="HGEMM-bench"></div>
@@ -490,6 +491,7 @@ The kernels listed here will guide you through a step-by-step progression, ranging …
|
 |📖 CUTLASS/CuTe Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
 |:---|:---|:---|:---|:---|
 | ✔️ [mat_transpose_cute](./kernels/mat-transpose/mat_transpose_cute.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️|
+| ✔️ [flash_attn_cute(naive)](./kernels/flash-attn/flash_attn_cute.cu)|f16|f32|[link](./kernels/flash-attn/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️|
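The CUTLASS/CuTe kernels in this table are written against CuTe's Layout/Tensor vocabulary rather than hand-rolled index arithmetic. As a minimal, hypothetical warm-up (not code from this repo, and assuming the CUTLASS headers are on your include path), the sketch below builds a static row-major layout and shows the coordinate-to-offset mapping that tiled copies and MMAs are composed from.

```cuda
// cute_layout_demo.cu -- hypothetical warm-up, not part of this repo.
// Requires the CUTLASS/CuTe headers: https://github.com/NVIDIA/cutlass
#include <cstdio>
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // A compile-time 4x8 row-major layout: shape (4, 8), strides (8, 1).
  auto layout = make_layout(make_shape(Int<4>{}, Int<8>{}),
                            make_stride(Int<8>{}, Int<1>{}));

  print_layout(layout);  // pretty-prints the full coordinate -> offset grid

  // A Layout is a function from logical coordinates to linear offsets:
  // (row = 1, col = 2) maps to 1 * 8 + 2 * 1 = 10.
  printf("offset of (1, 2) = %d\n", int(layout(1, 2)));
  return 0;
}
```

Compile with `nvcc -std=c++17 -I<cutlass>/include cute_layout_demo.cu`; kernels like `mat_transpose_cute` and `hgemm_mma_stage_tn_cute` build on this same Layout algebra, composing it into tiled global-to-shared copies and MMA partitions.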
|
 ## 📖 100+ High-Performance Computing & Distributed Systems Technical Blogs
|
|