|
21 | 21 | </div>
22 | 22 | </div>
23 | 23 |
|
24 | | -📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX.
| 24 | +📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` of **cuBLAS** TFLOPS, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. ♥️ Please consider leaving a ⭐️ Star to support this project ~ ♥️
| 25 | + |
25 | 26 | <div align="center"> |
26 | 27 | <p align="center"> |
27 | 28 | <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a> |
|
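For readers new to the "pure MMA PTX" approach mentioned in the description above, the sketch below shows what a single warp-level Tensor Core MMA looks like when issued as raw PTX instead of going through WMMA or CUTLASS. It is a minimal illustration only, not LeetCUDA's exact helper: the function name `hmma_m16n8k16_f32` and the fragment-array layout are assumptions made for this example; see the kernels under `./kernels/hgemm` and `./kernels/flash-attn` for the real implementations.

```cuda
#include <cstdint>

// Minimal sketch (illustrative, not the repo's code): one warp-synchronous
// m16n8k16 Tensor Core MMA on sm_80+, D = A * B + C with f16 inputs and
// f32 accumulators, issued via inline PTX.
// RA: 8 f16 A-fragment values packed into 4 x 32-bit registers per lane.
// RB: 4 f16 B-fragment values packed into 2 x 32-bit registers per lane.
// RC/RD: the 4 f32 accumulator values owned by this lane.
__device__ __forceinline__ void hmma_m16n8k16_f32(float *RD,
                                                  const uint32_t *RA,
                                                  const uint32_t *RB,
                                                  const float *RC) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
      "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
      : "=f"(RD[0]), "=f"(RD[1]), "=f"(RD[2]), "=f"(RD[3])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]),
        "f"(RC[0]), "f"(RC[1]), "f"(RC[2]), "f"(RC[3]));
}
```

In the real kernels the A/B fragments are typically staged through shared memory and loaded with `ldmatrix.sync.aligned`, so that each lane already holds its registers in the layout this instruction expects.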
50 | 51 | ## 📖 Contents |
51 | 52 | <div id="contents"></div> |
52 | 53 | <!--- |
| 54 | +- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
| 55 | +- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
| 56 | +- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
| 57 | +- [📖 100+ HPC Tech Articles 💡💡](#my-blogs-part-1)
| 58 | +- [📖 How to Contribute 👀👇](#contribute) |
| 59 | +---> |
| 60 | + |
| 61 | +- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
53 | 62 | - [📚 CUDA/Tensor Cores](#HGEMM-bench) |
54 | 63 | - [📚 Tile Block(Br, Bc)](#HGEMM-bench) |
55 | 64 | - [📚 Tile MMAs/Warps](#HGEMM-bench) |
|
67 | 76 | - [📚 Split Q + Shared QKV](#mma-share-qkv) |
68 | 77 | - [📚 Split Q + QK Tiling](#mma-tiling-qk) |
69 | 78 | - [📚 Split Q + QKV Tiling](#mma-tiling-qkv) |
70 | | -- [📖 How to Contribute? 👀👇](#contribute) |
71 | | -- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
72 | | -- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
73 | 79 | - [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
74 | 80 | - [📚 Easy ⭐️](#cuda-kernel-easy-medium) |
75 | 81 | - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium) |
|
87 | 93 | - [📚 CuTe Series: Detailed Explanations & Practice](#other-blogs)
88 | 94 | - [📚 GPU Instruction Set Architecture (ISA) Deep Dive](#other-blogs)
89 | 95 | - [📚 GPU Communication Architecture Deep Dive](#other-blogs)
90 | | ---> |
91 | | - |
92 | | -- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench) |
93 | | -- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench) |
94 | | -- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel) |
95 | | -- [📖 100+ HPC Tech Articles 💡💡](#my-blogs-part-1)
96 | 96 | - [📖 How to Contribute 👀👇](#contribute) |
97 | 97 |
|
| 98 | + |
98 | 99 | ## 📖 HGEMM Benchmark 🎉🎉 |
99 | 100 |
|
100 | 101 | <div id="HGEMM-bench"></div> |
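A note on how the `98%~100%` of cuBLAS figure referenced above is typically obtained: the hand-written HGEMM kernel and a cuBLAS baseline are run repeatedly on the same M/N/K shapes, timed with CUDA events, and converted to TFLOPS as `2*M*N*K / seconds`. The sketch below is a generic cuBLAS FP16 baseline under those assumptions, not the repo's own benchmark script; the function name `bench_cublas_hgemm` and its parameters are made up for illustration.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Rough sketch of a cuBLAS FP16 baseline for a TFLOPS comparison:
// run C = A * B with Tensor Cores, time it with CUDA events, and
// report 2*M*N*K*repeat / elapsed_seconds.
float bench_cublas_hgemm(int M, int N, int K, const half *A, const half *B,
                         half *C, int repeat = 100) {
  cublasHandle_t handle;
  cublasCreate(&handle);
  const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < repeat; ++i) {
    // Column-major GEMM; f16 inputs/outputs with f16 accumulation,
    // matching the f16/f16 rows in the kernel tables of this README.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, &alpha,
                 A, CUDA_R_16F, M, B, CUDA_R_16F, K, &beta,
                 C, CUDA_R_16F, M, CUBLAS_COMPUTE_16F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double tflops = 2.0 * M * N * K * repeat / (ms * 1e-3) / 1e12;
  printf("cuBLAS HGEMM: %.2f TFLOPS\n", tflops);

  cublasDestroy(handle);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return static_cast<float>(tflops);
}
```

The custom kernel is then timed the same way on identical shapes, and the ratio of the two TFLOPS numbers gives the percentage reported in the benchmark tables.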
@@ -490,6 +491,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi |
490 | 491 | |📖 CUTLASS/CuTe Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level | |
491 | 492 | |:---|:---|:---|:---|:---| |
492 | 493 | | ✔️ [mat_transpose_cute](./kernels/mat-transpose/mat_transpose_cute.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️| |
| 494 | +| ✔️ [flash_attn_cute(naive)](./kernels/flash-attn/flash_attn_cute.cu)|f16|f32|[link](./kernels/flash-attn/)|⭐️⭐️⭐️| |
493 | 495 | | ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️| |
494 | 496 |
|
495 | 497 | ## 📖 100+ HPC & Distributed Systems Tech Blogs
|