Commit 951c861

misc: manually update submodules (#318)
* misc: manually update submodules
* misc: manually update submodules
* misc: manually update submodules
1 parent 03adf53 commit 951c861

8 files changed: 26 additions, 25 deletions


.dev/update_submodules.sh

Lines changed: 1 addition & 2 deletions
@@ -2,7 +2,6 @@
 set -x
 git submodule init
 git submodule update --remote # update all submodule
-# git submodule update --remote ffpa-attn-mma # only update ffpa-attn-mma
 git add .
-git commit -m "misc: Automated submodule update"
+git commit -m "misc: automated submodule update"
 set +x
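For context, the comment removed above documented how to scope the update to a single submodule instead of all of them. A minimal sketch of that narrower workflow is shown below; it reuses the post-rename submodule path `ffpa-attn` from this commit's .gitmodules, and the exact invocation is illustrative rather than part of the committed script.

```bash
#!/usr/bin/env bash
set -x

# Sketch: update only one submodule to its latest remote commit,
# rather than iterating over every submodule as update_submodules.sh does.
git submodule init
git submodule update --remote ffpa-attn   # restrict the update to a single path

# Stage and commit just the updated gitlink for that submodule.
git add ffpa-attn
git commit -m "misc: automated submodule update (ffpa-attn only)"

set +x
```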

.gitmodules

Lines changed: 6 additions & 6 deletions
@@ -1,9 +1,9 @@
 [submodule "third-party/cutlass"]
 	path = third-party/cutlass
 	url = https://github.com/NVIDIA/cutlass.git
-[submodule "ffpa-attn-mma"]
-	path = ffpa-attn-mma
-	url = https://github.com/xlite-dev/ffpa-attn-mma.git
-[submodule "hgemm-tensorcores-mma"]
-	path = hgemm-tensorcores-mma
-	url = https://github.com/xlite-dev/hgemm-tensorcores-mma
+[submodule "ffpa-attn"]
+	path = ffpa-attn
+	url = https://github.com/xlite-dev/ffpa-attn.git
+[submodule "HGEMM"]
+	path = HGEMM
+	url = https://github.com/xlite-dev/HGEMM.git

HGEMM

Submodule HGEMM added at eee72be
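The .gitmodules hunk above re-points two submodules to renamed repositories (ffpa-attn-mma → ffpa-attn, hgemm-tensorcores-mma → HGEMM). As a rough sketch, and not the exact commands recorded in this commit, such a rename is commonly carried out by removing the old submodule and re-adding it under the new path and URL:

```bash
#!/usr/bin/env bash
set -x

# Drop the old submodule registration, working-tree path, and cached clone.
git submodule deinit -f hgemm-tensorcores-mma
git rm -f hgemm-tensorcores-mma
rm -rf .git/modules/hgemm-tensorcores-mma

# Re-add the submodule under its new name, pointing at the renamed repository.
git submodule add https://github.com/xlite-dev/HGEMM.git HGEMM
git commit -m "misc: replace hgemm-tensorcores-mma submodule with HGEMM"

set +x
```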

README.md

Lines changed: 12 additions & 10 deletions
@@ -21,7 +21,8 @@
 </div>
 </div>
 
-📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX.
+📚 **LeetCUDA**: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM/CUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](./kernels/hgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn-mma⚡️](./kernels/flash-attn) using Tensor Cores with pure MMA PTX. ♥️ Please consider to leave a ⭐️ Star to support me, my bro ~ ♥️
+
 <div align="center">
 <p align="center">
 <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a>
@@ -50,6 +51,14 @@
 ## 📖 Contents
 <div id="contents"></div>
 <!---
+- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
+- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
+- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
+- [📖 100+ 高性能计算文章 💡💡](#my-blogs-part-1)
+- [📖 How to Contribute 👀👇](#contribute)
+--->
+
+- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
 - [📚 CUDA/Tensor Cores](#HGEMM-bench)
 - [📚 Tile Block(Br, Bc)](#HGEMM-bench)
 - [📚 Tile MMAs/Warps](#HGEMM-bench)
@@ -67,9 +76,6 @@
 - [📚 Split Q + Shared QKV](#mma-share-qkv)
 - [📚 Split Q + QK Tiling](#mma-tiling-qk)
 - [📚 Split Q + QKV Tiling](#mma-tiling-qkv)
-- [📖 How to Contribute? 👀👇](#contribute)
-- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
-- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
 - [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
 - [📚 Easy ⭐️](#cuda-kernel-easy-medium)
 - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)
@@ -87,14 +93,9 @@
 - [📚 CuTe系列详解与实践](#other-blogs)
 - [📚 GPU指令集架构精解](#other-blogs)
 - [📚 GPU通信架构精解](#other-blogs)
--->
-
-- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)
-- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)
-- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)
-- [📖 100+ 高性能计算文章 💡💡](#my-blogs-part-1)
 - [📖 How to Contribute 👀👇](#contribute)
 
+
 ## 📖 HGEMM Benchmark 🎉🎉
 
 <div id="HGEMM-bench"></div>
@@ -490,6 +491,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi
 |📖 CUTLASS/CuTe Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
 |:---|:---|:---|:---|:---|
 | ✔️ [mat_transpose_cute](./kernels/mat-transpose/mat_transpose_cute.cu)|f32|/|[link](./kernels/mat-transpose/)|⭐️⭐️|
+| ✔️ [flash_attn_cute(naive)](./kernels/flash-attn/flash_attn_cute.cu)|f16|f32|[link](./kernels/flash-attn/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./kernels/hgemm/)|⭐️⭐️⭐️|
 
 ## 📖 100+ 高性能计算与分布式-技术博客

ffpa-attn

Submodule ffpa-attn added at 3b90cc1

ffpa-attn-mma

Lines changed: 0 additions & 1 deletion
This file was deleted.

hgemm-tensorcores-mma

Lines changed: 0 additions & 1 deletion
This file was deleted.

kernels/hgemm/README.md

Lines changed: 5 additions & 5 deletions
@@ -3,7 +3,7 @@
 
 ![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)
 
-[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library that write many HGEMM kernels from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API, thus, can achieve `98%~100%` performance of **cuBLAS**. The codes here are source from 📖[CUDA-Learn-Notes](https://github.com/xlite-dev/CUDA-Learn-Notes) ![](https://img.shields.io/github/stars/xlite-dev/CUDA-Learn-Notes.svg?style=social) and exported as a standalone library, please checkout [CUDA-Learn-Notes](https://github.com/xlite-dev/CUDA-Learn-Notes) for latest updates. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
+[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library that write many HGEMM kernels from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API, thus, can achieve `98%~100%` performance of **cuBLAS**. The codes here are source from 📖[LeetCUDA](https://github.com/xlite-dev/LeetCUDA) ![](https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=social) and exported as a standalone library, please checkout [LeetCUDA](https://github.com/xlite-dev/LeetCUDA) for latest updates. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
 
 <div id="hgemm-sgemm"></div>
 
@@ -27,10 +27,10 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
 ## ©️Citations🎉🎉
 
 ```BibTeX
-@misc{hgemm-tensorcores-mma@2024,
-  title={hgemm-tensorcores-mma: Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.},
-  url={https://github.com/xlite-dev/hgemm-tensorcores-mma},
-  note={Open-source software available at https://github.com/xlite-dev/hgemm-tensorcores-mma},
+@misc{HGEMM@2024,
+  title={HGEMM: Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.},
+  url={https://github.com/xlite-dev/HGEMM},
+  note={Open-source software available at https://github.com/xlite-dev/HGEMM},
   author={xlite-dev etc},
   year={2024}
 }
