Commit 6ea2eb9

[HGEMM] Release toy-hgemm library 0.1.0 (#145)
* Refactor HGEMM codes
* Refactor HGEMM codes
* Refactor HGEMM codes
* Create utils.py
* Update hgemm.py
* Update setup.py
* Update hgemm.cc
* Update utils.py
* Update setup.py
* Create clear.sh
* Update setup.py
* Update utils.py
* Update hgemm.py
* Update utils.py
* Delete hgemm/utils.py
* Create utils.py
* Update utils.py
* Create clear.sh
* Create install.sh
* Delete hgemm/clear.sh
* Update hgemm.py
* Update utils.py
* Update setup.py
* Update README.md
* Update utils.py
* Update setup.py
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
1 parent 60d4ad2 commit 6ea2eb9

24 files changed (+459, −287 lines)

README.md

Lines changed: 22 additions & 21 deletions

@@ -16,7 +16,7 @@
 
 <div id="contents"></div>
 
-📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖HGEMM/SGEMM🔥🔥](#hgemm-sgemm) which has been fully optimized, check [📖HGEMM/SGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
+📚 **Modern CUDA Learn Notes with PyTorch** for Beginners: It includes **Tensor/CUDA Cores, TF32/F16/BF16/F8**, [📖150+ CUDA Kernels🔥🔥](#cuda-kernel) with PyTorch bindings, [📖30+ LLM/VLM🔥](#my-blogs-part-1), [📖40+ CV/C++...🔥](#my-blogs-part-2), [📖50+ CUDA/CuTe...🔥](#other-blogs) Blogs and [📖toy-hgemm library🔥🔥](./hgemm) which can achieve the performance of **cuBLAS**, check [📖HGEMM Supported Matrix👇](#hgemm-sgemm) for more details. Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉
 
 <div id="hgemm-sgemm"></div>
 

@@ -25,7 +25,7 @@
 <img src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height="225px" width="403px">
 </div>
 
-Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [hgemm benchmark](./hgemm) for more details.
+Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA)` implemented in this repo (`blue`🔵) can achieve `95%~99%` of its (`orange`🟠) performance. Please check [toy-hgemm library🔥🔥](./hgemm) for more details.
 
 |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|

@@ -202,26 +202,27 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3090 Laptop, compared with cuBLAS's d
 | ✔️ [sgemm_t_8x8_sliced_k16...async](./sgemm/sgemm_async.cu)|f32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_wmma_m16n16k8...stages*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
 | ✔️ [sgemm_wmma_m16n16k8...swizzle*](./sgemm/sgemm_wmma_tf32_stage.cu)|tf32|f32|[link](./sgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_naive_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
-| ✔️ [hgemm_sliced_k_f16](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_naive_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
+| ✔️ [hgemm_sliced_k_f16](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
 | ✔️ [hgemm_t_8x8_sliced_k_f16x4](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
-| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8x8_sliced_k...dbuf](./hgemm/naive/hgemm.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8/16x8...k16/32...dbuf](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_t_8/16x8...k16/32...async](./hgemm/naive/hgemm_async.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...naive*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...mma4x4*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./hgemm/wmma/hgemm_wmma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...stages*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_wmma_m16n16k16...swizzle*](./hgemm/wmma/hgemm_wmma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...naive*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./hgemm/mma/hgemm_mma.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...stages*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_m16n8k16...swizzle*](./hgemm/mma/hgemm_mma_stage.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_stages{swizzle}...cute*](./hgemm/cutlass/hgemm_mma_stage_tn_cute.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️⭐️|
+| ✔️ [hgemm_mma_cublas*](./hgemm/cublas/hgemm_cublas.cu)|f16|f16|[link](./hgemm/)|⭐️⭐️|
 | ✔️ [sgemv_k32_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k128_f32x4](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|
 | ✔️ [sgemv_k16_f32](./sgemv/sgemv.cu)|f32|f32|[link](./sgemv/)|⭐️⭐️⭐️|

hgemm/.gitignore

Lines changed: 6 additions & 0 deletions

@@ -18,3 +18,9 @@ __pycache__
 *.engine
 *.bin
 *.out
+*bin
+bin
+output
+*.egg-info
+*.whl
+dist

hgemm/README.md

Lines changed: 10 additions & 5 deletions

@@ -1,6 +1,4 @@
-# HGEMM
-
-## HGEMM/SGEMM Supported Matrix
+# 🔥🔥Toy-HGEMM Library: Achieve the performance of cuBLAS
 
 |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|

@@ -45,6 +43,13 @@
 
 </details>
 
+## Installation
+The HGEMM CUDA kernels implemented in this repo can be used as a Python library, toy-hgemm; the install commands are below. (Optional)
+```bash
+git submodule update --init --recursive --force
+bash tools/install.sh # uninstall with: pip uninstall toy-hgemm
+```
+
 ## Test commands
 
 **CUTLASS**: update the CUTLASS dependency

@@ -154,7 +159,7 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot
 
 Tested under Windows WSL2 on an NVIDIA GeForce RTX 3080 Laptop: with mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) plus thread block swizzle, most cases match or even exceed cuBLAS.
 
-![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
+![](./bench/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
 
 ```bash
 python3 hgemm.py --wmma-all --plot

@@ -175,7 +180,7 @@ sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize96x64x32_stage3_warpsize2x2x1_tens
 ```
 Therefore, only an HGEMM implementation that uses Tensor Cores can hope to approach PyTorch/cuBLAS performance.
 ```bash
-ncu -o hgemm.prof -f python3 prof.py
+ncu -o hgemm.prof -f python3 bench/prof.py
 nsys profile --stats=true -t cuda,osrt,nvtx -o hgemm.prof --force-overwrite true python3 prof.py
 ```
 - SASS (L20)
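The cuBLAS baseline these benchmarks compare against (`CUBLAS_GEMM_DEFAULT_TENSOR_OP`, now also exposed as the `hgemm_mma_cublas*` entry backed by `hgemm_cublas.cu`) boils down to a single `cublasGemmEx` call. Here is a hedged sketch of such an fp16 reference call; the repo's actual wrapper may choose a different layout or compute type, and the fp16 accumulation shown here is an assumption.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Row-major C[M,N] = A[M,K] * B[K,N] in half precision via cublasGemmEx.
// cuBLAS is column-major, so we compute C^T = B^T * A^T by swapping the
// operand order and dimensions; no explicit transpose is needed.
void hgemm_cublas_ref(cublasHandle_t handle, const half* dA, const half* dB,
                      half* dC, int M, int N, int K) {
  const half alpha = __float2half(1.0f);
  const half beta  = __float2half(0.0f);
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               N, M, K,                      // m, n, k in cuBLAS's view
               &alpha,
               dB, CUDA_R_16F, N,            // "A" operand = B, ld = N
               dA, CUDA_R_16F, K,            // "B" operand = A, ld = K
               &beta,
               dC, CUDA_R_16F, N,            // C, ld = N
               CUBLAS_COMPUTE_16F,           // fp16 accumulate (assumption)
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Timing this call with CUDA events yields the orange cuBLAS curve that the blue WMMA/MMA kernels are measured against in the plots referenced above.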
3 files renamed without changes.
