Commit c8089fe (Update README.md, 1 parent 76b7464)

kernels/hgemm/README.md: 1 file changed, 66 additions, 32 deletions
@@ -1,8 +1,9 @@

## ⚡️⚡️Toy-HGEMM Library: Achieve 98%~100% of cuBLAS Performance

![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)

[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library of HGEMM kernels written from scratch using Tensor Cores with the WMMA, MMA PTX and CuTe APIs; it achieves `98%~100%` of **cuBLAS** performance. The code here is sourced from 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) ![](https://img.shields.io/github/stars/DefTruth/CUDA-Learn-Notes.svg?style=social) and exported as a standalone library; please check out [CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) for the latest updates. Welcome to 🌟👆🏻 star this repo to support me, thanks ~ 🎉🎉

<div id="hgemm-sgemm"></div>

@@ -27,6 +28,18 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
|Collective Store (Warp Shuffle & Reg Reuse)|Row Major (NN)|Col Major (TN)|SGEMM FP32/TF32|
|✔️|✔️|✔️|✔️|

## ©️Citations🎉🎉

```BibTeX
@misc{hgemm-tensorcores-mma2024,
  title={hgemm-tensorcores-mma: Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.},
  url={https://github.com/DefTruth/hgemm-tensorcores-mma},
  note={Open-source software available at https://github.com/DefTruth/hgemm-tensorcores-mma},
  author={DefTruth and others},
  year={2024}
}
```

## 📖 HGEMM CUDA Kernels in Toy-HGEMM Library 🎉🎉

<div id="kernels"></div>
@@ -70,39 +83,40 @@ void hgemm_mma_stages_block_swizzle_tn_cute(torch::Tensor a, torch::Tensor b, to
## 📖 Contents

- [📖 Installation](#install)
- [📖 Python/C++ Testing](#test)
- [📖 NVIDIA L20 bench](#perf-l20)
- [📖 NVIDIA RTX 4090 bench](#perf-4090)
- [📖 NVIDIA RTX 3080 Laptop bench](#perf-3080)
- [📖 Docs](#opt-docs)
- [📖 References](#ref)

## 📖 Installation

<div id="install"></div>

The HGEMM kernels implemented in this repo can be installed as a Python library, namely the `toy-hgemm` library (optional).

```bash
cd kernels/hgemm
git submodule update --init --recursive --force # fetch the CUTLASS submodule (required)
python3 setup.py bdist_wheel && cd dist && python3 -m pip install *.whl # uninstall via: pip uninstall toy-hgemm -y
```

## 📖 Python/C++ Testing

<div id="test"></div>

**CUTLASS**: Fetch the `CUTLASS` submodule. Currently, `v3.5.1` is used for the HGEMM CuTe kernels.
```bash
git submodule update --init --recursive --force
```

**Python**: Test the many custom HGEMM kernels via the Python script and compare their performance.

```bash
# Restrict compilation to a single arch (e.g. Ada or Ampere); if unset, all archs
# (Volta, Ampere, Ada, Hopper, ...) are compiled by default, which takes much longer.
export TORCH_CUDA_ARCH_LIST=Ada # for Ada only
export TORCH_CUDA_ARCH_LIST=Ampere # for Ampere only
python3 hgemm.py --wmma # test default wmma kernels for all MNK
python3 hgemm.py --mma # test default mma kernels for all MNK
python3 hgemm.py --M 16384 --N 16384 --K 8192 --i 10 --wmma # test default wmma kernels for specific MNK
@@ -112,16 +126,21 @@ python3 hgemm.py --mma-all # test all mma kernels for all MNK
python3 hgemm.py --cuda-all --wmma-all --mma-all # test all kernels for all MNK
python3 hgemm.py --cute-tn --no-default # test cute hgemm kernels with smem swizzle for all MNK
```

If you want to draw a TFLOPS curve, install `matplotlib` first and pass the `--plot-flops` (or `--plot`) option.
```bash
python3 -m pip install matplotlib
# Pass --topk to plot only the k best-performing kernels.
python3 hgemm.py --mma-all --plot --topk 8
# test default mma kernels & cute hgemm kernels with smem swizzle for all MNK
python3 hgemm.py --cute-tn --mma --plot
```

**C++**: The HGEMM benchmark also supports C++ testing. Currently, it supports comparisons between the following implementations:

- MMA HGEMM NN implemented in this repository
- CuTe HGEMM TN implemented in this repository
- cuBLAS HGEMM TN using the default Tensor Cores math algorithm

Performance data obtained from the C++ binary tests tend to be slightly better than those from the Python tests. This difference may be attributed to additional overhead introduced by the PyTorch Python bindings.
```bash
make
./hgemm_mma_stage.bin
@@ -155,21 +174,32 @@ M N K = 16128 16128 16128, Time = 0.07319142 0.07320709 0.07326925 s, A
M N K = 16384 16384 16384, Time = 0.07668429 0.07669371 0.07670784 s, AVG Performance = 114.6912 Tflops
```
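
As a sanity check on the `AVG Performance` column: it matches the standard GEMM accounting of `2*M*N*K` floating-point operations per multiply. The helper below is a hedged illustration of that formula, not code from this repo:

```cuda
// Standard GEMM FLOP accounting: one M x N x K GEMM performs
// 2 * M * N * K floating-point operations (one multiply + one add each).
double gemm_tflops(double M, double N, double K, double seconds) {
  return 2.0 * M * N * K / seconds / 1e12;
}
// e.g. 2 * 16384^3 / 0.07669371 s ~= 114.69 TFLOPS, matching the log above.
```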

## 📖 Benchmark

<div id="perf-l20"></div>

### NVIDIA L20

The current best implementation, on the L20 (theoretical Tensor Cores FP16 peak: 119.5 TFLOPS), achieves approximately 99%~100%+ of cuBLAS performance overall.

- Using the WMMA API, it reaches around 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS); a minimal WMMA sketch follows this list.
- Using the MMA API, it reaches 115 TFLOPS, surpassing cuBLAS in some cases.
- The CuTe version of HGEMM implements Block Swizzle (L2 Cache friendly) and SMEM Swizzle (bank conflict free), achieving the best performance. For large-scale matrix multiplication, it reaches 116-117 TFLOPS, approximately 98%~100%+ of cuBLAS performance, and it outperforms cuBLAS in many cases.
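
For readers new to the WMMA API mentioned above, here is a minimal, self-contained sketch of the m16n16k16 building block such kernels start from. It is illustrative only (one warp per 16x16 C tile) and is not one of this repo's kernels, which add shared-memory staging, multi-stage pipelining and swizzling on top:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A @ B (row-major, FP16 inputs,
// FP32 accumulate). Launch with 32 threads per block and grid (N/16, M/16);
// M, N, K are assumed to be multiples of 16.
__global__ void hgemm_wmma_naive(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);

  const int tile_m = blockIdx.y * 16;  // this warp's C tile origin
  const int tile_n = blockIdx.x * 16;
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);  // lda = K
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);  // ldb = N
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);         // c += a * b
  }
  wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N, wmma::mem_row_major);
}
```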

Currently, SMEM Padding and SMEM Swizzle are used to mitigate bank conflicts (a padding sketch follows this list):

- For the NN layout, SMEM Padding is used to alleviate bank conflicts.
- For the TN layout, CUTLASS/CuTe's SMEM Swizzle is used to eliminate bank conflicts.
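
To make the padding idea concrete, here is a toy (non-HGEMM) sketch of the technique; the tile size and pad value are illustrative choices, and the TN path replaces this trick with CUTLASS/CuTe's XOR-based swizzle, which avoids spending extra shared-memory bytes:

```cuda
#include <cuda_fp16.h>

constexpr int TILE = 32;
constexpr int PAD  = 2;  // 2 extra half elements (4 bytes) of skew per row

// Toy transposed copy showing why padding helps. Shared memory has 32
// four-byte banks. Without "+ PAD", the column read s_tile[threadIdx.x][...]
// strides 64 B between lanes, hitting only 2 banks (a 16-way conflict).
// With PAD = 2 the row stride is 68 B = 17 banks, and gcd(17, 32) = 1,
// so the 32 lanes of a warp touch 32 different banks.
// Launch with blockDim = dim3(TILE, TILE), grid (n/TILE, n/TILE).
__global__ void transpose_padded(const half* in, half* out, int n) {
  __shared__ half s_tile[TILE][TILE + PAD];

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  if (x < n && y < n) s_tile[threadIdx.y][threadIdx.x] = in[y * n + x];
  __syncthreads();

  int tx = blockIdx.y * TILE + threadIdx.x;  // transposed tile origin
  int ty = blockIdx.x * TILE + threadIdx.y;
  if (tx < n && ty < n) out[ty * n + tx] = s_tile[threadIdx.x][threadIdx.y];
}
```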

<div id="NV-L20"></div>

![NVIDIA_L20_NN+TN+v2](https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99)

The command for testing all MNK setups (tip: performance data for each MNK tested individually is more accurate):
```bash
python3 hgemm.py --cute-tn --mma --plot
```
@@ -178,7 +208,14 @@ python3 hgemm.py --cute-tn --mma --plot

<div id="perf-4090"></div>

On the NVIDIA RTX 4090 (FP16 Tensor Cores peak: 330 TFLOPS), the WMMA (m16n16k16) implementation performs better than the MMA (m16n8k16) one. For most MNK configurations, this repository's implementation achieves 95%~99% of cuBLAS performance, and in certain cases it surpasses cuBLAS. Specifically (a sketch of the raw m16n8k16 instruction follows this list):

- For large-scale matrix multiplications (MNK >= 8192), the WMMA implementation performs better.
- For small-scale matrix multiplications, the MMA implementation is more efficient.
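
For reference, the m16n8k16 shape named above corresponds to a single `mma.sync` PTX instruction (sm_80+). The wrapper below is a hedged sketch of the raw instruction such MMA kernels are built around, not this repo's exact helper; filling the fragment registers (e.g. via `ldmatrix`) is omitted:

```cuda
#include <cstdint>

// One warp-wide Tensor Core op: D(16x8) = A(16x16) * B(16x8) + C(16x8),
// FP16 inputs with FP16 accumulate. Each of the 32 lanes holds a fixed
// slice of the fragments in the 32-bit registers below, as laid out in
// the PTX ISA for mma.m16n8k16.
__device__ __forceinline__ void mma_m16n8k16_f16(
    uint32_t d[2], const uint32_t a[4], const uint32_t b[2], const uint32_t c[2]) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(d[0]), "=r"(d[1])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "r"(c[0]), "r"(c[1]));
}
```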

![NVIDIA_GeForce_RTX_4090_NN+TN+v4](https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85)
@@ -191,8 +228,10 @@ python3 hgemm.py --cute-tn --mma --wmma-all --plot

<div id="perf-3080"></div>

Tested on an NVIDIA GeForce RTX 3080 Laptop under Windows WSL2, using the mma4x4_warp4x4 configuration (16 WMMA m16n16k16 ops per warp, i.e. a 64x64 warp tile) together with Thread Block Swizzle. In most cases, this setup matches or even exceeds cuBLAS performance; a sketch of the warp-tile layout follows.
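
A hedged reading of mma4x4_warp4x4 (names and staging below are illustrative, not this repo's exact code): each warp owns a 64x64 tile of C, i.e. a 4x4 grid of WMMA m16n16k16 accumulators, so every K-step of 16 issues 16 `mma_sync` ops while reusing each A/B fragment four times:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One K-slice (k = 16) of a 64x64 warp tile: 4 A fragments x 4 B fragments
// feed 16 WMMA ops that update the warp's 4x4 grid of accumulators.
// a_smem points at the warp's 64x16 A slice, b_smem at its 16x64 B slice,
// both row-major in shared memory with strides lda / ldb.
__device__ void warp_tile_64x64_step(
    const half* a_smem, int lda, const half* b_smem, int ldb,
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> (&c_frag)[4][4]) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag[4];
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag[4];
  for (int i = 0; i < 4; ++i)
    wmma::load_matrix_sync(a_frag[i], a_smem + i * 16 * lda, lda);
  for (int j = 0; j < 4; ++j)
    wmma::load_matrix_sync(b_frag[j], b_smem + j * 16, ldb);
  for (int i = 0; i < 4; ++i)        // 4 x 4 = 16 mma_sync per K-step,
    for (int j = 0; j < 4; ++j)      // each fragment reused 4 times
      wmma::mma_sync(c_frag[i][j], a_frag[i], b_frag[j], c_frag[i][j]);
}
```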

![image](https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078)

@@ -201,9 +240,9 @@ python3 hgemm.py --wmma-all --plot

<details>
<summary> 🔑️ Performance Optimization Notes (TODO)! Click here! </summary>

## 📖 Performance Optimization Notes

<div id="opt-docs"></div>

@@ -312,11 +351,6 @@ TODO

<div id="ref"></div>

- [flash-attention-minimal](https://github.com/tspeterkim/flash-attention-minimal)
- [tiny-flash-attention](https://github.com/66RING/tiny-flash-attention)
- [cute-gemm](https://github.com/reed-lau/cute-gemm)
