[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library of HGEMM kernels written from scratch using Tensor Cores with the WMMA, MMA PTX and CuTe APIs, achieving `98%~100%` of **cuBLAS** performance. The code here is sourced from 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) and exported as a standalone library; please check out 📖[CUDA-Learn-Notes](https://github.com/DefTruth/CUDA-Learn-Notes) for the latest updates. Welcome to 🌟👆🏻star this repo to support me, thanks ~ 🎉🎉
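As background for the kernels benchmarked below, a minimal WMMA-based HGEMM tile looks roughly like the sketch that follows. This is an illustrative fragment only (the library's actual kernels add shared-memory staging, multi-stage pipelining, swizzling and collective stores); the kernel name and launch mapping are assumptions, and M/N/K are assumed to be multiples of 16.

```cuda
// Illustrative sketch: one warp computes a single 16x16 tile of C = A * B with
// the WMMA API. Launch with grid = (N/16, M/16) and 32 threads per block.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void hgemm_wmma_naive_sketch(const half* A, const half* B, half* C,
                                        int M, int N, int K) {
  const int tile_n = blockIdx.x * 16;  // leftmost column of this C tile
  const int tile_m = blockIdx.y * 16;  // top row of this C tile

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // March along K in steps of 16; each mma_sync executes on the Tensor Cores.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }
  wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N, wmma::mem_row_major);
}
```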
<div id="hgemm-sgemm"></div>
|Collective Store (Warp Shuffle & Reg Reuse)|Row Major (NN)|Col Major (TN)|SGEMM FP32/TF32|
Performance data obtained from C++ binary tests tend to be slightly better than those from Python tests. This difference may be attributed to additional overhead introduced by the PyTorch Python bindings.
```bash
make
./hgemm_mma_stage.bin
M N K = 16384 16384 16384, Time = 0.07668429 0.07669371 0.07670784 s, AVG Performance = 114.6912 Tflops
```
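The `AVG Performance` figure above follows the usual 2·M·N·K FLOP count for GEMM. A rough sketch of how such a number can be measured with CUDA events is shown below; the real benchmark binaries may differ, and `launch_hgemm` is a placeholder for whichever kernel launcher is under test.

```cuda
// Sketch: time an HGEMM launcher with CUDA events and report TFLOPS.
// `launch_hgemm` is a placeholder, not an actual function of this library.
#include <cstdio>
#include <cuda_runtime.h>

double benchmark_tflops(void (*launch_hgemm)(int, int, int),
                        int M, int N, int K, int warmup = 2, int iters = 10) {
  for (int i = 0; i < warmup; ++i) launch_hgemm(M, N, K);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) launch_hgemm(M, N, K);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  const double sec    = (ms / 1e3) / iters;            // average seconds per GEMM
  const double tflops = 2.0 * M * N * K / sec / 1e12;  // 2*M*N*K FLOPs per GEMM
  printf("M N K = %d %d %d, Time = %.8lf s, AVG Performance = %.4lf Tflops\n",
         M, N, K, sec, tflops);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return tflops;
}
```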
## 📖 Benchmark
<div id="perf-l20"></div>
### NVIDIA L20
The current best implementation on the L20 (theoretical Tensor Core FP16 throughput of 119.5 TFLOPS) achieves approximately 99%~100%+ of cuBLAS performance.
- Using the WMMA API, it can achieve around 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS).
- Using the MMA API, it can reach 115 TFLOPS, surpassing cuBLAS in some cases (a minimal PTX sketch follows this list).
- The CuTe version of HGEMM implements Block Swizzle (L2 Cache friendly) and SMEM Swizzle (bank-conflict free), achieving the best performance. For large-scale matrix multiplication, it reaches 116-117 TFLOPS, approximately 98%~100%+ of cuBLAS performance, and it outperforms cuBLAS in many cases.
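For reference, the MMA path above is built around the `mma.sync.aligned.m16n8k16` Tensor Core PTX instruction (sm_80+). The wrapper below is a minimal sketch that only shows the instruction itself; the real kernels also handle `ldmatrix` loads, register reuse, multi-stage pipelining and swizzling around it.

```cuda
// Sketch: one m16n8k16 Tensor Core MMA with FP16 accumulation, D = A*B + C.
// Fragments are packed two halves per 32-bit register, distributed across the warp.
#include <cstdint>

__device__ __forceinline__ void hmma_m16n8k16_sketch(uint32_t* D,        // 2 regs: 16x8 output
                                                     const uint32_t* A,  // 4 regs: 16x16 A fragment
                                                     const uint32_t* B,  // 2 regs: 16x8  B fragment
                                                     const uint32_t* C) {// 2 regs: 16x8 addend
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(D[0]), "=r"(D[1])
      : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
        "r"(B[0]), "r"(B[1]),
        "r"(C[0]), "r"(C[1]));
}
```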
Currently, SMEM Padding and SMEM Swizzle are used to mitigate bank conflicts:
- For the NN layout, SMEM Padding is used to alleviate bank conflicts (see the sketch after this list).
- For the TN layout, CUTLASS/CuTe's SMEM Swizzle is used to eliminate bank conflicts.
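A minimal sketch of the SMEM Padding idea used on the NN path follows; the tile sizes, pad width and thread mapping are illustrative assumptions, not the repo's exact values.

```cuda
// Sketch: pad each shared-memory row so that accesses walking down a column of
// the tile land in different banks instead of repeatedly hitting the same ones.
#include <cuda_fp16.h>

#define BM    128   // illustrative block tile height
#define BK    16    // illustrative block tile depth (in halves)
#define A_PAD 8     // extra halves per row (16 bytes) to shift the bank mapping

__global__ void load_a_tile_padded_sketch(const half* A, int lda) {
  // Without "+ A_PAD", consecutive rows of s_a start on the same small set of
  // banks, so column-wise reads during the compute stage conflict; the pad
  // changes the row stride and spreads rows across all 32 banks.
  __shared__ half s_a[BM][BK + A_PAD];

  const int row = threadIdx.y;   // illustrative thread mapping only
  const int col = threadIdx.x;
  if (row < BM && col < BK) {
    s_a[row][col] = A[(blockIdx.y * BM + row) * lda + col];
  }
  __syncthreads();
  // ... the Tensor Core compute stage would read s_a here. The TN path instead
  // keeps rows unpadded and XOR-swizzles the shared-memory offsets
  // (CUTLASS/CuTe style) to eliminate conflicts entirely.
}
```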
### NVIDIA GeForce RTX 4090

On the NVIDIA RTX 4090 (FP16 Tensor Core peak of 330 TFLOPS), the WMMA (m16n16k16) implementation performs better than the MMA (m16n8k16) implementation. For most MNK configurations, this repository's implementation achieves 95%~99% of cuBLAS performance, and in certain cases it surpasses cuBLAS. Specifically (a dispatch sketch follows the list):
- For large-scale matrix multiplications (MNK >= 8192), the WMMA implementation performs better.
- For small-scale matrix multiplications, the MMA implementation is more efficient.
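One possible way to act on this observation is a small dispatch helper like the hypothetical sketch below; `launch_hgemm_wmma` / `launch_hgemm_mma` are placeholder names, not this repository's actual API.

```cuda
// Hypothetical dispatch sketch: prefer the WMMA path for large problems and the
// MMA path for small ones, following the RTX 4090 observation above.
#include <cuda_fp16.h>

// Placeholder launchers standing in for whichever kernels are being compared.
void launch_hgemm_wmma(const half* A, const half* B, half* C, int M, int N, int K);
void launch_hgemm_mma (const half* A, const half* B, half* C, int M, int N, int K);

void hgemm_dispatch(const half* A, const half* B, half* C, int M, int N, int K) {
  // Treat "large-scale" as every dimension >= 8192, matching the note above.
  const bool large = (M >= 8192) && (N >= 8192) && (K >= 8192);
  if (large) {
    launch_hgemm_wmma(A, B, C, M, N, K);   // m16n16k16 WMMA path
  } else {
    launch_hgemm_mma(A, B, C, M, N, K);    // m16n8k16 MMA PTX path
  }
}
```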
### NVIDIA GeForce RTX 3080 Laptop

Testing was conducted on an NVIDIA GeForce RTX 3080 Laptop (Windows WSL2) using the mma4x4_warp4x4 configuration (16 WMMA m16n16k16 operations per warp, with a warp tile size of 64x64) together with Thread block swizzle. In most cases, this setup matches or even exceeds cuBLAS performance.
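A minimal sketch of the thread block swizzle idea (grouped tile ordering so that concurrently resident blocks reuse the same slices of A and B from L2) is shown below; `GROUP_M` and the exact mapping are illustrative assumptions, not necessarily what this repo implements.

```cuda
// Sketch: remap the launched (blockIdx.x, blockIdx.y) into a grouped walk order.
// Blocks are consumed strip by strip instead of row by row across the full C,
// which improves L2 reuse of the A/B tiles they share.
#define GROUP_M 8  // illustrative: how many block-rows form one strip

__device__ __forceinline__ void swizzled_block_id(int& block_m, int& block_n) {
  const int grid_m = gridDim.y;                          // block-rows of C tiles
  const int grid_n = gridDim.x;                          // block-cols of C tiles
  const int pid    = blockIdx.y * grid_n + blockIdx.x;   // launch-order linear id
  const int width  = GROUP_M * grid_n;                   // blocks per strip
  const int group  = pid / width;
  const int group_size = min(grid_m - group * GROUP_M, GROUP_M);
  block_m = group * GROUP_M + (pid % group_size);        // walk down the strip...
  block_n = (pid % width) / group_size;                  // ...before moving right
}
```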