
Commit edd8012

[HGEMM] Add HGEMM L20/4090 benchmark figures (#124)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
1 parent 353c947 commit edd8012

2 files changed: 29 additions, 19 deletions


README.md

Lines changed: 5 additions & 1 deletion
@@ -11,7 +11,9 @@

 🎉 **Modern CUDA Learn Notes with PyTorch** for **Beginners**: **fp32/tf32, fp16/bf16, fp8/int8, Tensor/CUDA Cores**, flash_attn, rope, embedding, sgemm, sgemv, hgemm, hgemv, warp/block reduce, dot prod, elementwise, sigmoid, relu, gelu, softmax, layernorm, rmsnorm, hist and some CUDA optimization techniques (pack LDST, cp.async, warp gemv, sliced_k/split_k/pipeline gemm, bank conflicts reduce, WMMA/MMA, block/warp swizzle, etc).

+<!---
 <img width="1438" alt="image" src="https://github.com/user-attachments/assets/0c5e5125-586f-43fa-8e8b-e2c61c1afbbe">
+--->

 ### 📖 HGEMM/SGEMM Supported Matrix

@@ -29,7 +31,9 @@

 Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA and MMA)` implemented in this repo can achieve approximately `95%~98%` of its performance. Please check [hgemm benchmark](./hgemm) for more details.

-![](./hgemm/NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
+![L20](https://github.com/user-attachments/assets/a0039200-cd9e-4ae6-be13-422fff75dd2b)
+
+![4090](https://github.com/user-attachments/assets/c7d65fe5-9fb9-49a8-b962-a6c09bcc030a)

 ## 📖 CUDA Kernel Index (Common Interview Topics)
 - / = not supported now.
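A minimal sketch of the cuBLAS baseline named above (`CUBLAS_GEMM_DEFAULT_TENSOR_OP`) called through `cublasGemmEx`. This is not the repo's benchmark code: the function name and the `dA`/`dB`/`dC` device pointers are hypothetical, and handle creation, allocation and error checking are omitted.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical helper: FP16 GEMM (row-major C = A * B) routed through Tensor
// Cores with cuBLAS's default Tensor Core algorithm. cuBLAS is column-major,
// so the row-major product is computed as C^T = B^T * A^T by swapping the
// operand order and dimensions.
void cublas_hgemm_baseline(cublasHandle_t handle,
                           const half* dA, const half* dB, half* dC,
                           int M, int N, int K) {
  const half alpha = __float2half(1.0f);
  const half beta  = __float2half(0.0f);
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               N, M, K,
               &alpha,
               dB, CUDA_R_16F, N,   // B: KxN row-major viewed as NxK column-major
               dA, CUDA_R_16F, K,   // A: MxK row-major viewed as KxM column-major
               &beta,
               dC, CUDA_R_16F, N,   // C: MxN row-major viewed as NxM column-major
               CUBLAS_COMPUTE_16F,
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

The `(cublas)` rows in the benchmark logs further down measure this cuBLAS Tensor Core path.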

hgemm/README.md

Lines changed: 24 additions & 18 deletions
@@ -66,30 +66,14 @@ python3 hgemm.py --mma-all --plot --topk 8

 ## Current Performance

-### NVIDIA GeForce RTX 3080 Laptop
-
-Tested on an NVIDIA GeForce RTX 3080 Laptop GPU under Windows WSL2: using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) together with thread block swizzle, most cases match or even beat cuBLAS.
-
-![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
-
-```bash
-python3 hgemm.py --wmma-all
-----------------------------------------------------------------------------------------------------------------------------------
-M=16384, N=16384, K=8192, Warmup=5, Iters=20, 27/27
-----------------------------------------------------------------------------------------------------------------------------------
-(wmma4x4+warp4x4+stage3+dsmem): ['68.375 ', '-2.234375 '], time:96.91984ms, swizzle: NOOP, TFLOPS: 45.38 (+0.00%)
-(wmma4x4+warp4x4+stage2+dsmem): ['68.375 ', '-2.234375 '], time:102.8722ms, swizzle: NOOP, TFLOPS: 42.75
-(wmma4x4+warp4x4+stage3+dsmem+swizzle): ['68.375 ', '-2.234375 '], time:85.65800ms, swizzle: 4096, TFLOPS: 51.34 (+13.15%)
-(wmma4x4+warp4x4+stage2+dsmem+swizzle): ['68.375 ', '-2.234375 '], time:95.70884ms, swizzle: 4096, TFLOPS: 45.95
-(cublas): ['68.375 ', '-2.234375 '], time:104.2092ms, swizzle: NOOP, TFLOPS: 42.20
-----------------------------------------------------------------------------------------------------------------------------------
-```
 ### NVIDIA L20

 With the current best implementations, on the L20 (theoretical FP16 Tensor Core throughput: 119.5 TFLOPS) the WMMA API reaches roughly 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS), and the MMA API reaches 115 TFLOPS, beating cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated: the current padding-based mitigation wastes shared memory and also hurts SM occupancy, and manual smem swizzle/permute is not implemented yet (constrained by the flexibility of the WMMA API and the row-major layout). A later change will try to implement smem swizzle/permute via MMA PTX.

 <div id="NV-L20"></div>

+![L20](https://github.com/user-attachments/assets/a0039200-cd9e-4ae6-be13-422fff75dd2b)
+
 - WMMA: Up to 113.83 TFLOPS, 113.83/119.5=95.25% TFLOPS utilization, 113.83/116.25=97.91% cuBLAS performance.
 - MMA: Up to 115.12 TFLOPS, 115.12/119.5=96.33% TFLOPS utilization, 115.12/116.25=99.03% cuBLAS performance.
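A minimal sketch of the padding mitigation mentioned in the L20 notes above, not taken from this repo's kernels: a few extra columns in the shared-memory tile skew successive rows across banks so that column-wise half-precision accesses stop repeating the same bank. `BM`, `BK` and `PAD` are assumed values for illustration.

```cuda
#include <cuda_fp16.h>

#define BM  128   // tile rows per block (assumed)
#define BK  16    // tile depth per stage (assumed)
#define PAD 8     // extra half elements per row; costs BM*PAD*2 bytes of smem

// Illustrative kernel fragment, not one of this repo's HGEMM kernels.
__global__ void smem_padding_demo(const half* __restrict__ A, int lda) {
  // Without PAD each row is BK*2 = 32 bytes, so column-wise accesses from
  // different rows revisit the same 4-byte bank every few rows; the PAD halves
  // shift successive rows onto different banks, at the price of wasted shared
  // memory (which can also lower SM occupancy, as noted above).
  __shared__ half s_a[BM][BK + PAD];

  int row = threadIdx.y;   // simplified indexing: one element per thread
  int col = threadIdx.x;
  if (row < BM && col < BK) {
    s_a[row][col] = A[(blockIdx.y * BM + row) * lda + col];
  }
  __syncthreads();
  // ... the compute stage would read s_a column-wise here ...
}
```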

@@ -120,6 +104,9 @@ python3 hgemm.py --mma-all --wmma-all --cuda-all

 ### NVIDIA GeForce RTX 4090
 On the NVIDIA RTX 4090 (FP16 Tensor Core throughput: 330 TFLOPS), WMMA (m16n16k16) performs better than MMA (m16n8k16). For most MNK shapes, the implementations in this repo reach 95%~99% of cuBLAS performance, and some cases beat cuBLAS. As far as this repo's implementations are concerned, on the RTX 4090 WMMA is the better choice for large GEMMs (MNK >= 8192) and MMA is the better choice for small GEMMs.
+
+![4090](https://github.com/user-attachments/assets/c7d65fe5-9fb9-49a8-b962-a6c09bcc030a)
+
 ```bash
 ----------------------------------------------------------------------------------------------------------------------------------
 M=16384, N=16384, K=8192, Warmup=2, Iters=10, 1/1
@@ -165,6 +152,25 @@ python3 hgemm.py --mma-all --wmma-all --cuda-all
 ----------------------------------------------------------------------------------------------------------------------------------
 ```
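A minimal illustration of the WMMA m16n16k16 shape discussed in the RTX 4090 notes above, not one of this repo's tuned kernels: one warp computes a single 16x16 tile of C with no shared memory, warp tiling or swizzle. It assumes row-major A/B/C with M, N, K divisible by 16, a Tensor Core GPU (sm_70+), and a launch of `dim3 grid(N/16, M/16), block(32)`.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_m16n16k16_naive(const half* A, const half* B, half* C,
                                     int M, int N, int K) {
  int tile_m = blockIdx.y * 16;   // top row of this warp's C tile
  int tile_n = blockIdx.x * 16;   // left column of this warp's C tile

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // March along K in steps of 16, accumulating into the 16x16 C fragment.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);   // 16x16 slice of A
    wmma::load_matrix_sync(b_frag, B + k * N + tile_n, N);   // 16x16 slice of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);          // c += a * b
  }
  wmma::store_matrix_sync(C + tile_m * N + tile_n, c_frag, N, wmma::mem_row_major);
}
```

The repo's kernels build on this primitive with warp tiling (e.g. the 64x64 warp tile), multi-stage cp.async pipelines and block swizzle, per the notes above.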

+### NVIDIA GeForce RTX 3080 Laptop
+
+Tested on an NVIDIA GeForce RTX 3080 Laptop GPU under Windows WSL2: using mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) together with thread block swizzle, most cases match or even beat cuBLAS.
+
+![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
+
+```bash
+python3 hgemm.py --wmma-all
+----------------------------------------------------------------------------------------------------------------------------------
+M=16384, N=16384, K=8192, Warmup=5, Iters=20, 27/27
+----------------------------------------------------------------------------------------------------------------------------------
+(wmma4x4+warp4x4+stage3+dsmem): ['68.375 ', '-2.234375 '], time:96.91984ms, swizzle: NOOP, TFLOPS: 45.38 (+0.00%)
+(wmma4x4+warp4x4+stage2+dsmem): ['68.375 ', '-2.234375 '], time:102.8722ms, swizzle: NOOP, TFLOPS: 42.75
+(wmma4x4+warp4x4+stage3+dsmem+swizzle): ['68.375 ', '-2.234375 '], time:85.65800ms, swizzle: 4096, TFLOPS: 51.34 (+13.15%)
+(wmma4x4+warp4x4+stage2+dsmem+swizzle): ['68.375 ', '-2.234375 '], time:95.70884ms, swizzle: 4096, TFLOPS: 45.95
+(cublas): ['68.375 ', '-2.234375 '], time:104.2092ms, swizzle: NOOP, TFLOPS: 42.20
+----------------------------------------------------------------------------------------------------------------------------------
+```
+
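A minimal sketch of the thread block swizzle mentioned above (and reflected in the `swizzle: 4096` column of the logs), with hypothetical names and group size; it is not this repo's implementation. Blocks that are adjacent in launch order are grouped onto a few rows of C tiles so they reuse the same A/B tiles through L2.

```cuda
// Hypothetical helper, for illustration only: map a linear block id onto
// (tile_m, tile_n) so that consecutive block ids stay within a group of
// `group_m` tile rows, improving L2 reuse of A and B.
__device__ void swizzled_tile_coords(int grid_m, int grid_n, int group_m,
                                     int* tile_m, int* tile_n) {
  int bid = blockIdx.y * gridDim.x + blockIdx.x;   // linear launch-order id
  int blocks_per_group = group_m * grid_n;         // blocks covering one row group
  int group = bid / blocks_per_group;              // which row group of C
  int first_m = group * group_m;                   // first tile row in the group
  int rows = min(group_m, grid_m - first_m);       // last group may be shorter
  int in_group = bid % blocks_per_group;
  *tile_m = first_m + in_group % rows;             // walk rows within the group...
  *tile_n = in_group / rows;                       // ...then advance column-wise
}
```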

 ## Performance Optimization Notes

