@@ -66,30 +66,14 @@ python3 hgemm.py --mma-all --plot --topk 8
## Current Performance
- ### NVIDIA GeForce RTX 3080 Laptop
-
- Tested under Windows WSL2 on an NVIDIA GeForce RTX 3080 Laptop: with mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) plus thread block swizzle, most cases match or even beat cuBLAS.
-
- ![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
-
- ```bash
- python3 hgemm.py --wmma-all
- ----------------------------------------------------------------------------------------------------------------------------------
- M=16384, N=16384, K=8192, Warmup=5, Iters=20, 27/27
- ----------------------------------------------------------------------------------------------------------------------------------
- (wmma4x4+warp4x4+stage3+dsmem): ['68.375', '-2.234375'], time:96.91984ms, swizzle: NOOP, TFLOPS: 45.38 (+0.00%)
- (wmma4x4+warp4x4+stage2+dsmem): ['68.375', '-2.234375'], time:102.8722ms, swizzle: NOOP, TFLOPS: 42.75
- (wmma4x4+warp4x4+stage3+dsmem+swizzle): ['68.375', '-2.234375'], time:85.65800ms, swizzle: 4096, TFLOPS: 51.34 (+13.15%)
- (wmma4x4+warp4x4+stage2+dsmem+swizzle): ['68.375', '-2.234375'], time:95.70884ms, swizzle: 4096, TFLOPS: 45.95
- (cublas): ['68.375', '-2.234375'], time:104.2092ms, swizzle: NOOP, TFLOPS: 42.20
- ----------------------------------------------------------------------------------------------------------------------------------
- ```
### NVIDIA L20
With the current best implementation on the L20 (theoretical Tensor Core FP16 throughput: 119.5 TFLOPS), the WMMA API reaches roughly 95%~98% of cuBLAS performance (105-113 TFLOPS vs. 105-115 TFLOPS), and the MMA API reaches 115 TFLOPS, exceeding cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated: the current mitigation, padding, wastes shared memory and also hurts SM occupancy. Manual smem swizzle/permute is not implemented yet (constrained by the flexibility of the WMMA API and the row-major layout); a later revision will attempt smem swizzle/permute via MMA PTX.
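
To make the padding-vs-swizzle tradeoff concrete, here is a minimal CUDA sketch (not this repository's kernel code: the 64x64 half tile, the XOR bit pattern, and the `smem_layout_sketch` kernel are illustrative assumptions):

```cuda
#include <cuda_fp16.h>

// XOR-swizzled column index: permute which bank a (row, col) element
// lands in so a warp's accesses spread across banks with zero extra
// storage. The pattern below permutes 8-half (16-byte) groups to match
// 128-bit vectorized accesses on a 64-wide tile; it is an assumed
// layout, not this repo's final one.
__device__ __forceinline__ int swizzle_col(int row, int col) {
    return col ^ ((row & 0x7) << 3);
}

__global__ void smem_layout_sketch(const half* __restrict__ a) {
    // Option 1: padding. 8 extra halves per row shift successive rows
    // onto different banks, but cost 8/64 = 12.5% more shared memory,
    // which is the occupancy penalty mentioned above.
    __shared__ half s_pad[64][64 + 8];

    // Option 2: swizzle. Same logical 64x64 tile, no wasted smem; the
    // writer and every reader must apply the same index mapping.
    __shared__ half s_swz[64][64];

    int row = threadIdx.y;  // toy indexing for illustration
    int col = threadIdx.x;
    s_pad[row][col] = a[row * 64 + col];
    s_swz[row][swizzle_col(row, col)] = a[row * 64 + col];
    __syncthreads();
    // ... consume the tile through the same mappings ...
}
```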
<div id="NV-L20"></div>
+ ![L20](https://github.com/user-attachments/assets/a0039200-cd9e-4ae6-be13-422fff75dd2b)
+
- WMMA: Up to 113.83 TFLOPS, 113.83/119.5=95.25% TFLOPS utilization, 113.83/116.25=97.91% cuBLAS performance.
- MMA: Up to 115.12 TFLOPS, 115.12/119.5=96.33% TFLOPS utilization, 115.12/116.25=99.03% cuBLAS performance.
@@ -120,6 +104,9 @@ python3 hgemm.py --mma-all --wmma-all --cuda-all
### NVIDIA GeForce RTX 4090
On the NVIDIA RTX 4090 (FP16 Tensor Core throughput: 330 TFLOPS), WMMA (m16n16k16) performs better than MMA (m16n8k16). For most MNK shapes, this repository's implementation reaches 95%~99% of cuBLAS performance, and some cases exceed cuBLAS. As far as this repository's implementation is concerned, on the RTX 4090 WMMA wins on large GEMMs (MNK >= 8192) and MMA wins on small ones.
+
108
+ ![4090](https://github.com/user-attachments/assets/c7d65fe5-9fb9-49a8-b962-a6c09bcc030a)
+
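
A minimal host-side sketch of that dispatch heuristic (the launcher signatures and the min(M, N, K) reading of "MNK >= 8192" are assumptions for illustration, not this repository's API):

```cuda
#include <algorithm>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical launchers standing in for whatever WMMA (m16n16k16) and
// MMA (m16n8k16) kernel entry points an implementation exposes.
void launch_wmma_hgemm(const half* a, const half* b, half* c,
                       int M, int N, int K, cudaStream_t stream);
void launch_mma_hgemm(const half* a, const half* b, half* c,
                      int M, int N, int K, cudaStream_t stream);

// Mirror the RTX 4090 observation above: WMMA for large GEMMs, MMA for
// small ones. The cutoff is an assumed value and should be tuned per GPU.
void dispatch_hgemm(const half* a, const half* b, half* c,
                    int M, int N, int K, cudaStream_t stream) {
    if (std::min({M, N, K}) >= 8192) {
        launch_wmma_hgemm(a, b, c, M, N, K, stream);
    } else {
        launch_mma_hgemm(a, b, c, M, N, K, stream);
    }
}
```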
```bash
----------------------------------------------------------------------------------------------------------------------------------
M=16384, N=16384, K=8192, Warmup=2, Iters=10, 1/1
@@ -165,6 +152,25 @@ python3 hgemm.py --mma-all --wmma-all --cuda-all
----------------------------------------------------------------------------------------------------------------------------------
```
+ ### NVIDIA GeForce RTX 3080 Laptop
+
+ Tested under Windows WSL2 on an NVIDIA GeForce RTX 3080 Laptop: with mma4x4_warp4x4 (16 WMMA m16n16k16 ops, warp tile 64x64) plus thread block swizzle, most cases match or even beat cuBLAS.
+
+ ![](./NVIDIA_GeForce_RTX_3080_Laptop_GPU_WSL2.png)
+
+ ```bash
+ python3 hgemm.py --wmma-all
+ ----------------------------------------------------------------------------------------------------------------------------------
+ M=16384, N=16384, K=8192, Warmup=5, Iters=20, 27/27
+ ----------------------------------------------------------------------------------------------------------------------------------
+ (wmma4x4+warp4x4+stage3+dsmem): ['68.375', '-2.234375'], time:96.91984ms, swizzle: NOOP, TFLOPS: 45.38 (+0.00%)
+ (wmma4x4+warp4x4+stage2+dsmem): ['68.375', '-2.234375'], time:102.8722ms, swizzle: NOOP, TFLOPS: 42.75
+ (wmma4x4+warp4x4+stage3+dsmem+swizzle): ['68.375', '-2.234375'], time:85.65800ms, swizzle: 4096, TFLOPS: 51.34 (+13.15%)
+ (wmma4x4+warp4x4+stage2+dsmem+swizzle): ['68.375', '-2.234375'], time:95.70884ms, swizzle: 4096, TFLOPS: 45.95
+ (cublas): ['68.375', '-2.234375'], time:104.2092ms, swizzle: NOOP, TFLOPS: 42.20
+ ----------------------------------------------------------------------------------------------------------------------------------
+ ```
+
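
For reference, a thread block swizzle remaps the linear block index so that blocks resident on the GPU at the same time cover nearby C tiles and therefore reuse more A/B data in L2. A minimal sketch (the chunk width and the `block_swizzle` helper are illustrative assumptions, not this repository's exact mapping):

```cuda
// Walk the C tile grid in narrow column chunks instead of full rows, so
// co-resident thread blocks touch overlapping slices of A and B in L2.
// Assumes gridDim.x == grid_m * grid_n with grid_n a multiple of CHUNK_N;
// CHUNK_N = 16 is an assumed value, not the repo's tuned choice.
#define CHUNK_N 16

__device__ __forceinline__ void block_swizzle(int grid_m,
                                              int* tile_m, int* tile_n) {
    int bid = blockIdx.x;
    int per_chunk = grid_m * CHUNK_N;   // blocks in one column chunk
    int chunk = bid / per_chunk;        // which chunk of N this block is in
    int in_chunk = bid % per_chunk;
    *tile_m = in_chunk / CHUNK_N;       // row-major order inside the chunk
    *tile_n = chunk * CHUNK_N + in_chunk % CHUNK_N;
}
```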
## Performance Optimization Notes