Commit 245dcff

[HGEMM] Update HGEMM Supported Matrix (#116)
* Update hgemm_mma_stage.cu
* Update README.md
* Update README.md
* Update README.md
1 parent df9d781 commit 245dcff

2 files changed: +56 -12 lines changed


hgemm/README.md

Lines changed: 48 additions & 10 deletions
@@ -4,15 +4,15 @@

|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
-|||||
-|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST**|**SMEM Padding**|
-|||||
+|✔️|✔️|✔️|✔️|
+|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST(128 bits)**|**SMEM Padding**|
+|✔️|✔️|✔️|✔️|
|**Copy Async**|**Tile MMA(More Threads)**|**Tile Warp(More Values)**|**Multi Stages**|
-|||||
-|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Shuffle)**|
-|||||
+|✔️|✔️|✔️|✔️|
+|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shuffle)**|
+|✔️|✔️|✔️|✔️|
|**Row Major(NN)**|**Col Major(TN)**|**SMEM Swizzle**|...|
-||||...|
+|✔️|✔️||...|

<details>
<summary> 🔑️ Click to view all supported HGEMM kernels! </summary>
@@ -167,8 +167,15 @@ python3 hgemm.py --M 4096 --N 4096 --K 4096 --mma-all --wmma-all --cuda-all

### PyTorch HGEMM Profile

-On the Ada architecture, when PyTorch 2.4 runs an FP16 matmul it calls the ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nn kernel, which internally uses HMMA (Tensor Cores) for the computation; profiling on an RTX 3080 shows it uses sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize96x64x32_stage3_warpsize2x2x1_tensor16x8x16_kernel. Therefore, only an HGEMM implementation that uses Tensor Cores can hope to approach PyTorch/cuBLAS performance.
-
+On the Ada architecture, when PyTorch 2.4 runs an FP16 matmul it calls:
+```C++
+ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nn_kernel
+```
+which internally uses HMMA (Tensor Cores) for the computation. Profiling on an RTX 3080 shows it uses:
+```C++
+sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize96x64x32_stage3_warpsize2x2x1_tensor16x8x16_kernel
+```
+Therefore, only an HGEMM implementation that uses Tensor Cores can hope to approach PyTorch/cuBLAS performance.
```bash
ncu -o hgemm.prof -f python3 prof.py
nsys profile --stats=true -t cuda,osrt,nvtx -o hgemm.prof --force-overwrite true python3 prof.py
@@ -183,8 +190,10 @@ nsys profile --stats=true -t cuda,osrt,nvtx -o hgemm.prof --force-overwrite true
...
```

-### Shared Memory Bank Conflicts
+### SMEM Padding

+#### How Bank Conflicts Arise
+
Definition: when accessing shared memory, if multiple threads read or write different data addresses that fall into the same bank, the concurrent shared-memory accesses degrade into serialized accesses; this phenomenon is called a bank conflict.

![](https://github.com/PaddleJitLab/CUDATutorial/blob/develop/docs/09_optimize_reduce/02_bank_conflict/images/ef322be7c3e5b6b9be69d2b90e88083f50569a58a97129f348e483b946ab4edf.png)
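
Editorial aside: to make the padding fix described in this hunk concrete, here is a minimal generic sketch. The float element type, the 32x32 tile and the one-element pad are illustrative choices only; the repo's HGEMM kernels pad their half-typed tiles in the same spirit.

```C++
#include <cuda_runtime.h>

// Launch with one warp: column_read_padded<<<1, 32>>>(d_in, d_out);
__global__ void column_read_padded(const float* in, float* out) {
  // Shared memory has 32 banks of 4 bytes. In a 32x32 float tile, element
  // (r, c) falls into bank (r*32 + c) % 32 == c, so 32 threads reading one
  // column (same c, different r) all hit the same bank: a 32-way conflict.
  //   __shared__ float tile[32][32];
  // One float of padding per row changes the bank to (r*33 + c) % 32 ==
  // (r + c) % 32, so the 32 rows of a column land in 32 different banks.
  __shared__ float tile[32][32 + 1];

  int t = threadIdx.x;
  for (int r = 0; r < 32; ++r)     // fill: at each step the warp writes row r
    tile[r][t] = in[r * 32 + t];   // consecutive banks, conflict-free
  __syncthreads();

  float acc = 0.0f;
  for (int c = 0; c < 32; ++c)     // read: at each step the warp reads column c
    acc += tile[t][c];             // conflict-free only thanks to the padding
  out[t] = acc;
}
```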
@@ -206,6 +215,18 @@ cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

The Double Buffers strategy implemented by this repo's HGEMM is as follows: 1) the main loop starts at bk = 1; the first data load happens before the main loop and the last computation happens after it, which is dictated by the pipeline structure; 2) since the computation and the next load use different shared-memory buffers, each iteration of the main loop needs only one __syncthreads(); compared with the non-double-buffered version this saves a total of ((K + BK - 1) / BK) - 1 block-level synchronizations. For example, when bk = 1 the HFMA computation uses s_a[0] and s_b[0], so it has no dependency on the loads into s_a[1] and s_b[1]; the loads from global memory into s_a[1] and s_b[1] can therefore run in parallel with the HFMA computation, with s_a[1] and s_b[1] holding the data needed for the next BK tile; 3) because a GPU cannot execute out of order the way a CPU can, the main loop first loads the data needed by the next iteration from global memory into registers, then performs the current computation, and only afterwards writes the register data into shared memory; this way the LDG loads from global memory do not block the issue of the subsequent HFMA and other compute instructions, which is exactly the goal of double buffering. See [hgemm.cu](./hgemm.cu) for the code.

+### Tile Block
+
+TODO
+
+### Tile Thread
+
+TODO
+
+### Pack LDST 128 bits
+
+TODO
+
### Async Copy

TODO
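
Editorial aside: for readers following the double-buffering description in the hunk above, a deliberately small sketch of the same pipeline is shown below. The 16x16 tiles, the one-thread-per-output mapping and the assumption that M, N and K are multiples of 16 are simplifications for illustration; this is not the optimized code in hgemm.cu.

```C++
#include <cuda_fp16.h>

#define BK 16

// First tile loaded before the loop, one __syncthreads() per iteration, next
// tile staged global -> registers -> shared while the current tile is computed.
// Launch: dim3 block(16, 16); dim3 grid(N / 16, M / 16);
__global__ void hgemm_double_buffer_sketch(const half* A, const half* B,
                                           half* C, int M, int N, int K) {
  __shared__ half s_a[2][16][BK];   // two buffers: compute one, fill the other
  __shared__ half s_b[2][BK][16];

  int tx = threadIdx.x, ty = threadIdx.y;
  int row = blockIdx.y * 16 + ty;
  int col = blockIdx.x * 16 + tx;

  // Prologue: the first K-tile (bk = 0) is loaded before the main loop.
  s_a[0][ty][tx] = A[row * K + tx];
  s_b[0][ty][tx] = B[ty * N + col];
  __syncthreads();

  float acc = 0.0f;                 // accumulate in float for simplicity
  int num_tiles = (K + BK - 1) / BK;
  for (int bk = 1; bk < num_tiles; ++bk) {
    // Stage the *next* tile in registers first, so the LDG latency overlaps
    // with the FMAs below instead of stalling them.
    half ra = A[row * K + bk * BK + tx];
    half rb = B[(bk * BK + ty) * N + col];

    // Compute on the buffer filled in the previous iteration; it does not
    // depend on the registers being filled above.
    int read_idx = (bk - 1) & 1;
#pragma unroll
    for (int k = 0; k < BK; ++k)
      acc += __half2float(s_a[read_idx][ty][k]) *
             __half2float(s_b[read_idx][k][tx]);

    // Only now spill the staged registers into the other buffer; one sync per
    // iteration is enough because compute and store touch different buffers.
    int write_idx = bk & 1;
    s_a[write_idx][ty][tx] = ra;
    s_b[write_idx][ty][tx] = rb;
    __syncthreads();
  }

  // Epilogue: the last tile was stored inside the loop but not yet consumed.
  int read_idx = (num_tiles - 1) & 1;
#pragma unroll
  for (int k = 0; k < BK; ++k)
    acc += __half2float(s_a[read_idx][ty][k]) *
           __half2float(s_b[read_idx][k][tx]);

  C[row * N + col] = __float2half(acc);
}
```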
@@ -214,16 +235,33 @@ TODO

TODO

+### Tensor Cores(WMMA/MMA)
+
+TODO
+
+### Tile MMA/Warp
+
+TODO
+
### Thread Block Swizzle

TODO

### Warp Swizzle

+TODO
+
### Reg Double Buffers

TODO

+### Collective Store(Reg Reuse&Warp Shuffle)
+
+TODO
+
+### SMEM Swizzle/Permuted
+
+TODO

## References

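Editorial aside: as background for the new `Tensor Cores(WMMA/MMA)` entries and the profiling note above (only a Tensor Core HGEMM can approach cuBLAS), the smallest Tensor Core building block is a single-warp WMMA m16n16k16 tile. The sketch below is illustrative only (fixed 16x16 row-major matrices, one warp, half accumulation); it is not one of this repo's kernels.

```C++
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile: C = A * B with half inputs and outputs.
// Launch as wmma_hgemm_tile<<<1, 32>>>(dA, dB, dC); A, B, C are 16x16 row-major.
__global__ void wmma_hgemm_tile(const half* A, const half* B, half* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

  wmma::fill_fragment(c_frag, __float2half(0.0f));
  wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension = 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // compiles to HMMA
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```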

hgemm/hgemm_mma_stage.cu

Lines changed: 8 additions & 2 deletions
@@ -1912,13 +1912,19 @@ hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_rr_kernel(
  }
}

-// TODO: smem swizzle per 8x8 submatrix(not per line, 8 half values per line)
+// TODO: smem swizzle per 4x8 submatrix(not per line, 8 half values per line)
+// In CUTLASS, each group of four threads is assigned a specific address in
+// shared memory. This approach allows avoiding conflicts when reading from
+// and writing to shared memory without increasing the shared memory usage.
+// https://developer.download.nvidia.cn/video/gputechconf/gtc/2019/presentation
+// /s9593-cutensor-high-performance-tensor-operations-in-cuda-v2.pdf
// A matrix smem, MMA_MxMMA_K=16x16; B matrix smem, MMA_KxMMA_N=16x8; PTX layout.
// reference:
// https://zhuanlan.zhihu.com/p/638522893
// https://zhuanlan.zhihu.com/p/696231622
// https://www.zhihu.com/question/600927104/answer/3029266372
-// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-load-instruction-ldmatrix
+// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
+// #warp-level-matrix-load-instruction-ldmatrix


// --------------------- PyTorch bindings for custom kernel -----------------------
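
Editorial aside: the CUTLASS-style swizzle referenced in the new comment amounts to XOR-permuting where each 128-bit chunk of a tile row lands in shared memory, so bank conflicts are avoided without padding or extra shared memory. The sketch below uses an assumed 8x64 half tile, 8-half chunks and an arbitrary one-warp thread mapping; it is only a rough illustration, not the layout this kernel will eventually use.

```C++
#include <cuda_fp16.h>

// A row of 64 halfs is 128 bytes, i.e. all 32 banks, split into eight 8-half
// (16-byte) chunks. Unswizzled, chunk c of every row occupies the same four
// banks, so threads touching chunk c across different rows conflict. XOR-ing
// the chunk index with (row & 7) permutes the chunks per row, so the same
// logical chunk column maps to different bank groups in different rows.
__device__ __forceinline__ int swizzled_offset(int row, int chunk) {
  return row * 64 + (chunk ^ (row & 7)) * 8;   // offset in half elements
}

// Launch with one warp; gmem must be 16-byte aligned and hold >= 8*64 halfs.
__global__ void swizzle_sketch(const half* __restrict__ gmem,
                               float* __restrict__ out) {
  __shared__ __align__(16) half s_tile[8 * 64];   // 8x64 halfs, no padding

  int lane  = threadIdx.x & 31;
  int row   = lane & 7;        // 8 rows
  int chunk = lane >> 3;       // this single pass covers chunks 0..3

  // 128-bit store into the swizzled slot: threads that share a chunk index
  // but differ in row land in different bank groups instead of colliding.
  *reinterpret_cast<float4*>(&s_tile[swizzled_offset(row, chunk)]) =
      *reinterpret_cast<const float4*>(&gmem[row * 64 + chunk * 8]);
  __syncthreads();

  // Consumers must read back through the same mapping.
  float acc = 0.0f;
  for (int k = 0; k < 8; ++k)
    acc += __half2float(s_tile[swizzled_offset(row, chunk) + k]);
  out[lane] = acc;
}
```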
