Commit c78c247

[HGEMM] Update HGEMM/SGEMM Supported Matrix (#117)
* Update hgemm_mma_stage.cu
* Update README.md
1 parent 245dcff commit c78c247

3 files changed (+24 -9 lines)


README.md

Lines changed: 4 additions & 4 deletions
@@ -18,13 +18,13 @@
 |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|
 |✔️|✔️|✔️|✔️|
-|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST**|**SMEM Padding**|
-|✔️|✔️|✔️|✔️|
+|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST(128 bits)**|**SMEM Padding**|
+|✔️|✔️|✔️|✔️|
 |**Copy Async**|**Tile MMA(More Threads)**|**Tile Warp(More Values)**|**Multi Stages**|
 |✔️|✔️|✔️|✔️|
-|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Shuffle)**|
+|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shfl)**|
 |✔️|✔️|✔️|✔️|
-|**Row Major(NN)**|**Col Major(TN)**|**SGEMM TF32**|**SMEM Swizzle**|
+|**Row Major(NN)**|**Col Major(TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
 |✔️|✔️|✔️||

 ## 📖 CUDA Kernel Catalog (Common Interview Questions)

hgemm/README.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
 # HGEMM

-## HGEMM Supported Matrix
+## HGEMM/SGEMM Supported Matrix

 |CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
 |:---:|:---:|:---:|:---:|
@@ -9,10 +9,10 @@
 |✔️|✔️|✔️|✔️|
 |**Copy Async**|**Tile MMA(More Threads)**|**Tile Warp(More Values)**|**Multi Stages**|
 |✔️|✔️|✔️|✔️|
-|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shuffle)**|
+|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shfl)**|
 |✔️|✔️|✔️|✔️|
-|**Row Major(NN)**|**Col Major(TN)**|**SMEM Swizzle**|...|
-|✔️|✔️||...|
+|**Row Major(NN)**|**Col Major(TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
+|✔️|✔️|✔️||

 <details>
 <summary> 🔑️ Click to view all supported HGEMM Kernels! </summary>

sgemm/README.md

Lines changed: 16 additions & 1 deletion
@@ -1,5 +1,20 @@
 # SGEMM

+## HGEMM/SGEMM Supported Matrix
+
+|CUDA Cores|Sliced K(Loop over K)|Tile Block|Tile Thread|
+|:---:|:---:|:---:|:---:|
+|✔️|✔️|✔️|✔️|
+|**WMMA(m16n16k16)**|**MMA(m16n8k16)**|**Pack LDST(128 bits)**|**SMEM Padding**|
+|✔️|✔️|✔️|✔️|
+|**Copy Async**|**Tile MMA(More Threads)**|**Tile Warp(More Values)**|**Multi Stages**|
+|✔️|✔️|✔️|✔️|
+|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store(Reg Reuse&Warp Shfl)**|
+|✔️|✔️|✔️|✔️|
+|**Row Major(NN)**|**Col Major(TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
+|✔️|✔️|✔️||
+
+
 ## 0x00 Overview

 Contents include:
@@ -15,7 +30,7 @@
 - [X] PyTorch bindings

 ## Current Performance
-On the L20, the CUDA Cores FP32 implementation (the L20's theoretical FP32/TF32 throughput is 59.8 TFLOPS) reaches roughly 85%~90% of cuBLAS performance (TFLOPS) and exceeds cuBLAS at some sizes. A known issue is that bank conflicts are not fully eliminated: the current padding-based mitigation wastes shared memory and also hurts SM occupancy. The Tensor Cores TF32 implementation reaches only about 80% of cuBLAS TF32 performance, so a sizable gap remains. Warp swizzle has not yet been implemented by hand (limited by the flexibility of the WMMA API and by my own ability); a later attempt will implement warp swizzle via MMA PTX. In addition, the current TF32 implementation relies on an extra FP32-to-TF32 conversion kernel, which affects overall performance.
+On the L20, the CUDA Cores FP32 implementation (the L20's theoretical FP32/TF32 throughput is 59.8 TFLOPS) reaches roughly 85%~90% of cuBLAS performance (TFLOPS) and exceeds cuBLAS at some sizes. A known issue is that bank conflicts are not fully eliminated: the current padding-based mitigation wastes shared memory and also hurts SM occupancy. The Tensor Cores TF32 implementation reaches only about 80% of cuBLAS TF32 performance, so a sizable gap remains. Smem swizzle has not yet been implemented by hand (limited by the flexibility of the WMMA API and by my own ability); a later attempt will implement smem swizzle/permuted via MMA PTX. In addition, the current TF32 implementation relies on an extra FP32-to-TF32 conversion kernel, which affects overall performance.

 ## Shared Memory Bank Conflicts
