|CUDA Cores|Sliced K (Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
|✔️|✔️|✔️|✔️|
|**WMMA (m16n16k16)**|**MMA (m16n8k16)**|**Pack LDST (128 bits)**|**SMEM Padding**|
|✔️|✔️|✔️|✔️|
|**Copy Async**|**Tile MMA (More Threads)**|**Tile Warp (More Values)**|**Multi Stages**|
|✔️|✔️|✔️|✔️|
|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store (Reg Reuse & Warp Shfl)**|
|✔️|✔️|✔️|✔️|
|**Row Major (NN)**|**Col Major (TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
|✔️|✔️|✔️|❔|
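As a taste of the WMMA (m16n16k16) entry above, here is a minimal single-warp sketch. The kernel name, launch shape, and all-row-major layout are illustrative assumptions; the tuned kernels in this repo add the tiling, cp.async, and swizzling listed in the matrix.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (all row-major, half).
// Launch: grid(N/16, M/16), block(32); assumes M, N, K are multiples of 16.
__global__ void hgemm_wmma_m16n16k16(const half* A, const half* B, half* C,
                                     int M, int N, int K) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;
  wmma::fill_fragment(acc, __float2half(0.0f));

  const int row = blockIdx.y * 16;  // origin of this warp's C tile
  const int col = blockIdx.x * 16;

  // Sliced-K: march over K in steps of 16, accumulating into the fragment.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + row * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + col, N);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
  }
  wmma::store_matrix_sync(C + row * N + col, acc, N, wmma::mem_row_major);
}
```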
## 📖 CUDA Kernel Index (Common Interview Questions)
# HGEMM
## HGEMM/SGEMM Supported Matrix
|CUDA Cores|Sliced K (Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
|✔️|✔️|✔️|✔️|
|**WMMA (m16n16k16)**|**MMA (m16n8k16)**|**Pack LDST (128 bits)**|**SMEM Padding**|
|✔️|✔️|✔️|✔️|
|**Copy Async**|**Tile MMA (More Threads)**|**Tile Warp (More Values)**|**Multi Stages**|
|✔️|✔️|✔️|✔️|
|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store (Reg Reuse & Warp Shfl)**|
|✔️|✔️|✔️|✔️|
|**Row Major (NN)**|**Col Major (TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
|✔️|✔️|✔️|❔|
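For the Collective Store (Reg Reuse & Warp Shfl) entry, here is a minimal sketch of the idea, assuming each lane holds one float of an output row (the real kernels pack half results out of MMA fragments; this only shows the shuffle-then-vectorize pattern):

```cuda
// Each lane holds one result in a register. __shfl_down_sync lets every
// 4th lane gather its three neighbors' values, so one lane issues a single
// 128-bit (float4) store instead of four scalar 32-bit stores.
// Assumes `out` is 16-byte aligned and the warp covers out[0..31].
__device__ __forceinline__ void collective_store_f32x4(float* out, float v) {
  const int lane = threadIdx.x & 31;            // lane id within the warp
  float4 pack;
  pack.x = v;                                   // own value
  pack.y = __shfl_down_sync(0xffffffff, v, 1);  // value from lane + 1
  pack.z = __shfl_down_sync(0xffffffff, v, 2);  // value from lane + 2
  pack.w = __shfl_down_sync(0xffffffff, v, 3);  // value from lane + 3
  if ((lane & 3) == 0) {                        // lanes 0, 4, ..., 28 store
    reinterpret_cast<float4*>(out)[lane >> 2] = pack;
  }
}
```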
<details>
<summary>🔑️ Click to see all supported HGEMM kernels!</summary>
# SGEMM
## HGEMM/SGEMM Supported Matrix

|CUDA Cores|Sliced K (Loop over K)|Tile Block|Tile Thread|
|:---:|:---:|:---:|:---:|
|✔️|✔️|✔️|✔️|
|**WMMA (m16n16k16)**|**MMA (m16n8k16)**|**Pack LDST (128 bits)**|**SMEM Padding**|
|✔️|✔️|✔️|✔️|
|**Copy Async**|**Tile MMA (More Threads)**|**Tile Warp (More Values)**|**Multi Stages**|
|✔️|✔️|✔️|✔️|
|**Reg Double Buffers**|**Block Swizzle**|**Warp Swizzle**|**Collective Store (Reg Reuse & Warp Shfl)**|
|✔️|✔️|✔️|✔️|
|**Row Major (NN)**|**Col Major (TN)**|**SGEMM TF32**|**SMEM Swizzle/Permuted**|
|✔️|✔️|✔️|❔|
## 0x00 Overview
This includes the following:
- [X] PyTorch bindings (see the binding sketch below)
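For the PyTorch bindings item, a minimal sketch of how such a binding is typically wired up with torch's C++ extension API; the launcher name `sgemm_naive_f32` is a placeholder, not necessarily this repo's exact symbol:

```c++
#include <torch/extension.h>

// Hypothetical launcher implemented in the .cu file; it would check
// shapes/dtypes and launch the CUDA kernel on the current stream.
void sgemm_naive_f32(torch::Tensor a, torch::Tensor b, torch::Tensor c);

// Expose the launcher to Python as a regular function on CUDA tensors.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("sgemm_naive_f32", &sgemm_naive_f32, "naive SGEMM f32 (CUDA)");
}
```

Built with `torch.utils.cpp_extension.load(...)`, the function is then callable from Python like any other op.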
## Current Performance
On an L20 (theoretical FP32/TF32 throughput: 59.8 TFLOPS), the CUDA Cores FP32 implementation currently reaches roughly 85%~90% of cuBLAS performance in TFLOPS, and beats cuBLAS at some sizes. A known issue is that bank conflicts are not fully eliminated: they are mitigated via SMEM padding, which wastes shared memory and also hurts SM occupancy. The Tensor Cores TF32 implementation only reaches about 80% of cuBLAS TF32 performance, so a sizable gap remains. SMEM swizzle is not hand-implemented yet (limited by the flexibility of the WMMA API, and by my own ability); a follow-up will try to implement SMEM swizzle/permuted layouts via MMA PTX. In addition, the current TF32 path relies on an extra FP32-to-TF32 conversion kernel, which affects overall performance.
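The extra conversion pass mentioned above typically looks like the following (a minimal sketch, not this repo's exact kernel). `nvcuda::wmma::__float_to_tf32` (sm_80+, CUDA 11+) rounds an FP32 value to TF32 precision while keeping the float storage format:

```cuda
#include <mma.h>

// Round a float buffer to TF32 precision in place, so that tf32 WMMA/MMA
// fragments loaded from it see correctly rounded inputs. This extra pass
// over global memory is the overhead discussed above.
__global__ void f32_to_tf32_kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = nvcuda::wmma::__float_to_tf32(x[i]);
}
```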
## Shared Memory Bank Conflicts
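A minimal sketch of the padding mitigation discussed above, shown on the classic 32x32 transpose tile; this example is illustrative, and the GEMM kernels apply the same idea to their A/B SMEM tiles. Launch assumes `dim3 block(32, 32)` and `n` a multiple of 32.

```cuda
#define TILE 32

// Without the +1 pad, a column read of tile[threadIdx.x][threadIdx.y] hits
// the same bank 32 times per warp (32-way conflict), since each 32-float row
// spans all 32 banks. The extra column shifts successive rows by one bank,
// making column accesses conflict-free at the cost of 32 * 4 = 128 wasted
// bytes per tile -- the SMEM waste / occupancy trade-off mentioned above.
__global__ void transpose_padded(const float* in, float* out, int n) {
  __shared__ float tile[TILE][TILE + 1];  // +1 pad column

  int x = blockIdx.x * TILE + threadIdx.x;
  int y = blockIdx.y * TILE + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load

  __syncthreads();

  x = blockIdx.y * TILE + threadIdx.x;              // transposed block origin
  y = blockIdx.x * TILE + threadIdx.y;
  out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}
```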