  | ✔️ [hgemv_k16_f16](./hgemv/hgemv.cu)|f16|f16|[link](./hgemv/)|⭐️⭐️⭐️|
  | ✔️ [flash_attn_1_fwd_f32](./flash-attn/flash_attn.cu)|f32|f32|[link](./flash-attn)|⭐️⭐️⭐️|
  | ✔️ [flash_attn_2_fwd_f16_m16n8k16*](./flash-attn/flash_attn_mma.cu)|f16|f16|[link](./flash-attn)|⭐️⭐️⭐️|
- | ✔️ [hard_nms cpp only](./nms/nms.cc)|f32|/|/|⭐️|
+ | ✔️ [nms_kernel](./nms/nms.cu)|f32|/|[link](./nms)|⭐️⭐️|
  | ✔️ [notes v1(deprecated)](./notes-v1.cu)|f32|f32|/|⭐️|

  👉TIPS: * means using **Tensor Cores(MMA/WMMA)**, otherwise, using CUDA Cores by default.
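
  - NVIDIA L20

- The current best implementation, on the L20 (theoretical Tensor Cores FP16 throughput of 119.5 TFLOPS), reaches roughly 95%~98% of cuBLAS performance with the WMMA API (105-113 TFLOPS vs 105-115 TFLOPS), and reaches 115 TFLOPS with the MMA API, surpassing cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated: the current mitigation, padding, wastes shared memory and also affects SM occupancy. In addition, smem swizzle has not yet been implemented by hand (limited by the flexibility of the WMMA API and the row major layout); a later attempt will implement smem swizzle via MMA PTX and a col major layout. [Click to view performance data](#NV-L20).
+ The current best implementation, on the L20 (theoretical Tensor Cores FP16 throughput of 119.5 TFLOPS), reaches roughly 95%~98% of cuBLAS performance with the WMMA API (105-113 TFLOPS vs 105-115 TFLOPS), and reaches 115 TFLOPS with the MMA API, surpassing cuBLAS in some cases. A known issue is that bank conflicts are not fully eliminated: the current mitigation, padding, wastes shared memory and also affects SM occupancy. In addition, smem swizzle/permute has not yet been implemented by hand (limited by the flexibility of the WMMA API and the row major layout); a later attempt will implement smem swizzle/permute via MMA PTX and a col major layout. [Click to view performance data](#NV-L20).

  - NVIDIA GeForce RTX 3080 Laptop
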
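The padding versus swizzle trade-off mentioned in the paragraph above can be illustrated with a small host-side model. This is only a sketch under simplifying assumptions, not the repository's kernel code: it assumes 32 shared-memory banks of 4 bytes each, a 32x32 tile of 32-bit words, and a warp in which lane `r` reads element `(r, col)` of one column. The names `Layout`, `word_offset`, `max_conflict_degree`, and `words_used` are hypothetical and exist only for this example; the real HGEMM kernels use half-precision tiles and wider accesses, so their exact conflict pattern differs.

```cpp
#include <algorithm>
#include <array>

// Assumed model: 32 banks, each 4 bytes wide; tile of 32x32 32-bit words.
constexpr int kBanks = 32;
constexpr int kRows = 32;
constexpr int kCols = 32;

// Three hypothetical shared-memory layouts for the same logical tile.
enum class Layout { Plain, Padded, Swizzled };

// Word offset of logical element (row, col) inside shared memory.
int word_offset(Layout layout, int row, int col) {
  switch (layout) {
    case Layout::Plain:    return row * kCols + col;          // dense rows
    case Layout::Padded:   return row * (kCols + 1) + col;    // one pad word per row
    case Layout::Swizzled: return row * kCols + (col ^ row);  // XOR-permuted column
  }
  return -1;
}

// Largest number of lanes that hit the same bank when lane r reads (r, col).
// 1 means conflict-free; 32 means the access is fully serialized.
int max_conflict_degree(Layout layout, int col) {
  std::array<int, kBanks> hits{};
  for (int lane = 0; lane < kRows; ++lane) {
    ++hits[word_offset(layout, lane, col) % kBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}

// Shared-memory footprint of the tile in 32-bit words.
int words_used(Layout layout) {
  return layout == Layout::Padded ? kRows * (kCols + 1) : kRows * kCols;
}
```

In this model, `max_conflict_degree(Layout::Plain, 0)` is 32, while the padded and swizzled layouts both give 1. The difference is the footprint: `words_used` is 1056 for the padded layout and 1024 for the swizzled one, which is the shared-memory waste (and the resulting occupancy pressure) that the paragraph attributes to padding.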
@@ -1015,6 +1015,7 @@ hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_kernel(
      }
    }

+   // collective store with reg reuse & warp shuffle
    for (int i = 0; i < WARP_TILE_M; ++i) {
      // reuse RA[2][4][4] reg here, this may boost 0.3~0.5 TFLOPS up.
      // may not put 'if' in N loop, it will crash the 'pragma unroll' hint ?