A simple GEMM kernel for sparse convolution. It is mainly an implicit-GEMM CUDA kernel with a naive use of tensor cores. The following tricks are used to improve overall runtime:

- tensor cores
- float16 arithmetic with half2 intrinsics (hfma2, hmul2, hadd2)
- software pipelining
- combined 128-bit memory accesses (ldg128, stg128). Note that on A100, using the ldgsts intrinsic should be considered.
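
Two of the tricks above, half2 arithmetic and 128-bit memory access, can be illustrated with a minimal sketch. This is not the kernel from this repository: `fma_half8` and `run_fma_half8` are hypothetical names, and the code assumes `n` is a multiple of 8, 16-byte-aligned buffers, and a GPU with native fp16 arithmetic (compile with, for example, `nvcc -arch=sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: y[i] = a[i] * x[i] + y[i] on fp16 data.
// Each thread handles 8 halves (128 bits) per operand.
// Assumes n is a multiple of 8 and all pointers are 16-byte aligned
// (buffers returned by cudaMalloc satisfy the alignment).
__global__ void fma_half8(const half* a, const half* x, half* y, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    if (i >= n) return;

    // One 128-bit load per operand instead of eight 16-bit loads.
    uint4 va = *reinterpret_cast<const uint4*>(a + i);
    uint4 vx = *reinterpret_cast<const uint4*>(x + i);
    uint4 vy = *reinterpret_cast<const uint4*>(y + i);

    // View each 128-bit value as four half2 pairs.
    const half2* pa = reinterpret_cast<const half2*>(&va);
    const half2* px = reinterpret_cast<const half2*>(&vx);
    half2* py = reinterpret_cast<half2*>(&vy);

    // __hfma2 performs two fp16 fused multiply-adds per call.
    for (int k = 0; k < 4; ++k) {
        py[k] = __hfma2(pa[k], px[k], py[k]);
    }

    // One 128-bit store for the 8 results.
    *reinterpret_cast<uint4*>(y + i) = vy;
}

// Hypothetical host helper: runs the kernel on test data and returns the
// number of outputs that differ from the expected value.
int run_fma_half8(int n) {
    std::vector<half> ha(n), hx(n), hy(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = __float2half(0.5f * (i % 8));  // 0.0, 0.5, ..., 3.5
        hx[i] = __float2half(2.0f);
        hy[i] = __float2half(1.0f);
    }

    size_t bytes = static_cast<size_t>(n) * sizeof(half);
    half *da = nullptr, *dx = nullptr, *dy = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n / 8 + threads - 1) / threads;
    fma_half8<<<blocks, threads>>>(da, dx, dy, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(dx);
    cudaFree(dy);

    // Expected: 0.5 * (i % 8) * 2 + 1 = (i % 8) + 1, exactly representable in fp16.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (__half2float(hy[i]) != static_cast<float>(i % 8) + 1.0f) ++mismatches;
    }
    return mismatches;
}
```

Calling `run_fma_half8(1 << 12)` from a `main` function should return 0 on a working GPU. The vector width of 8 halves is chosen so that each thread's load and store is exactly 128 bits wide, which is the access size the ldg128 and stg128 names in the list refer to.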