This repo contains educational CUDA kernels that build up from a single warp computing a single 16x8x16 tile with a PTX mma instruction to a kernel that outperforms cuBLAS by 20% on an M=N=K=1024 matrix multiplication with f16 inputs and f32 accumulation.
NOTE: all benchmarks were run on an RTX 3070, which is sm_86 (it is Ampere, but it differs from a 4090 (sm_89) and has less shared memory).
[Single-warp mma][01_warp_mma.cu]
This example shows how to correctly load/store the A, B, and C matrices to use the mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 instruction.
nvcc -lcublas -arch=sm_86 -O3 01_warp_mma.cu -o main && ./main
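A minimal sketch of the inner operation (not necessarily the repo's exact code): each thread holds its slice of the A/B/C fragments in registers, filled beforehand according to the fragment layout in the PTX ISA, and the whole warp issues the mma in lockstep.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// One m16n8k16 mma issued by a full warp. Per thread: A is 4x 32-bit
// registers (8 halves), B is 2x 32-bit (4 halves), and the f32 accumulator
// is 4 floats (C and D alias here, so this computes D += A*B).
// a_frag/b_frag must already hold this thread's slice per the PTX ISA layout.
__device__ void mma_m16n8k16(float d_frag[4], const uint32_t a_frag[4],
                             const uint32_t b_frag[2]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(d_frag[0]), "+f"(d_frag[1]), "+f"(d_frag[2]), "+f"(d_frag[3])
        : "r"(a_frag[0]), "r"(a_frag[1]), "r"(a_frag[2]), "r"(a_frag[3]),
          "r"(b_frag[0]), "r"(b_frag[1]));
}
```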
[Multi-warp mma][02_multi_warp_mma.cu]
This example shows how to use multiple warps to compute a larger tile.
nvcc -lcublas -arch=sm_86 -O3 02_multi_warp_mma.cu -o main && ./main
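A sketch of the warp-to-tile mapping, assuming 8 warps covering a 64x16 block tile (the repo's layout may differ): warps only share the A/B inputs; each accumulates its own C fragment independently.

```cuda
// Hypothetical mapping: 8 warps (256 threads) tile a 64x16 output block,
// one 16x8 mma tile per warp, arranged 4 down the M dimension and 2 across N.
__global__ void multi_warp_tile(/* A, B, C pointers ... */) {
    int warp_id  = threadIdx.x / 32;
    int tile_row = (warp_id / 2) * 16;   // this warp's tile origin in M
    int tile_col = (warp_id % 2) * 8;    // and in N
    // load the fragments for (tile_row, tile_col) and issue the mma
    // exactly as in example 01
}
```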
[Tiled matrix multiplication][03_tile_mma.cu]
This example shows how to go from computing a single tile with multiple warps to integrating the inner kernel into a full tiled matrix multiplication.
nvcc -lcublas -arch=sm_86 -O3 03_tile_mma.cu -o main && ./main
12% of cuBLAS performance
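A skeleton of the outer structure, with illustrative tile sizes:

```cuda
#include <cuda_fp16.h>

constexpr int BM = 64, BN = 64, BK = 16;   // illustrative block-tile sizes

// Every block owns one BMxBN tile of C and marches over K in BK-wide steps,
// staging slices of A and B in shared memory for the example-02 inner kernel.
__global__ void tiled_mma(const half* A, const half* B, float* C,
                          int M, int N, int K) {
    __shared__ half As[BM][BK];
    __shared__ half Bs[BK][BN];
    // C accumulators live in fragments/registers across the whole K loop
    for (int k0 = 0; k0 < K; k0 += BK) {
        // 1. all threads cooperatively copy A[blockIdx.y*BM.., k0..k0+BK)
        //    and B[k0..k0+BK, blockIdx.x*BN..) into As/Bs
        __syncthreads();   // tiles fully written before anyone reads them
        // 2. each warp runs the example-02 inner kernel on its slice
        __syncthreads();   // tiles fully read before the next copy overwrites
    }
    // 3. write the accumulated fragments out to C
}
```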
[Vectorized memory access][04_vectorized_mem_access.cu]
This example shows how to perform 128-bit loads and stores on every thread.
nvcc -lcublas -arch=sm_86 -O3 04_vectorized_mem_access.cu -o main && ./main
13% of cuBLAS performance
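A sketch of the idea, assuming both pointers are 16-byte aligned (copy_vectorized is an illustrative helper, not from the repo):

```cuda
#include <cuda_fp16.h>

// Each thread moves 8 halves (16 bytes) per memory instruction by viewing
// the buffers as int4, i.e. one 128-bit load plus one 128-bit store instead
// of eight 16-bit pairs. The same trick applies in every copy direction.
__device__ void copy_vectorized(half* __restrict__ dst,
                                const half* __restrict__ src, int n_halves) {
    int4* dst4       = reinterpret_cast<int4*>(dst);
    const int4* src4 = reinterpret_cast<const int4*>(src);
    for (int i = threadIdx.x; i < n_halves / 8; i += blockDim.x)
        dst4[i] = src4[i];
}
```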
[Leading-dimension padding][05_ld_padding.cu]
This example shows how to reduce shared memory bank conflicts by padding the leading dimension of the A, B, and C matrices.
nvcc -lcublas -arch=sm_86 -O3 05_ld_padding.cu -o main && ./main
68% of cuBLAS performance
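A sketch with illustrative sizes:

```cuda
#include <cuda_fp16.h>

constexpr int BM = 64, BK = 16, PAD = 8;   // illustrative sizes

__global__ void padded_smem_kernel(/* ... */) {
    // With a row stride of exactly BK halves (32 bytes), a warp reading down
    // a column keeps hitting the same few banks. Padding the leading
    // dimension by PAD halves (16 bytes) staggers successive rows across
    // banks, at the cost of BM*PAD extra halves of shared memory.
    __shared__ half As[BM][BK + PAD];
    // ... fill and read As exactly as before; only the row stride changed ...
}
```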
[Shared memory swizzling][06_smem_swizzling.cu]
This example shows how to reduce shared memory bank conflicts by XOR swizzling shared memory accesses.
nvcc -lcublas -arch=sm_86 -O3 06_smem_swizzling.cu -o main && ./main
58% of cuBLAS performance
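One common XOR swizzle, sketched under the assumption of 64-half rows accessed at 128-bit granularity; the repo's exact function may differ:

```cuda
// Treat each 64-half row as 8 aligned groups of 8 halves (128 bits each)
// and XOR the group index with the low 3 bits of the row index. Rows keep
// their full width (no padding bytes), but a warp reading one "column" of
// groups now spreads over 8 different bank sets instead of hammering one.
__device__ __forceinline__ int swizzled_group(int row, int group) {
    return group ^ (row & 7);
}
// Stores and loads must apply the same permutation, e.g.:
//   smem[row * 64 + swizzled_group(row, col / 8) * 8 + col % 8] = v;
```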
[C register tile][07_c_reg_tile.cu]
This example shows how to decrease shared memory usage and increase register usage by keeping the entire C tile in registers.
nvcc -lcublas -arch=sm_86 -O3 07_c_reg_tile.cu -o main && ./main
103% of cuBLAS performance
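A sketch of the accumulator layout with illustrative per-warp tile counts:

```cuda
// Illustrative: each warp covers a 64x32 patch of C as a 4x4 grid of 16x8
// mma tiles, so every thread carries 4*4*4 = 64 floats of accumulator state
// in registers for the entire K loop; no C traffic touches shared memory
// until the final write-out.
__device__ void k_loop_with_reg_c(int K, int BK /* , smem tile views ... */) {
    float c_frag[4][4][4] = {};   // [m_tile][n_tile][4 f32 per mma]
    for (int k0 = 0; k0 < K; k0 += BK) {
        // load A/B fragments from shared memory, then accumulate in place:
        // for each (i, j): mma_m16n8k16(c_frag[i][j], a_frag[i], b_frag[j])
    }
    // only after the K loop are the fragments written out to C
    (void)c_frag;
}
```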
[Software pipelining][08_pipelining.cu]
This example shows how to do software pipelining with asynchronous global-to-shared memory copies that overlap with compute.
nvcc -lcublas -arch=sm_86 -O3 08_pipelining.cu -o main && ./main
97% of cuBLAS performance
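A double-buffering sketch using the cuda_pipeline.h intrinsics (the repo may issue raw cp.async PTX instead); indexing is simplified by assuming A is pre-tiled contiguously and the block has 128 threads:

```cuda
#include <cuda_fp16.h>
#include <cuda_pipeline.h>

constexpr int BM = 64, BK = 16;   // illustrative tile sizes (BM*BK = 1024)

__global__ void pipelined(const half* A, int K /* ... */) {
    __shared__ half As[2][BM * BK];
    int slice = threadIdx.x * 8;   // 8 halves = 16 bytes per thread per stage

    // prologue: start copying tile 0 into buffer 0
    __pipeline_memcpy_async(&As[0][slice], &A[slice], 16);
    __pipeline_commit();

    int buf = 0;
    for (int k0 = 0; k0 < K; k0 += BK) {
        if (k0 + BK < K)           // prefetch the next tile into the other buffer
            __pipeline_memcpy_async(&As[buf ^ 1][slice],
                                    &A[(k0 + BK) * BM + slice], 16);
        __pipeline_commit();       // commit a (possibly empty) stage
        __pipeline_wait_prior(1);  // the current tile has now landed
        __syncthreads();
        // ... ldmatrix + mma on As[buf] while the next copy is in flight ...
        buf ^= 1;
        __syncthreads();           // everyone done reading before overwrite
    }
}
```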
[C shared memory staging][09_c_smem_staging.cu]
This example shows how to get coalesced global memory stores to C by staging C_frag through shared memory.
nvcc -lcublas -arch=sm_86 -O3 09_c_smem_staging.cu -o main && ./main
104% of cuBLAS performance
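A sketch of the write-out, assuming BMxBN block tiles and N and block_col divisible by 4 so the 128-bit stores stay aligned (the fragment scatter itself is elided):

```cuda
constexpr int BM = 64, BN = 64;   // illustrative block-tile sizes

// Writing fragments straight to C scatters each thread's 4 floats across
// strided addresses. Bouncing them through a shared staging tile makes the
// global traffic contiguous: step 1 is uncoalesced but hits cheap smem,
// step 2 is fully coalesced 128-bit stores to gmem.
__device__ void store_c_staged(float* C, int N, int block_row, int block_col) {
    __shared__ float Cs[BM][BN];
    // 1. each thread scatters its accumulator fragments into Cs at the
    //    positions the mma layout dictates (indexing elided)
    __syncthreads();
    // 2. consecutive threads now store consecutive float4s of each row
    for (int i = threadIdx.x; i < BM * BN / 4; i += blockDim.x) {
        int r = i / (BN / 4), c = i % (BN / 4);
        float4 v = reinterpret_cast<const float4*>(&Cs[r][0])[c];
        reinterpret_cast<float4*>(&C[(block_row + r) * N + block_col])[c] = v;
    }
}
```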
[Software pipelining with swizzling][10_pipelining_with_swizzling.cu]
This example shows how to combine software pipelining (a technique that increases shared memory usage) with swizzling (a technique that, compared to LD padding, reduces shared memory usage) to make software pipelining profitable.
nvcc -lcublas -arch=sm_86 -O3 10_pipelining_with_swizzling.cu -o main && ./main
105% of cuBLAS performance
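Back-of-envelope arithmetic for why the combination pays off, with illustrative 128x128x32 tiles (the repo's sizes may differ); sm_86 offers about 100 KB of shared memory per SM:

```cuda
#include <cuda_fp16.h>

constexpr int BM = 128, BN = 128, BK = 32, PAD = 8, STAGES = 2;

// Double-buffered A and B tiles, padded vs. swizzled:
constexpr int bytes_padded =
    STAGES * (BM * (BK + PAD) + BK * (BN + PAD)) * sizeof(half);  // 37888 B
constexpr int bytes_swizzled =
    STAGES * (BM * BK + BK * BN) * sizeof(half);                  // 32768 B
// 102400 / 37888 -> 2 resident blocks per SM with padding, but
// 102400 / 32768 -> 3 with swizzling: same pipeline depth, more occupancy.
```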
[Software pipelining/shared memory swizzling/C shared memory staging][11_pipeling_swizzling_c_smem_staging.cu]
Software pipelining + shared memory swizzling + C shared memory staging. This is the kernel with the highest performance.
nvcc -lcublas -arch=sm_86 -O3 11_pipeling_swizzling_c_smem_staging.cu -o main && ./main
120% of cuBLAS performance
[BONUS: how much performance do we lose by not using the ldmatrix PTX instruction?][12_fastest_without_ldmatrix.cu]
Takes the fastest implementation and replaces the ldmatrix PTX instruction with manual reads of the matrix fragments from shared memory.
nvcc -lcublas -arch=sm_86 -O3 12_fastest_without_ldmatrix.cu -o main && ./main
100% of cuBLAS performance
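A sketch of the manual A-fragment load (shown without the swizzle for clarity; load_a_frag_manual is an illustrative helper following the m16n8k16 fragment layout from the PTX ISA):

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Each lane reads its four 2-half pairs straight from shared memory at the
// positions the mma expects: lanes form 8 groups of 4; the group picks the
// row, the lane-in-group picks the column pair, and the +8 offsets cover
// the other half of the 16x16 A tile. lda (in halves) must be even so the
// 32-bit reads stay 4-byte aligned. Unlike ldmatrix, this costs four
// separate LDS instructions per lane and a harder-to-tune access pattern.
__device__ void load_a_frag_manual(uint32_t a_frag[4],
                                   const half* tile, int lda) {
    int lane = threadIdx.x % 32;
    int row  = lane / 4;
    int col  = (lane % 4) * 2;
    a_frag[0] = *reinterpret_cast<const uint32_t*>(&tile[(row    ) * lda + col    ]);
    a_frag[1] = *reinterpret_cast<const uint32_t*>(&tile[(row + 8) * lda + col    ]);
    a_frag[2] = *reinterpret_cast<const uint32_t*>(&tile[(row    ) * lda + col + 8]);
    a_frag[3] = *reinterpret_cast<const uint32_t*>(&tile[(row + 8) * lda + col + 8]);
}
```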