siboehm's post explains how to iteratively improve the performance of a CUDA kernel for matrix multiplication.
This repo reimplements those kernels (not all of them yet) in Metal, Apple's GPU compute API.

To run: `./src/run.py`

Performance on M1 Pro:

| Kernel | GFLOPs/s |
|---|---|
| 1: Naive | 20 |
| 2: GMEM Coalescing | 280 |
| 3: SMEM Caching | - |
| 4: 1D Blocktiling | - |
| 5: 2D Blocktiling | - |
| 6: Vectorized Mem Access | - |
| 9: Autotuning | - |
| 10: Warptiling | - |
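
The table does not say how the GFLOPs/s figures are derived. Below is a minimal sketch, assuming the usual convention of counting 2·M·N·K floating-point operations for an (M×K)·(K×N) product, i.e. one multiply and one add per inner-product term. The function name `gflops_per_second` is hypothetical and is not taken from this repo, and `./src/run.py` may measure things differently.

```python
def gflops_per_second(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) matmul that took `seconds` to run.

    Assumption: 2 * m * n * k floating-point operations in total
    (one multiply and one add per term of each inner product).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing, not a measurement from this repo.
    print(gflops_per_second(1024, 1024, 1024, 0.5))
```

Timing the kernel itself (excluding buffer allocation and shader compilation) and averaging over several runs generally gives more stable numbers than a single end-to-end measurement.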