Optimizing a Metal Matmul Kernel

siboehm's post explains how to iteratively improve the performance of a CUDA kernel for matrix multiplication.

This repo reimplements those kernels (not all of them yet) in Metal, Apple's GPU compute API.
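For orientation, here is a minimal CPU sketch in plain Python of what the simplest (naive) kernel in that series computes: one output element of C per GPU thread. This is not code from this repo. The function name `naive_matmul` and the flat row-major layout are assumptions made for illustration, and the alpha/beta scaling of a full SGEMM is left out.

```python
def naive_matmul(a, b, m, n, k):
    """Reference matmul on flat row-major lists.

    a is m x k, b is k x n, and the result is m x n.
    Each (row, col) pair below corresponds to the work that a single
    GPU thread would do in a naive one-thread-per-output kernel.
    """
    c = [0.0] * (m * n)
    for row in range(m):
        for col in range(n):
            acc = 0.0
            # Dot product of row `row` of A with column `col` of B.
            for i in range(k):
                acc += a[row * k + i] * b[i * n + col]
            c[row * n + col] = acc
    return c


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]  # 2 x 2
    b = [5.0, 6.0, 7.0, 8.0]  # 2 x 2
    print(naive_matmul(a, b, 2, 2, 2))
```

A reference like this is slow, but it can be handy for checking a GPU kernel's output on small inputs.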

Running a kernel

./src/run.py

Performance on M1 Pro:
