Collection of example kernels:
- Tiled MatMul: A simple implementation of tiled matrix multiplication.
- 1D Softmax: Different implementations of 1D softmax, with some profiling.
- Flash Attention: Implementation of fused matmul and softmax, followed by flash attention.
- Reduce: Simple implementations of the sum/reduce kernel.
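
As a rough illustration of the kind of kernel listed above, here is a minimal sketch of a block-level sum reduction in CUDA. It is not taken from this repository: the kernel name `reduce_sum`, the launch configuration, and the use of managed memory and `atomicAdd` are all assumptions chosen to keep the example short, and the actual implementations here may differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from this repo): each block sums its slice of
// `in` in shared memory, then adds its partial sum to `*out` atomically.
// Assumes blockDim.x is a power of two.
__global__ void reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread, padding out-of-range threads with 0.
    smem[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            smem[tid] += smem[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 holds the block's partial sum.
    if (tid == 0) {
        atomicAdd(out, smem[0]);
    }
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));

    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    const int threads = 256;  // power of two, as the kernel assumes
    const int blocks = (n + threads - 1) / threads;
    reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("sum = %.0f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If saved as a standalone file, this should compile with `nvcc`. Using a single `atomicAdd` per block keeps the sketch to one kernel launch; a second reduction pass over per-block partial sums is a common alternative.
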
make setup

Tested on:
- NVIDIA A10G
- CUDA Version: 12.6