This project implements and benchmarks different approaches to matrix multiplication using CUDA:
- Sequential CPU implementation
- Naive GPU implementation
- GPU implementation with memory coalescing (via thread remapping)
- Tiled GPU implementation using shared memory
- Tiled GPU implementation using shared memory and memory coalescing (via matrix B transposition)
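As a point of reference for the GPU variants above, here is a minimal sketch of a naive kernel with one thread per output element; the kernel name and signature are illustrative assumptions, not necessarily the code in this repo:

```cuda
// Illustrative naive kernel (hypothetical name/signature): each thread
// computes one element of C = A * B for n x n row-major matrices.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    // Mapping the column to threadIdx.x means consecutive threads in a warp
    // read consecutive elements of B and write consecutive elements of C,
    // i.e. coalesced accesses; swapping the two mappings is the kind of
    // thread remapping the coalesced variant refers to.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}
```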
Future work:
- More optimizations planned (vectorized memory access, register tiling, etc.)
The implementations and results are discussed in detail in these blog posts on my personal site:
- Part 1: Naive GPU Implementation, Explanation, and CPU vs naive GPU Benchmarking
- Part 2: Tiled Matrix Multiplication Explained and Implemented, Benchmarking against naive GPU, and Performance Analysis with Nsight Compute
Build:

```
mkdir build && cd build
cmake ..
make
```

The executable supports different modes:
```
./matmul
```

This runs benchmarks for all implementations across matrix sizes: 32×32, 256×256, 1024×1024, and 2048×2048.
Profile mode:

```
./matmul profile <type> <dim>
```

- `<type>`: implementation type (`naive_gpu`, `coalesced_gpu`, `tiled_gpu`, `tiled_coalesced_gpu`)
- `<dim>`: matrix dimension (creates dim×dim matrices)
Example:

```
./matmul profile tiled_gpu 1024
```

For detailed GPU metrics:
```
ncu --set full -o naive_2048_full.ncu-rep ./matmul profile naive_gpu 2048
ncu --set full -o tiled_2048_full.ncu-rep ./matmul profile tiled_gpu 2048
ncu --set full -o tiled_coalesced_2048_full.ncu-rep ./matmul profile tiled_coalesced_gpu 2048
```

The implementations in more detail:
- `naive_gpu`: each thread computes one element of the output matrix
- `tiled_gpu`: uses shared memory tiling to improve memory access patterns
- `sequential_cpu`: basic CPU implementation for baseline comparison
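To make the tiling idea concrete, here is a minimal sketch of a shared-memory tiled kernel; the tile size and names are illustrative assumptions rather than this repo's actual code:

```cuda
// Illustrative tiled kernel (hypothetical name; TILE chosen for the sketch).
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Each block cooperatively stages one tile of A and one tile of B into
    // shared memory, then every thread reuses the staged data TILE times,
    // cutting global-memory traffic by roughly a factor of TILE.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n)
        C[row * n + col] = acc;
}
```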
Each implementation can be benchmarked and profiled independently to compare performance across different metrics; for now, these are GFLOPS and execution time (ms).
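For reference on the GFLOPS metric: an n×n multiply performs 2n³ floating-point operations (n³ multiplies and n³ adds), so the conversion from a measured runtime looks like the following sketch (a hypothetical helper, assuming the benchmark reports elapsed milliseconds):

```cpp
// Hypothetical helper (not the repo's actual code): converts a measured
// runtime into GFLOPS using the conventional 2*n^3 operation count.
double to_gflops(int n, double elapsed_ms) {
    double ops = 2.0 * (double)n * n * n;  // n^3 multiplies + n^3 adds
    return ops / (elapsed_ms * 1e6);       // ms -> s (1e-3), then / 1e9
}
```

For example, a 2048×2048 multiply finishing in 10 ms would report about 1718 GFLOPS.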