This project implements and benchmarks different approaches to matrix multiplication using CUDA:
- Sequential CPU implementation
- Naive GPU implementation
- GPU implementation with memory coalescing (via thread remapping)
- Tiled GPU implementation using shared memory
- Tiled GPU implementation using shared memory and memory coalescing (via matrix B transposition)
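As a point of reference for the GPU variants above, here is a minimal sketch of a naive kernel with one thread per output element; the kernel name and signature are illustrative assumptions, not necessarily the code in this repo:

```cuda
// Illustrative naive kernel (hypothetical name/signature): each thread
// computes one element of C = A * B for n x n row-major matrices.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    // Mapping the column to threadIdx.x means consecutive threads in a warp
    // read consecutive elements of B and write consecutive elements of C,
    // i.e. coalesced accesses; swapping the two mappings is the kind of
    // thread remapping the coalesced variant refers to.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}
```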
Future work:
- More optimizations planned (vectorized memory access, register tiling, etc.)
The implementations and results are discussed in detail in these blog posts on my personal site:
- Part 1: Naive GPU Implementation, Explanation, and CPU vs naive GPU Benchmarking
- Part 2: Tiled Matrix Multiplication Explained and Implemented, Benchmarking against naive GPU, and Performance Analysis with Nsight Compute
Build:

```
mkdir build && cd build
cmake ..
make
```

The executable supports different modes:
```
./matmul
```

This runs benchmarks for all implementations across matrix sizes: 32×32, 256×256, 1024×1024, and 2048×2048.
Profile mode:

```
./matmul profile <type> <dim>
```

- `<type>`: implementation type (`naive_gpu`, `coalesced_gpu`, `tiled_gpu`, `tiled_coalesced_gpu`)
- `<dim>`: matrix dimension (creates dim×dim matrices)
Example:

```
./matmul profile tiled_gpu 1024
```

For detailed GPU metrics:
```
ncu --set full -o naive_2048_full.ncu-rep ./matmul profile naive_gpu 2048
ncu --set full -o tiled_2048_full.ncu-rep ./matmul profile tiled_gpu 2048
ncu --set full -o tiled_coalesced_2048_full.ncu-rep ./matmul profile tiled_coalesced_gpu 2048
```

The implementations in more detail:
- `naive_gpu`: each thread computes one element of the output matrix
- `tiled_gpu`: uses shared memory tiling to improve memory access patterns
- `sequential_cpu`: basic CPU implementation for baseline comparison
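To make the tiling idea concrete, here is a minimal sketch of a shared-memory tiled kernel; the tile size and names are illustrative assumptions rather than this repo's actual code:

```cuda
// Illustrative tiled kernel (hypothetical name; TILE chosen for the sketch).
#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Each block cooperatively stages one tile of A and one tile of B into
    // shared memory, then every thread reuses the staged data TILE times,
    // cutting global-memory traffic by roughly a factor of TILE.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n)
        C[row * n + col] = acc;
}
```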
Each implementation can be benchmarked and profiled independently to compare performance across different metrics; for now, these are GFLOPS and execution time (ms).
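For reference on the GFLOPS metric: an n×n multiply performs 2n³ floating-point operations (n³ multiplies and n³ adds), so the conversion from a measured runtime looks like the following sketch (a hypothetical helper, assuming the benchmark reports elapsed milliseconds):

```cpp
// Hypothetical helper (not the repo's actual code): converts a measured
// runtime into GFLOPS using the conventional 2*n^3 operation count.
double to_gflops(int n, double elapsed_ms) {
    double ops = 2.0 * (double)n * n * n;  // n^3 multiplies + n^3 adds
    return ops / (elapsed_ms * 1e6);       // ms -> s (1e-3), then / 1e9
}
```

For example, a 2048×2048 multiply finishing in 10 ms would report about 1718 GFLOPS.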