Implementing and benchmarking various matmul implementations in CUDA

AndreasHolt/cuda-matmul-benchmarking


CUDA Matrix Multiplication Benchmarking

This project implements and benchmarks different approaches to matrix multiplication using CUDA:

  • Sequential CPU implementation
  • Naive GPU implementation
  • GPU implementation with memory coalescing (via thread remapping)
  • Tiled GPU implementation using shared memory
  • Tiled GPU implementation using shared memory and memory coalescing (via matrix B transposition)
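As a concrete illustration of the first GPU variant, a naive kernel along these lines assigns one thread per output element. This is a hypothetical sketch, not the repository's actual code; the function name and layout are illustrative:

```cuda
// Naive matmul sketch: one thread per output element C[row][col],
// reading A and B straight from global memory (row-major, N x N).
__global__ void naive_matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```

A kernel like this would typically be launched with small 2D thread blocks (e.g. 16×16) and a grid sized to cover the full N×N output.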

Future work:

  • Further optimizations such as vectorized memory access and register tiling

The implementations and results are discussed in detail in blog posts on my personal site.

Building the Project

mkdir build && cd build
cmake ..
make

Usage

The executable supports different modes:

Run Full Benchmark Suite

./matmul

This runs benchmarks for all implementations across matrix sizes: 32×32, 256×256, 1024×1024, and 2048×2048.

Profile Specific Implementation

./matmul profile <type> <dim>
  • <type>: Implementation type (naive_gpu, coalesced_gpu, tiled_gpu, tiled_coalesced_gpu)
  • <dim>: Matrix dimension (creates dim×dim matrices)

Example:

./matmul profile tiled_gpu 1024

NVIDIA Nsight Compute Profiling

For detailed GPU metrics:

ncu --set full -o naive_2048_full.ncu-rep ./matmul profile naive_gpu 2048
ncu --set full -o tiled_2048_full.ncu-rep ./matmul profile tiled_gpu 2048
ncu --set full -o tiled_coalesced_2048_full.ncu-rep ./matmul profile tiled_coalesced_gpu 2048

Implementation Details

The project implements matrix multiplication using several approaches:

  • sequential_cpu: basic CPU implementation used as a baseline for comparison
  • naive_gpu: each thread computes one element of the output matrix directly from global memory
  • coalesced_gpu: remaps thread indices so that neighboring threads access consecutive global-memory addresses
  • tiled_gpu: uses shared-memory tiling to improve memory access patterns
  • tiled_coalesced_gpu: combines shared-memory tiling with coalesced access by transposing matrix B
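The shared-memory tiling strategy might look roughly like the following sketch. This is a hypothetical illustration, not the repository's actual code, and it assumes square TILE×TILE thread blocks with N divisible by TILE:

```cuda
#define TILE 16

// Tiled matmul sketch: each block cooperatively stages TILE x TILE
// sub-tiles of A and B in shared memory, then every thread accumulates
// its partial dot product from the staged tiles.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile (assumes N % TILE == 0).
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with tiles before next iteration overwrites them
    }
    C[row * N + col] = acc;
}
```

Each element of A and B is then read from global memory only N/TILE times per block rather than N times per thread, which is the source of the speedup over the naive kernel.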

Each implementation can be benchmarked and profiled independently to compare performance. Currently, the reported metrics are throughput (GFLOPS) and execution time (ms).
