jasonlarkin/cuda-stencil-benchmark

CUDA Stencil Benchmark

An LLM-guided framework for generating and benchmarking high-performance CUDA kernels, with automated correctness validation and roofline-based performance characterization.

Overview

Primary Goal: Produce correct and performant CUDA kernels for complex 3D finite-difference stencil computations.

Broader Purpose: Develop a generalizable workflow for LLM-guided kernel generation that can be applied to other computational kernels. The framework includes simpler examples (e.g., matrix multiplication) to guide development and explore the boundaries of workflow generality.

This framework provides:

  • LLM-Guided Kernel Generation: Systematic workflow for generating optimized CUDA kernels from task specifications
  • Correctness-First Validation: Automated numerical parity verification against CPU reference
  • Performance Analysis: Roofline methodology for performance characterization
  • Iterative Optimization: Feedback-driven workflow for continuous improvement
  • Extensible Design: Support for multiple task types to explore workflow generality

Quick Start

Prerequisites

  • CUDA toolkit (for GPU kernels)
  • C++ compiler with C++17 support
  • Python 3.8+ (for analysis tools)

Building

# CPU benchmark
cd cpu_bench
make
./bench_kernel

# CUDA kernel
cd cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_cuda_000_baseline

# Matmul benchmark
cd matmul_bench/cpu
make
./bench_matmul_cpu

cd ../cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_matmul_cuda_000_baseline

Testing

# CPU benchmark tests
cd tests
pytest -v

# CUDA correctness test
cd cuda
VERIFY=1 ./bench_cuda_000_baseline

# CUDA performance benchmark
CSV=1 ./bench_cuda_000_baseline > results.csv
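The VERIFY=1 run checks numerical parity against the CPU reference. A minimal sketch of that kind of element-wise parity check in Python (the helper name, tolerance, and the simulated fp32 round-trip are illustrative, not the harness's actual code):

```python
import numpy as np

def max_rel_error(ref, test, eps=1e-12):
    # Elementwise relative error against the CPU reference,
    # guarded against division by zero.
    return float(np.max(np.abs(test - ref) / (np.abs(ref) + eps)))

# Toy stand-in for a GPU result: the reference rounded through float32,
# which should agree to roughly single precision.
rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 8, 8))
test = ref.astype(np.float32).astype(np.float64)
assert max_rel_error(ref, test) < 1e-6
```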

Project Structure

cuda-stencil-benchmark/
├── include/          # Interface definitions
├── tasks/            # Task specifications
├── prompts/          # LLM prompt templates
├── cpu_bench/        # CPU benchmarking (stencil)
├── cuda/             # CUDA kernels and harness (stencil)
├── matmul_bench/     # Matrix multiplication benchmarks
│   ├── cpu/          # CPU matmul reference
│   └── cuda/         # CUDA matmul kernels
├── analysis/         # Analysis tools
├── roofline/         # Roofline methodology
├── tests/            # Test framework
└── docs/             # Documentation

Documentation

Key Features

  • Systematic Benchmarking: Comprehensive performance sweeps and analysis
  • Correctness Verification: Automated numerical parity testing
  • Roofline Analysis: Quantitative performance characterization
  • Modular Design: Easy to add new tasks and kernel attempts
  • Workflow Exploration: Multiple task types to test workflow generality

Technical Highlights

  • Memory-Bound Optimization: Achieved 92% of roofline ceiling on T4 GPU through z-coalesced memory access patterns
  • Correctness-First Methodology: Automated parity testing identified optimization issues before deployment
  • Iterative LLM-Guided Development: Systematic workflow demonstrating multiple kernel attempts with validation
  • Performance Engineering: Roofline analysis provides quantitative understanding of memory vs compute bottlenecks
  • Extensible Architecture: Framework supports multiple computational patterns (stencil, matmul) to explore workflow generality
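For intuition on the z-coalesced access pattern mentioned above: storing z as the fastest-varying index means consecutive threads in a warp touch consecutive addresses. A small Python sketch of the index arithmetic (grid sizes are arbitrary, chosen only for illustration):

```python
# For a (nx, ny, nz) grid stored with z fastest, the linear index is
# z + nz * (y + ny * x): stepping z moves one element in memory, so a
# warp of threads walking z reads a contiguous, coalesced run.
nx, ny, nz = 4, 4, 8

def lin(x, y, z):
    return z + nz * (y + ny * x)

# Adjacent z elements are adjacent in memory (stride 1)...
assert lin(1, 2, 4) - lin(1, 2, 3) == 1
# ...while stepping x jumps a whole ny*nz plane.
assert lin(2, 2, 3) - lin(1, 2, 3) == ny * nz
```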

Results

The framework has been successfully applied to generate and validate multiple CUDA kernel implementations for 3D finite-difference stencil computations. Key achievements:

  • Baseline Performance: Achieved ~70.4 GF/s on T4 GPU, reaching ~92% of roofline ceiling
  • Correctness Validation: Automated parity testing identified optimization issues before deployment
  • Iterative Optimization: Multiple kernel attempts demonstrate systematic optimization workflow
  • Workflow Generality: Extended to matrix multiplication to explore broader applicability
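The roofline fraction comes from simple arithmetic: a memory-bound kernel's attainable throughput is its arithmetic intensity times memory bandwidth, capped by peak compute. A hedged sketch (the T4 peak figures below are published specs; the arithmetic intensity is a placeholder, so the resulting fraction is illustrative rather than the exact value in RESULTS.md):

```python
# Memory-bound roofline: attainable GF/s = min(peak compute, AI * bandwidth).
peak_gflops = 8100.0   # T4 FP32 peak, ~8.1 TFLOP/s (published spec)
bandwidth_gbs = 320.0  # T4 GDDR6 peak bandwidth (published spec)
ai = 0.25              # FLOP/byte -- placeholder, depends on the stencil

ceiling = min(peak_gflops, ai * bandwidth_gbs)  # memory-bound here
achieved = 70.4                                 # GF/s, from RESULTS.md
fraction = achieved / ceiling
```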

For detailed performance results, analysis, and insights, see RESULTS.md.

License

MIT License - See LICENSE file for details.

Author

Jason Larkin, PhD - Performance engineering for scientific computing
