A framework for LLM-guided generation, benchmarking, and optimization of high-performance CUDA kernels, with automated correctness validation and roofline-based performance characterization.
Primary Goal: Produce correct and performant CUDA kernels for complex 3D finite-difference stencil computations.
Broader Purpose: Develop a generalizable workflow for LLM-guided kernel generation that can be applied to other computational kernels. The framework includes simpler examples (e.g., matrix multiplication) to guide development and explore the boundaries of workflow generality.
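To make the target computation concrete, a minimal CPU reference for a 7-point 3D finite-difference stencil (second-order central differences) might look like the sketch below. The function name, grid shape, and coefficients are illustrative, not the framework's actual reference implementation.

```python
# Illustrative 7-point 3D stencil reference (hypothetical; not the
# framework's actual CPU reference). Flat row-major (x, y, z) layout.

def stencil_7pt(u, nx, ny, nz, c0=-6.0, c1=1.0):
    """One sweep of a 7-point stencil over a flat row-major grid."""
    out = list(u)  # boundary points are copied through unchanged
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                idx = (i * ny + j) * nz + k
                out[idx] = (c0 * u[idx]
                            + c1 * (u[idx - 1] + u[idx + 1]                  # z neighbors
                                    + u[idx - nz] + u[idx + nz]              # y neighbors
                                    + u[idx - ny * nz] + u[idx + ny * nz]))  # x neighbors
    return out

# Sanity check: on a constant field, every interior point maps to zero.
n = 4
u = [1.0] * (n * n * n)
v = stencil_7pt(u, n, n, n)
```

The innermost loop runs over z, the unit-stride dimension, which is also what makes z-coalesced access the natural layout on the GPU side.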
This framework provides:
- LLM-Guided Kernel Generation: Systematic workflow for generating optimized CUDA kernels from task specifications
- Correctness-First Validation: Automated numerical parity verification against CPU reference
- Performance Analysis: Roofline methodology for performance characterization
- Iterative Optimization: Feedback-driven workflow for continuous improvement
- Extensible Design: Support for multiple task types to explore workflow generality
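The correctness-first step reduces to an elementwise parity check of GPU output against the CPU reference within a tolerance. A minimal sketch, with an illustrative tolerance (the framework's actual thresholds may differ):

```python
def max_error(ref, got):
    """Worst elementwise error: absolute near zero, relative otherwise."""
    worst = 0.0
    for r, g in zip(ref, got):
        err = abs(r - g) / max(abs(r), 1.0)
        worst = max(worst, err)
    return worst

def passes_parity(ref, got, tol=1e-6):
    """True if outputs agree elementwise within tol (tolerance illustrative)."""
    return len(ref) == len(got) and max_error(ref, got) <= tol
```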
Requirements:
- CUDA toolkit (for GPU kernels)
- C++ compiler with C++17 support
- Python 3.8+ (for analysis tools)
# CPU benchmark
cd cpu_bench
make
./bench_kernel
# CUDA kernel
cd cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_cuda_000_baseline
# Matmul benchmark
cd matmul_bench/cpu
make
./bench_matmul_cpu
cd ../cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_matmul_cuda_000_baseline
# CPU benchmark tests
cd tests
pytest -v
# CUDA correctness test
cd cuda
VERIFY=1 ./bench_cuda_000_baseline
# CUDA performance benchmark
CSV=1 ./bench_cuda_000_baseline > results.csv
cuda-stencil-benchmark/
├── include/ # Interface definitions
├── tasks/ # Task specifications
├── prompts/ # LLM prompt templates
├── cpu_bench/ # CPU benchmarking (stencil)
├── cuda/ # CUDA kernels and harness (stencil)
├── matmul_bench/ # Matrix multiplication benchmarks
│ ├── cpu/ # CPU matmul reference
│ └── cuda/ # CUDA matmul kernels
├── analysis/ # Analysis tools
├── roofline/ # Roofline methodology
├── tests/ # Test framework
└── docs/ # Documentation
Documentation:
- GOAL.md: Project objectives and success criteria
- DESIGN.md: System architecture and design decisions
- WORKFLOW.md: LLM-guided kernel generation workflow
- RESULTS.md: Performance results and analysis
- cpu_baseline.md: CPU baseline characterization and validation
- roofline.md: Roofline methodology for performance analysis
- stencil_order.md: Stencil discretization order verification
- GIT_WORKFLOW.md: Git branching strategy and development workflow
Key features:
- Systematic Benchmarking: Comprehensive performance sweeps and analysis
- Correctness Verification: Automated numerical parity testing
- Roofline Analysis: Quantitative performance characterization
- Modular Design: Easy to add new tasks and kernel attempts
- Workflow Exploration: Multiple task types to test workflow generality
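The roofline ceiling used for characterization is the minimum of the compute peak and the bandwidth-limited rate at the kernel's arithmetic intensity. A sketch with illustrative T4-class numbers (not measured values from this project):

```python
def roofline_gflops(ai_flops_per_byte, peak_gflops, bw_gbs):
    """Attainable GFLOP/s at a given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, bw_gbs * ai_flops_per_byte)

# Hypothetical FP32 peak and DRAM bandwidth, for illustration only.
peak, bw = 8100.0, 300.0
# A low-intensity stencil is capped by bandwidth, not by peak compute.
ceiling = roofline_gflops(0.25, peak, bw)   # bandwidth-bound: 300 * 0.25
```

Measured GF/s divided by this ceiling gives the "% of roofline" figures reported below.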
- Memory-Bound Optimization: Achieved ~92% of the roofline ceiling on a T4 GPU through z-coalesced memory access patterns
- Correctness-First Methodology: Automated parity testing identified optimization issues before deployment
- Iterative LLM-Guided Development: Systematic workflow demonstrating multiple kernel attempts with validation
- Performance Engineering: Roofline analysis provides quantitative understanding of memory vs compute bottlenecks
- Extensible Architecture: Framework supports multiple computational patterns (stencil, matmul) to explore workflow generality
The framework has been successfully applied to generate and validate multiple CUDA kernel implementations for 3D finite-difference stencil computations. Key achievements:
- Baseline Performance: Achieved ~70.4 GF/s on an NVIDIA T4 GPU, ~92% of the roofline ceiling
- Correctness Validation: Automated parity testing against the CPU reference caught optimization errors before deployment
- Iterative Optimization: Multiple kernel attempts demonstrate systematic optimization workflow
- Workflow Generality: Extended to matrix multiplication to explore broader applicability
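For the matmul task, the CPU reference is just the triple loop. A minimal sketch (row-major flat arrays; names illustrative, not the framework's actual code):

```python
def matmul_ref(A, B, m, k, n):
    """Naive row-major reference: C[i][j] = sum_p A[i][p] * B[p][j]."""
    C = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i * k + p] * B[p * n + j]
            C[i * n + j] = s
    return C
```

The same parity-check workflow applies: every CUDA matmul attempt is compared elementwise against this kind of reference before its performance is recorded.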
For detailed performance results, analysis, and insights, see RESULTS.md.
MIT License - See LICENSE file for details.
Jason Larkin, PhD - Performance engineering for scientific computing