A framework for LLM-guided generation, benchmarking, and optimization of high-performance CUDA kernels, with automated correctness validation and roofline-based performance characterization.
Primary Goal: Produce correct and performant CUDA kernels for complex 3D finite-difference stencil computations.
Broader Purpose: Develop a generalizable workflow for LLM-guided kernel generation that can be applied to other computational kernels. The framework includes simpler examples (e.g., matrix multiplication) to guide development and explore the boundaries of workflow generality.
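To make the target computation concrete, a minimal CPU reference for a 7-point 3D finite-difference stencil (second-order central differences) might look like the sketch below. The function name, grid shape, and coefficients are illustrative, not the framework's actual reference implementation.

```python
# Illustrative 7-point 3D stencil reference (hypothetical; not the
# framework's actual CPU reference). Flat row-major (x, y, z) layout.

def stencil_7pt(u, nx, ny, nz, c0=-6.0, c1=1.0):
    """One sweep of a 7-point stencil over a flat row-major grid."""
    out = list(u)  # boundary points are copied through unchanged
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                idx = (i * ny + j) * nz + k
                out[idx] = (c0 * u[idx]
                            + c1 * (u[idx - 1] + u[idx + 1]                  # z neighbors
                                    + u[idx - nz] + u[idx + nz]              # y neighbors
                                    + u[idx - ny * nz] + u[idx + ny * nz]))  # x neighbors
    return out

# Sanity check: on a constant field, every interior point maps to zero.
n = 4
u = [1.0] * (n * n * n)
v = stencil_7pt(u, n, n, n)
```

The innermost loop runs over z, the unit-stride dimension, which is also what makes z-coalesced access the natural layout on the GPU side.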
This framework provides:
- LLM-Guided Kernel Generation: Systematic workflow for generating optimized CUDA kernels from task specifications
- Correctness-First Validation: Automated numerical parity verification against CPU reference
- Performance Analysis: Roofline methodology for performance characterization
- Iterative Optimization: Feedback-driven workflow for continuous improvement
- Extensible Design: Support for multiple task types to explore workflow generality
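The correctness-first step reduces to an elementwise parity check of GPU output against the CPU reference within a tolerance. A minimal sketch, with an illustrative tolerance (the framework's actual thresholds may differ):

```python
def max_error(ref, got):
    """Worst elementwise error: absolute near zero, relative otherwise."""
    worst = 0.0
    for r, g in zip(ref, got):
        err = abs(r - g) / max(abs(r), 1.0)
        worst = max(worst, err)
    return worst

def passes_parity(ref, got, tol=1e-6):
    """True if outputs agree elementwise within tol (tolerance illustrative)."""
    return len(ref) == len(got) and max_error(ref, got) <= tol
```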
Requirements:
- CUDA toolkit (for GPU kernels)
- C++ compiler with C++17 support
- Python 3.8+ (for analysis tools)
# CPU benchmark
cd cpu_bench
make
./bench_kernel
# CUDA kernel
cd cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_cuda_000_baseline
# Matmul benchmark
cd matmul_bench/cpu
make
./bench_matmul_cpu
cd ../cuda
make ATTEMPT=000_baseline ARCH=sm_75
./bench_matmul_cuda_000_baseline
# CPU benchmark tests
cd tests
pytest -v
# CUDA correctness test
cd cuda
VERIFY=1 ./bench_cuda_000_baseline
# CUDA performance benchmark
CSV=1 ./bench_cuda_000_baseline > results.csv
cuda-stencil-benchmark/
├── include/ # Interface definitions
├── tasks/ # Task specifications
├── prompts/ # LLM prompt templates
├── cpu_bench/ # CPU benchmarking (stencil)
├── cuda/ # CUDA kernels and harness (stencil)
├── matmul_bench/ # Matrix multiplication benchmarks
│ ├── cpu/ # CPU matmul reference
│ └── cuda/ # CUDA matmul kernels
├── analysis/ # Analysis tools
├── roofline/ # Roofline methodology
├── tests/ # Test framework
└── docs/ # Documentation
Documentation:
- GOAL.md: Project objectives and success criteria
- DESIGN.md: System architecture and design decisions
- WORKFLOW.md: LLM-guided kernel generation workflow
- RESULTS.md: Performance results and analysis
- cpu_baseline.md: CPU baseline characterization and validation
- roofline.md: Roofline methodology for performance analysis
- stencil_order.md: Stencil discretization order verification
- GIT_WORKFLOW.md: Git branching strategy and development workflow
Key features:
- Systematic Benchmarking: Comprehensive performance sweeps and analysis
- Correctness Verification: Automated numerical parity testing
- Roofline Analysis: Quantitative performance characterization
- Modular Design: Easy to add new tasks and kernel attempts
- Workflow Exploration: Multiple task types to test workflow generality
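The roofline ceiling used for characterization is the minimum of the compute peak and the bandwidth-limited rate at the kernel's arithmetic intensity. A sketch with illustrative T4-class numbers (not measured values from this project):

```python
def roofline_gflops(ai_flops_per_byte, peak_gflops, bw_gbs):
    """Attainable GFLOP/s at a given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, bw_gbs * ai_flops_per_byte)

# Hypothetical FP32 peak and DRAM bandwidth, for illustration only.
peak, bw = 8100.0, 300.0
# A low-intensity stencil is capped by bandwidth, not by peak compute.
ceiling = roofline_gflops(0.25, peak, bw)   # bandwidth-bound: 300 * 0.25
```

Measured GF/s divided by this ceiling gives the "% of roofline" figures reported below.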
- Memory-Bound Optimization: Achieved ~92% of the roofline ceiling on a T4 GPU through z-coalesced memory access patterns
- Correctness-First Methodology: Automated parity testing identified optimization issues before deployment
- Iterative LLM-Guided Development: Systematic workflow demonstrating multiple kernel attempts with validation
- Performance Engineering: Roofline analysis provides quantitative understanding of memory vs compute bottlenecks
- Extensible Architecture: Framework supports multiple computational patterns (stencil, matmul) to explore workflow generality
The framework has been successfully applied to generate and validate multiple CUDA kernel implementations for 3D finite-difference stencil computations. Key achievements:
- Baseline Performance: Achieved ~70.4 GF/s on an NVIDIA T4 GPU, ~92% of the roofline ceiling
- Correctness Validation: Automated parity testing against the CPU reference caught optimization errors before deployment
- Iterative Optimization: Multiple kernel attempts demonstrate systematic optimization workflow
- Workflow Generality: Extended to matrix multiplication to explore broader applicability
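For the matmul task, the CPU reference is just the triple loop. A minimal sketch (row-major flat arrays; names illustrative, not the framework's actual code):

```python
def matmul_ref(A, B, m, k, n):
    """Naive row-major reference: C[i][j] = sum_p A[i][p] * B[p][j]."""
    C = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i * k + p] * B[p * n + j]
            C[i * n + j] = s
    return C
```

The same parity-check workflow applies: every CUDA matmul attempt is compared elementwise against this kind of reference before its performance is recorded.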
For detailed performance results, analysis, and insights, see RESULTS.md.
MIT License - See LICENSE file for details.
Jason Larkin, PhD - Performance engineering for scientific computing