# IntelliPerf: AI-Powered GPU Performance Engineering Framework

IntelliPerf is a Python-based framework that uses Large Language Models (LLMs) to automatically analyze and optimize GPU kernel performance. It supports HIP/ROCm, Triton, and PyTorch applications, targeting bottlenecks like bank conflicts, memory access patterns, and atomic contention.

Always reference these instructions first, and fall back to search or bash commands only when you encounter unexpected information that does not match the info here.

## Working Effectively

### Quick Start (Container Recommended)
Use containers for full functionality including GPU-dependent features:
```bash
# Using Docker (recommended)
./docker/build.sh
./docker/run.sh

# Using Apptainer
./apptainer/build.sh
./apptainer/run.sh
```

### Development Installation (Basic Python Functionality)
For Python-only development without GPU dependencies:
```bash
# Install the main package (takes ~90 seconds)
pip install -e .

# Verify installation
intelliperf --help
```

### Full Dependencies Installation (Network-Intensive)
**WARNING**: This step frequently fails due to network timeouts. NEVER CANCEL builds - they may take 45+ minutes.
```bash
# Install external tools - NEVER CANCEL: Can take 45+ minutes. Set timeout to 60+ minutes.
python3 scripts/install_tool.py --all

# If network timeouts occur, this is expected - document as "may fail due to network limitations"
```

### Examples Build (Requires ROCm/HIP)
```bash
# Build examples - requires ROCm/HIP environment
cd examples
./scripts/build_examples.sh -c

# Clean build if needed
./scripts/build_examples.sh -c --clean

# Verbose build for debugging
./scripts/build_examples.sh -c --verbose
```

## Core Development Commands

### Code Quality (Always Run Before Committing)
```bash
# Install linting tools
pip install ruff==0.3.0

# Check code style (fast, <1 second)
ruff check .

# Fix auto-fixable issues
ruff check . --fix

# Format code
ruff format .
```

### Pre-commit Hooks (May Fail Due to Network Issues)
```bash
pip install pre-commit==3.6.0
pre-commit install

# Run all hooks - NEVER CANCEL: Takes 2-5 minutes. Set timeout to 10+ minutes.
# NOTE: May fail due to network timeouts - this is expected in some environments
pre-commit run --all-files
```

### Testing
```bash
# Note: Most tests require GPU hardware and a ROCm environment
# Basic test check (will fail without GPU libraries but shows test structure)
python -m pytest tests/ -v

# Shell-based integration tests (require built examples)
./tests/test_matrix_transpose.sh
```

## IntelliPerf Usage Patterns

### Diagnose Only (Works Without GPU Optimization)
```bash
# Diagnose HIP application
intelliperf --formula=diagnoseOnly -- ./examples/build/access_pattern/uncoalesced

# Diagnose PyTorch application
intelliperf --formula=diagnoseOnly -- python ./examples/torch/add.py

# Diagnose Triton application
TRITON_DISABLE_LINE_INFO=0 intelliperf --formula=diagnoseOnly -- python ./examples/triton/reduce.py
```

### Full Optimization (Requires LLM API Key and GPU)
```bash
# Set required environment variable
export LLM_GATEWAY_KEY="your_api_key_here"

# Memory access optimization
intelliperf --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  --formula=memoryAccess -- ./build/access_pattern/uncoalesced

# Bank conflict optimization
intelliperf --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  --formula=bankConflict -- ./build/bank_conflict/matrix_transpose 1024 1024

# Atomic contention optimization
intelliperf --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  --instrument_command="./scripts/build_examples.sh -i -c" \
  --formula=atomicContention -- ./build/contention/reduction
```

## Manual Validation Requirements

**CRITICAL**: After making any changes to IntelliPerf, ALWAYS run through these complete validation scenarios:

### 1. Memory Access Pattern Validation
```bash
# Test uncoalesced memory access detection and optimization
intelliperf --formula=memoryAccess --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  -- ./build/access_pattern/uncoalesced

# Verify: Should show memory coalescing improvements and performance gains
```

### 2. Bank Conflict Validation
```bash
# Test shared memory bank conflict detection and optimization
intelliperf --formula=bankConflict --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  -- ./build/bank_conflict/matrix_transpose 1024 1024

# Verify: Should show bank conflict reduction and speedup
```

### 3. Atomic Contention Validation
```bash
# Test atomic operation contention detection and optimization
intelliperf --formula=atomicContention --project_directory=./examples \
  --build_command="./scripts/build_examples.sh -c" \
  --instrument_command="./scripts/build_examples.sh -i -c" \
  -- ./build/contention/reduction

# Verify: Should show atomic contention reduction and performance improvement
```

### 4. Multi-Backend Diagnose Validation
```bash
# Test HIP application analysis
intelliperf --formula=diagnoseOnly -- ./examples/build/access_pattern/uncoalesced

# Test PyTorch application analysis
intelliperf --formula=diagnoseOnly -- python ./examples/torch/add.py

# Test Triton application analysis
TRITON_DISABLE_LINE_INFO=0 intelliperf --formula=diagnoseOnly -- python ./examples/triton/reduce.py

# Verify: All should generate valid performance analysis JSON output
```

## Critical Timing and Timeout Information

### Build Commands - NEVER CANCEL
- **Python package install**: 90 seconds normal, set timeout to 3+ minutes
- **External tools install**: 45+ minutes normal, set timeout to 60+ minutes
- **Examples build**: 5-10 minutes normal, set timeout to 15+ minutes
- **Pre-commit setup**: 2-5 minutes normal, set timeout to 10+ minutes
- **IntelliPerf optimization runs**: 10-30 minutes normal, set timeout to 45+ minutes
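
These limits can be enforced mechanically instead of by hand. A minimal sketch, assuming GNU coreutils `timeout` is available (the `run_with_timeout` helper is illustrative, not part of IntelliPerf):

```bash
# Hypothetical helper: run a long build step under a generous time limit
# instead of cancelling it manually. GNU coreutils `timeout` exits with
# status 124 when the limit is reached.
run_with_timeout() {
  local limit="$1"; shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "Exceeded ${limit}: $*" >&2
  fi
  return "$rc"
}

# Example: allow the external tools install up to 60 minutes
# run_with_timeout 60m python3 scripts/install_tool.py --all
```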

### Network Issues (Expected)
- External dependency installation frequently fails due to network timeouts
- Pre-commit hooks may fail to install due to PyPI timeouts
- Document these as "may fail due to network limitations" rather than attempting a fix
- Use containers for a reliable development environment
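
Before documenting a failure as a network limitation, one automated retry pass can rule out a transient timeout. A sketch only; the `retry` helper and `RETRY_DELAY` variable are hypothetical, not part of the repository:

```bash
# Hypothetical retry helper for network-flaky steps: re-run a command a
# fixed number of times, pausing between attempts, before giving up.
retry() {
  local attempts="$1"; shift
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Giving up after ${attempts} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-5}"  # pause between attempts (override for testing)
  done
}

# Example: retry the external tools install up to 3 times
# retry 3 python3 scripts/install_tool.py --all
```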

## Repository Structure

### Key Directories
```
src/intelliperf/              # Main Python package
src/accordo/                  # Validation and correctness checking
examples/                     # Test applications in HIP, Triton, PyTorch
  scripts/build_examples.sh   # Example build system
external/                     # External dependencies (rocprofiler-compute, omniprobe, nexus)
tests/                        # Integration tests (require GPU hardware)
.github/workflows/            # CI that runs on AMD GPU droplets
```

### Configuration Files
- `pyproject.toml` - Python dependencies and tool configuration
- `.pre-commit-config.yaml` - Code quality hooks
- `.github/workflows/ci.yml` - Full GPU-based testing pipeline
- `docker/` and `apptainer/` - Container definitions

## Environment Requirements

### Minimal (Python Development)
- Python 3.8+
- pip

### Full Functionality
- ROCm/HIP environment
- AMD GPU hardware (tested on MI300X)
- Network access for dependency installation
- LLM API key for optimization features
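
A quick sanity check before choosing an installation path can be sketched as follows. The `/opt/rocm` path is the conventional ROCm install location and an assumption here, not something IntelliPerf mandates:

```bash
# Sketch: decide between Python-only and full-functionality setup.
check_env() {
  # Python 3.8+ is the minimal requirement
  python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 8) else 1)' \
    || { echo "Python 3.8+ required" >&2; return 1; }
  # ROCm/HIP presence gates the GPU-dependent features
  if command -v hipcc >/dev/null 2>&1 || [ -d /opt/rocm ]; then
    echo "ROCm/HIP detected: full functionality available"
  else
    echo "ROCm/HIP not found: Python-only features available"
  fi
}
```

Running `check_env` in a container should report full functionality; on a plain development laptop it should fall through to the Python-only branch.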

## Common Issues and Solutions

### "ROCm not found" Error
- Expected in non-GPU environments
- Use containers for full GPU functionality
- Python-only features still work (CLI, some validation)

### Network Timeout Errors
- Very common with `python3 scripts/install_tool.py --all`
- Expected with pre-commit installation
- Document as a limitation rather than trying to fix it
- Use containers, which have dependencies pre-installed

### Test Failures Without GPU
- Expected - most tests require GPU hardware
- CI runs on actual AMD GPU droplets
- Focus on code quality checks for local development

### Performance Validation
- Always test at least one complete optimization scenario after changes
- Verify JSON output contains expected performance metrics
- Check that both correctness and performance validation pass
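
The JSON verification step can be partly mechanical. A minimal sketch; the report path in the example is hypothetical, and since the metric field names depend on the formula used, this only confirms that parsable JSON was produced:

```bash
# Hypothetical check: confirm an IntelliPerf report parses as JSON.
check_json() {
  if python3 -m json.tool "$1" >/dev/null 2>&1; then
    echo "valid JSON: $1"
  else
    echo "invalid or missing JSON: $1" >&2
    return 1
  fi
}

# Example (path is illustrative):
# check_json results/memory_access_report.json
```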

## CI Integration

The CI system (`.github/workflows/ci.yml`) runs comprehensive tests on AMD GPU hardware:
- Spins up GPU droplets with MI300X hardware
- Installs the full dependency chain
- Tests all optimization formulas
- Validates correctness and performance improvements
- NEVER CANCEL: CI can take 45+ minutes including droplet provisioning

Always ensure your changes pass both local code quality checks and will work in the GPU CI environment.