Commit 7c7fcfa

Add automated performance regression testing system (AT-105)

Implement comprehensive performance regression testing infrastructure that integrates the existing llama-bench tool with CI/CD pipelines for automated performance monitoring and regression detection.

Key Components:

1. GitHub Actions Workflow (.github/workflows/performance-regression.yml)
   - Automated benchmarks on CPU, CUDA, and Metal backends
   - Baseline tracking using GitHub Actions cache
   - PR comments with regression reports
   - Configurable 5% degradation threshold

2. Performance Regression Detector (scripts/performance-regression-detector.py)
   - Analyzes llama-bench SQLite results
   - Compares metrics (tokens/s, latency) against baseline
   - Generates markdown and JSON reports
   - Creates alert flags for CI integration

3. Enhanced Comparison Script (scripts/compare-llama-bench.py)
   - Added CI automation support (--ci-mode)
   - Baseline management (--baseline-db, --save-baseline)
   - JSON export for automated processing (--json-output)
   - Maintains backward compatibility

4. Database Schema Extensions
   - New tables: performance_baselines, performance_history, regression_alerts, memory_leak_logs
   - Views for aggregated statistics
   - Migration scripts for schema updates
   - Historical performance tracking

5. Memory Leak Monitoring (scripts/memory-leak-monitor.py)
   - Integrates with llama-memory.h interfaces
   - Detects memory leaks and excessive usage
   - Parses benchmark logs for memory patterns
   - Generates memory monitoring reports

6. CMake Test Integration (tests/CMakeLists.txt)
   - Added performance test suite with 'performance' label
   - Integrated with existing test framework
   - Configurable benchmark parameters

Features:
- Automatic baseline establishment from base commits
- Multi-backend support (CPU, CUDA, Metal, Vulkan)
- Persistent SQLite storage for historical analysis
- Configurable thresholds and alert severity
- Memory consumption monitoring
- Comprehensive documentation

Testing:
- All Python scripts tested and validated
- Database migrations verified
- Workflow syntax validated
- Integration with existing llama-bench confirmed

Documentation:
- Complete system overview in docs/performance-regression-testing.md
- Usage examples and troubleshooting guide
- CI/CD integration instructions

Related: AT-105

Co-Authored-By: Alex Peng <[email protected]>

1 parent 661ae31 commit 7c7fcfa

File tree

8 files changed: +1900 −0 lines changed


.github/workflows/performance-regression.yml

Lines changed: 434 additions & 0 deletions
docs/performance-regression-testing.md

Lines changed: 366 additions & 0 deletions

# Performance Regression Testing

This document describes the automated performance regression testing system for llama.cpp, implemented as part of JIRA ticket AT-105.

## Overview

The performance regression testing system automatically detects performance degradations in llama.cpp by comparing benchmark results against established baselines. It integrates with GitHub Actions CI/CD pipelines and provides automated alerts when performance regressions exceed a configurable threshold (default: 5%).

## Components

### 1. GitHub Actions Workflow

**File:** `.github/workflows/performance-regression.yml`

The workflow runs performance benchmarks on different hardware backends (CPU, CUDA, Metal) for every pull request and push to master. It:

- Builds the `llama-bench` target
- Downloads a test model (TinyLlama 1.1B)
- Runs benchmarks with consistent parameters
- Compares results against cached baselines
- Posts results as PR comments
- Fails the build if regressions are detected

**Jobs:**
- `performance-cpu`: Runs on Ubuntu with the CPU backend
- `performance-cuda`: Runs on GPU runners (disabled by default)
- `performance-metal`: Runs on macOS with Apple Silicon

**Triggers:**
- Pull requests to any branch
- Pushes to the master branch
- Manual workflow dispatch

### 2. Performance Regression Detector

**File:** `scripts/performance-regression-detector.py`

A Python script that analyzes benchmark results and detects performance regressions.

**Usage:**
```bash
python3 scripts/performance-regression-detector.py \
    --baseline baseline.sqlite \
    --current current.sqlite \
    --threshold 5.0 \
    --output regression-report.md
```

**Features:**
- Compares multiple performance metrics (tokens/second, latency)
- Configurable regression threshold
- Generates markdown and JSON reports
- Creates a flag file when regressions are detected
- Integrates with the existing llama-bench SQLite schema

**Key Metrics:**
- `avg_ts`: Average tokens per second (higher is better)
- `avg_ns`: Average latency in nanoseconds (lower is better)
- `model_size`: Model memory footprint (lower is better)

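The detection step reduces each metric to a signed percent change against the baseline and checks it against the threshold, with the direction of "better" depending on the metric. The sketch below is illustrative, assuming simplified metric definitions; the real field names and per-metric settings live in `performance-regression-detector.py`.

```python
# Minimal sketch of the regression check; metric definitions are simplified
# from the ones described in this document.
PERFORMANCE_METRICS = {
    "avg_ts": {"higher_is_better": True},   # average tokens per second
    "avg_ns": {"higher_is_better": False},  # average latency in nanoseconds
}

def percent_change(baseline: float, current: float) -> float:
    """Signed change relative to the baseline, in percent."""
    return (current - baseline) / baseline * 100.0

def is_regression(metric: str, baseline: float, current: float,
                  threshold: float = 5.0) -> bool:
    change = percent_change(baseline, current)
    if PERFORMANCE_METRICS[metric]["higher_is_better"]:
        return change < -threshold   # e.g. tokens/s dropped by more than 5%
    return change > threshold        # e.g. latency grew by more than 5%

# 45.23 -> 42.15 tokens/s is a 6.81% drop, so this counts as a regression.
assert is_regression("avg_ts", 45.23, 42.15)
```
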
### 3. Enhanced Comparison Script

**File:** `scripts/compare-llama-bench.py` (enhanced)

The existing comparison script has been extended with CI automation support.

**New Features:**
- `--ci-mode`: Enable CI-specific formatting and behavior
- `--baseline-db`: Path to baseline database for tracking
- `--save-baseline`: Save current results as new baseline
- `--json-output`: Export comparison results to JSON

**Example:**
```bash
python3 scripts/compare-llama-bench.py \
    -i results.sqlite \
    --ci-mode \
    --json-output comparison.json
```

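In CI, the exported JSON can gate the build. The following is a hedged sketch of such a gate; the exact layout produced by `--json-output` is an assumption here, so check `compare-llama-bench.py` for the real field names before relying on it.

```python
# Hypothetical CI gate over the --json-output file. The assumed shape -- a
# list of per-benchmark entries carrying a signed percent change in a
# "change_pct" field -- is illustrative, not the script's documented format.
import json
import sys

with open("comparison.json") as f:
    entries = json.load(f)

regressions = [e for e in entries if e.get("change_pct", 0.0) < -5.0]
for entry in regressions:
    print(f"regression: {entry}")
if regressions:
    sys.exit(1)  # a non-zero exit fails the CI step
```
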
### 4. Database Schema Extensions

**Files:**
- `scripts/db-schema-migration.sql`: SQL migration script
- `scripts/apply-db-migration.py`: Migration application tool

The database schema has been extended to support:

**New Tables:**
- `performance_baselines`: Stores baseline snapshots
- `performance_history`: Historical performance data
- `regression_alerts`: Logged regression detections
- `memory_leak_logs`: Memory leak monitoring results

**Views:**
- `latest_baselines`: Active baseline information
- `regression_summary`: Aggregated regression statistics
- `memory_leak_summary`: Memory leak detection summary

**Applying Migrations:**
```bash
python3 scripts/apply-db-migration.py -d llama-bench.sqlite
```

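Once migrated, the new tables and views can be queried directly with Python's `sqlite3` module. The snippet below is a sketch: the table and view names come from this document, but any columns beyond `(baseline_name, commit_sha, created_at)` are defined in `db-schema-migration.sql`, and the `INSERT` mirrors the manual baseline example under "Manual Baseline Management" below.

```python
# Illustrative queries against the extended schema, using only the standard
# library. Column lists beyond those shown in this document are assumptions.
import sqlite3

con = sqlite3.connect("llama-bench.sqlite")
con.row_factory = sqlite3.Row

# Inspect aggregated regression statistics via the new view.
for row in con.execute("SELECT * FROM regression_summary"):
    print(dict(row))

# Record a baseline snapshot, as in the manual SQL shown later in this page.
con.execute(
    "INSERT INTO performance_baselines (baseline_name, commit_sha, created_at) "
    "VALUES (?, ?, datetime('now'))",
    ("v1.0", "7c7fcfa"),
)
con.commit()
con.close()
```
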
### 5. Memory Leak Monitoring

**File:** `scripts/memory-leak-monitor.py`

Integrates with the existing `llama-memory.h` interfaces to detect memory leaks and excessive memory consumption.

**Usage:**
```bash
python3 scripts/memory-leak-monitor.py \
    --benchmark-output benchmark.log \
    --test-log test.log \
    --database results.sqlite \
    --commit abc123 \
    --report memory-report.md
```

**Features:**
- Parses benchmark output for memory usage patterns
- Detects memory leaks (threshold: 1 MB)
- Monitors excessive memory usage (threshold: 16 GB)
- Logs results to the database
- Generates markdown reports

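The core check compares the first and last memory samples found in a log against the two thresholds above. The sketch below assumes a hypothetical log line format (`memory usage: <N> MB`); the actual parsing in `memory-leak-monitor.py` may differ.

```python
# Hedged sketch of the leak check. The regex and the log line format it
# matches are assumptions for illustration, not the script's actual parser.
import re

LEAK_THRESHOLD_MB = 1.0          # leak threshold from this document
EXCESSIVE_USAGE_MB = 16 * 1024   # 16 GB excessive-usage threshold

def check_memory(log_text: str) -> dict:
    samples = [float(m) for m in
               re.findall(r"memory usage:\s*([\d.]+)\s*MB", log_text)]
    if not samples:
        return {"status": "no-data"}
    leaked = samples[-1] - samples[0]
    return {
        "initial_mb": samples[0],
        "final_mb": samples[-1],
        "leaked_mb": round(leaked, 2),
        "leak_detected": leaked > LEAK_THRESHOLD_MB,
        "excessive_usage": max(samples) > EXCESSIVE_USAGE_MB,
    }

# Matches the sample report below: 1234.56 MB -> 1250.78 MB leaks 16.22 MB.
print(check_memory("memory usage: 1234.56 MB\nmemory usage: 1250.78 MB"))
```
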
**Memory Status Codes** (from `llama-memory.h`):
- `0`: `LLAMA_MEMORY_STATUS_SUCCESS`
- `1`: `LLAMA_MEMORY_STATUS_NO_UPDATE`
- `2`: `LLAMA_MEMORY_STATUS_FAILED_PREPARE`
- `3`: `LLAMA_MEMORY_STATUS_FAILED_COMPUTE`

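For log post-processing, these codes can be mirrored as a small enum. The mapping below uses exactly the names and values listed above; treating only the two `FAILED_*` codes as failures is a reasonable reading but an assumption of this sketch.

```python
# Status codes from llama-memory.h, mirrored for Python-side log analysis.
from enum import IntEnum

class LlamaMemoryStatus(IntEnum):
    SUCCESS = 0
    NO_UPDATE = 1
    FAILED_PREPARE = 2
    FAILED_COMPUTE = 3

def is_failure(status: int) -> bool:
    # Assumption: only the FAILED_* codes should flag a run as unhealthy.
    return LlamaMemoryStatus(status) in (
        LlamaMemoryStatus.FAILED_PREPARE,
        LlamaMemoryStatus.FAILED_COMPUTE,
    )
```
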
### 6. CMake Test Integration

**File:** `tests/CMakeLists.txt` (extended)

A new performance test target has been added:

```cmake
llama_test_cmd(
    ${CMAKE_BINARY_DIR}/bin/llama-bench
    NAME test-performance-regression-cpu
    LABEL "performance"
    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
    ARGS -p 512 -n 128 -r 3 -o sql
)
```

**Running Performance Tests:**
```bash
cd build
ctest -L performance --verbose
```

## Workflow

### For Pull Requests

1. A developer opens a PR with code changes
2. GitHub Actions triggers the performance regression workflow
3. The workflow:
   - Builds llama-bench with the PR code
   - Restores the baseline database from cache
   - If no baseline exists, creates one from the base commit
   - Runs benchmarks with the current code
   - Compares results using the regression detector
4. Results are posted as a PR comment
5. The build fails if regressions exceed the 5% threshold

### For Master Branch Commits

1. Code is merged to master
2. GitHub Actions runs the workflow
3. Benchmark results are cached as the new baseline
4. Historical data is stored in the database
5. Future PRs compare against this baseline

### Manual Baseline Management

**Creating a Baseline:**
```bash
# Run benchmarks
./build/bin/llama-bench -m model.gguf -p 512 -n 128 -r 3 -o sql | sqlite3 baseline.sqlite

# Save as baseline
python3 scripts/apply-db-migration.py -d baseline.sqlite
sqlite3 baseline.sqlite "INSERT INTO performance_baselines (baseline_name, commit_sha, created_at) VALUES ('v1.0', '$(git rev-parse HEAD)', '$(date -Iseconds)')"
```

**Comparing Against Baseline:**
```bash
# Run current benchmarks
./build/bin/llama-bench -m model.gguf -p 512 -n 128 -r 3 -o sql | sqlite3 current.sqlite

# Detect regressions
python3 scripts/performance-regression-detector.py \
    --baseline baseline.sqlite \
    --current current.sqlite \
    --threshold 5.0
```

## Configuration

### Environment Variables

- `REGRESSION_THRESHOLD`: Regression detection threshold in percent (default: 5.0)
- `BASELINE_DB`: Baseline database filename (default: `performance-baseline.sqlite`)
- `RESULTS_DB`: Results database filename (default: `performance-results.sqlite`)

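A script picking up these settings might resolve them as shown below; whether the Python tools read the variables directly or the workflow forwards them as command-line flags is an assumption of this sketch.

```python
# Hedged sketch: resolving the configuration variables with their documented
# defaults. The workflow may instead pass these values as CLI flags.
import os

threshold = float(os.environ.get("REGRESSION_THRESHOLD", "5.0"))
baseline_db = os.environ.get("BASELINE_DB", "performance-baseline.sqlite")
results_db = os.environ.get("RESULTS_DB", "performance-results.sqlite")

print(f"threshold={threshold}%  baseline={baseline_db}  results={results_db}")
```
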
### Workflow Customization

Edit `.github/workflows/performance-regression.yml` to:

- Change benchmark parameters (prompt length, generation tokens, repetitions)
- Add or remove backend configurations
- Modify the caching strategy
- Adjust model selection

### Threshold Configuration

The default 5% threshold can be adjusted per backend or per metric:

```python
# In performance-regression-detector.py
PERFORMANCE_METRICS = {
    "avg_ts": {
        "threshold": 5.0,  # Custom threshold for this metric
        # ... other per-metric settings
    }
}
```

## Reports

### Regression Report Format

```markdown
# Performance Regression Analysis Report

**Generated:** 2025-09-29 12:34:56
**Threshold:** 5.0%

## Summary
- Total Benchmarks Compared: 10
- Regressions Found: 2
- Improvements Found: 3
- Stable Benchmarks: 5

## ⚠️ Performance Regressions Detected

### TinyLlama-1.1B | backend:CPU | p:512 | g:128

⚠️ **Average Tokens/Second**:
- Baseline: 45.23 tokens/s
- Current: 42.15 tokens/s
- Change: ↓ 6.81%

...
```

### Memory Leak Report Format

```markdown
# Memory Leak Monitoring Report

**Generated:** 2025-09-29 12:34:56

## ⚠️ Memory Leaks Detected

### benchmark
- Initial Memory: 1234.56 MB
- Final Memory: 1250.78 MB
- Leaked: 16.22 MB
```

## Troubleshooting

### No Baseline Available

If the baseline cache is empty or expired:

1. The workflow will attempt to build the baseline from the base commit
2. If that fails, it will create a baseline from the current code
3. Subsequent runs will use this baseline

### False Positives

Regressions can be marked as false positives in the database:

```sql
UPDATE regression_alerts
SET status = 'false_positive', notes = 'Expected due to architectural change'
WHERE id = <alert_id>;
```

### Excessive Memory Usage Warnings

If memory usage exceeds the thresholds:

1. Review the memory leak report
2. Check for memory leaks using valgrind or similar tools
3. Adjust the threshold if the increased usage is legitimate

## Integration with CI/CD

### GitHub Actions Artifacts

The workflow uploads artifacts containing:
- Regression reports (markdown)
- SQLite databases (baseline and current)
- Memory leak reports

**Downloading Artifacts:**
```bash
gh run download <run-id> -n performance-report-cpu
```

### PR Comments

The workflow automatically comments on PRs with:
- A summary of regression detection
- Links to detailed reports
- Pass/fail status

### Build Status

The workflow sets the build status to:

- ✅ **Success**: No regressions detected
- ❌ **Failure**: Regressions exceed the threshold
- ⚠️ **Warning**: Issues detected but below the threshold

## Best Practices

1. **Run locally before PR**: Test performance changes locally before opening a pull request
2. **Review memory reports**: Check for memory leaks regularly
3. **Update baselines**: Refresh baselines after major changes
4. **Monitor trends**: Use historical data to identify gradual degradation
5. **Document exceptions**: Note expected performance changes in PR descriptions

## Future Enhancements

Potential improvements to the system:

- [ ] Add GPU-specific benchmarks when runners are available
- [ ] Implement trend analysis over multiple commits
- [ ] Add a visualization dashboard for historical performance
- [ ] Support custom benchmark configurations per PR
- [ ] Integrate with performance profiling tools
- [ ] Automate bisection to identify the commit that introduced a regression
- [ ] Add multi-model benchmark comparisons

## References

- [llama-bench documentation](../tools/llama-bench/README.md)
- [compare-llama-bench.py usage](../scripts/compare-llama-bench.py)
- [llama-memory.h interface](../src/llama-memory.h)
- [GitHub Actions workflows](../.github/workflows/)

## Support

For issues or questions:

- Check existing GitHub issues
- Review workflow run logs
- Examine generated reports
- Contact the performance testing team
