Performance benchmarking tools for measuring and comparing cold start times across different code changes.
# Run benchmark on current branch
uv run pytest tests/test_performance/test_cold_start.py
# Compare two branches
./scripts/benchmark_cold_start.sh main my-feature-branch
# Compare two existing result files
uv run python scripts/compare_benchmarks.py benchmark_results/cold_start_baseline.json benchmark_results/cold_start_latest.json
- Import times: import runpod, import runpod.serverless, import runpod.endpoint
- Module counts: Total modules loaded and runpod-specific modules
- Lazy loading status: Whether paramiko and SSH CLI are eagerly or lazy-loaded
- Statistics: Min, max, mean, median across 10 iterations per measurement
Core benchmark test that measures import performance in fresh Python subprocesses.
# Run as pytest test
uv run pytest tests/test_performance/test_cold_start.py -v
# Run as standalone script
uv run python tests/test_performance/test_cold_start.py
# Results saved to:
# - benchmark_results/cold_start_<timestamp>.json
# - benchmark_results/cold_start_latest.json (always latest)
Output Example:
Running cold start benchmarks...
------------------------------------------------------------
Measuring 'import runpod'...
Mean: 273.29ms
Measuring 'import runpod.serverless'...
Mean: 332.18ms
Counting loaded modules...
Total modules: 582
Runpod modules: 46
Checking if paramiko is eagerly loaded...
Paramiko loaded: False
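The subprocess-based measurement can be sketched roughly as follows. This is a minimal illustration, not the actual contents of test_cold_start.py: the helper name `time_import` is hypothetical, and because the parent process times the whole subprocess, each sample includes interpreter startup as well as the import itself.

```python
import statistics
import subprocess
import sys
import time


def time_import(statement: str, iterations: int = 10) -> dict:
    """Run `statement` in fresh interpreters and return timing stats in ms.

    Hypothetical helper: each sample covers interpreter startup plus the import.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        # A new process starts with an empty sys.modules, so no import is reused.
        subprocess.run(
            [sys.executable, "-c", statement],
            check=True,
            capture_output=True,
        )
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "min": round(min(samples), 2),
        "max": round(max(samples), 2),
        "mean": round(statistics.mean(samples), 2),
        "median": round(statistics.median(samples), 2),
        "iterations": iterations,
    }


if __name__ == "__main__":
    # "import json" keeps the sketch runnable anywhere; substitute "import runpod".
    print(time_import("import json", iterations=3))
```

The returned keys mirror the per-measurement fields stored in the result JSON (min, max, mean, median, iterations).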
Automated benchmark runner that handles git branch switching, dependency installation, and result collection.
# Run on current branch (no git operations)
./scripts/benchmark_cold_start.sh
# Run on specific branch
./scripts/benchmark_cold_start.sh main
# Compare two branches (runs both, then compares)
./scripts/benchmark_cold_start.sh main feature/lazy-loading
Features:
- Automatic stash/unstash of uncommitted changes
- Dependency installation per branch
- Safe branch switching with restoration
- Timestamped result files
- Automatic comparison when comparing branches
Safety:
- Stashes uncommitted changes before switching branches
- Restores original branch after completion
- Handles errors gracefully
Analyzes and visualizes differences between two benchmark runs with colored terminal output.
uv run python scripts/compare_benchmarks.py <baseline.json> <optimized.json>
Output Example:
======================================================================
COLD START BENCHMARK COMPARISON
======================================================================
IMPORT TIME COMPARISON
----------------------------------------------------------------------
Metric                Baseline     Optimized    Δ ms          Δ %
----------------------------------------------------------------------
runpod_total          285.64ms     273.29ms     ↓ 12.35ms     4.32%
runpod_serverless     376.33ms     395.14ms     ↑ -18.81ms    -5.00%
runpod_endpoint       378.61ms     399.36ms     ↑ -20.75ms    -5.48%
MODULE LOAD COMPARISON
----------------------------------------------------------------------
Total modules loaded:
Baseline: 698 Optimized: 582 Δ: 116
Runpod modules loaded:
Baseline: 48 Optimized: 46 Δ: 2
LAZY LOADING STATUS
----------------------------------------------------------------------
Paramiko Baseline: LOADED Optimized: NOT LOADED ✓ NOW LAZY
SSH CLI Baseline: LOADED Optimized: NOT LOADED ✓ NOW LAZY
======================================================================
SUMMARY
======================================================================
✓ Cold start improved by 12.35ms
✓ That's a 4.3% improvement over baseline
✓ Baseline: 285.64ms → Optimized: 273.29ms
======================================================================
Color coding:
- Green: Improvements (faster times, lazy loading achieved)
- Red: Regressions (slower times, eager loading introduced)
- Yellow: No change
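The Δ columns in the example output are consistent with "baseline minus optimized", so a positive value means the second run was faster. A minimal sketch of that arithmetic, assuming the result dictionaries follow the documented JSON structure (the function names `compare_means` and `classify` are hypothetical and not taken from compare_benchmarks.py):

```python
def compare_means(baseline: dict, optimized: dict) -> dict:
    """Return {metric: (delta_ms, delta_pct)}; positive values mean faster."""
    rows = {}
    for metric, base_stats in baseline["measurements"].items():
        opt_stats = optimized["measurements"].get(metric)
        if opt_stats is None:
            continue  # metric is missing from the second run
        delta_ms = base_stats["mean"] - opt_stats["mean"]
        delta_pct = delta_ms / base_stats["mean"] * 100
        rows[metric] = (round(delta_ms, 2), round(delta_pct, 2))
    return rows


def classify(delta_ms: float) -> str:
    """Map a delta to the color categories listed above."""
    if delta_ms > 0:
        return "green"  # improvement
    if delta_ms < 0:
        return "red"  # regression
    return "yellow"  # no change


if __name__ == "__main__":
    base = {"measurements": {"runpod_total": {"mean": 285.64}}}
    opt = {"measurements": {"runpod_total": {"mean": 273.29}}}
    for name, (ms, pct) in compare_means(base, opt).items():
        print(name, ms, pct, classify(ms))
```

Whether the real script treats very small differences as "no change" is not stated here; this sketch only treats an exact zero that way.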
All benchmark results are saved to benchmark_results/ (gitignored).
File naming:
- cold_start_<timestamp>.json - Timestamped result
- cold_start_latest.json - Always contains the most recent result
- cold_start_baseline.json - Manually saved baseline for comparison
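The naming scheme above amounts to writing each result twice: once under a timestamped name and once as cold_start_latest.json. A minimal sketch, assuming the timestamp is integer epoch seconds (the real format is not specified here) and using a hypothetical helper name `save_result`:

```python
import json
import time
from pathlib import Path


def save_result(result: dict, out_dir: str = "benchmark_results") -> Path:
    """Write a timestamped copy and refresh cold_start_latest.json."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(result, indent=2)
    # Assumption: integer epoch seconds as the <timestamp> part of the name.
    stamped = directory / f"cold_start_{int(time.time())}.json"
    stamped.write_text(payload)
    # The "latest" file is overwritten on every run.
    (directory / "cold_start_latest.json").write_text(payload)
    return stamped
```

The baseline file is deliberately not touched, since it is saved manually.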
JSON structure:
{
  "timestamp": 1763179522.0437188,
  "python_version": "3.8.20 (default, Oct 2 2024, 16:12:59) [Clang 18.1.8 ]",
  "measurements": {
    "runpod_total": {
      "min": 375.97,
      "max": 527.9,
      "mean": 393.91,
      "median": 380.4,
      "iterations": 10
    }
  },
  "module_counts": {
    "total": 698,
    "filtered": 48
  },
  "paramiko_eagerly_loaded": true,
  "ssh_cli_loaded": true
}
# 1. Save baseline on main branch
git checkout main
./scripts/benchmark_cold_start.sh
cp benchmark_results/cold_start_latest.json benchmark_results/cold_start_baseline.json
# 2. Switch to feature branch
git checkout feature/my-optimization
# 3. Run benchmark and compare
./scripts/benchmark_cold_start.sh
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_baseline.json \
    benchmark_results/cold_start_latest.json
# Compare three different optimization branches
./scripts/benchmark_cold_start.sh main > results_main.txt
./scripts/benchmark_cold_start.sh feature/approach-1 > results_1.txt
./scripts/benchmark_cold_start.sh feature/approach-2 > results_2.txt
# Then compare each against baseline
uv run python scripts/compare_benchmarks.py \
    benchmark_results/cold_start_main_*.json \
    benchmark_results/cold_start_approach-1_*.json
Add to your GitHub Actions workflow:
- name: Run cold start benchmark
  run: |
    uv run pytest tests/test_performance/test_cold_start.py --timeout=120
- name: Upload benchmark results
  uses: actions/upload-artifact@v4
  with:
    name: benchmark-results
    path: benchmark_results/cold_start_latest.json
Based on testing with Python 3.8:
- Cold start (import runpod): < 300ms (mean)
- Serverless import: < 400ms (mean)
- Module count: < 600 total modules
- Test assertion: Fails if import > 1000ms
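These targets can be checked against a saved result file. The sketch below is illustrative only: `check_targets` is a hypothetical helper, and the `runpod_serverless` measurement key is assumed from the metric names shown in the comparison output rather than confirmed from the JSON example.

```python
# Targets from the list above; the keys are assumed metric names.
TARGETS_MS = {"runpod_total": 300.0, "runpod_serverless": 400.0}
MAX_TOTAL_MODULES = 600


def check_targets(result: dict) -> list:
    """Return human-readable target violations (empty list if all pass)."""
    problems = []
    measurements = result.get("measurements", {})
    for metric, limit in TARGETS_MS.items():
        stats = measurements.get(metric)
        if stats is not None and stats["mean"] >= limit:
            problems.append(f"{metric}: mean {stats['mean']}ms is not below {limit}ms")
    total = result.get("module_counts", {}).get("total")
    if total is not None and total >= MAX_TOTAL_MODULES:
        problems.append(f"module count {total} is not below {MAX_TOTAL_MODULES}")
    return problems


if __name__ == "__main__":
    sample = {
        "measurements": {"runpod_total": {"mean": 393.91}},
        "module_counts": {"total": 698},
    }
    for problem in check_targets(sample):
        print(problem)
```

Metrics missing from the file are skipped rather than reported, so a partial result does not produce false failures.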
Subprocess-based measurements have inherent variance:
- First run in sequence: Often 20-50ms slower (Python startup overhead)
- Subsequent runs: More stable
- Use median or mean for comparison, not single runs
- Fewer modules = faster cold start: Each module has import overhead
- Runpod-specific modules: Should be minimal (40-50)
- Total modules: Includes stdlib and dependencies
- Target reduction: Removing 100+ modules typically saves 10-30ms
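Module counts like these can be obtained by inspecting sys.modules at the end of a fresh interpreter run. A minimal sketch, assuming a hypothetical helper `count_modules`; the returned keys follow the `module_counts` fields in the result JSON, but the counting logic in the real test may differ:

```python
import subprocess
import sys


def count_modules(statement: str, prefix: str) -> dict:
    """Count modules loaded after `statement` runs in a fresh interpreter."""
    # `sys` is always loaded, so printing via sys does not inflate the count.
    code = statement + "\nimport sys\nprint('\\n'.join(sorted(sys.modules)))"
    out = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    names = out.stdout.split()
    # "filtered" counts the package itself plus its submodules.
    filtered = [n for n in names if n == prefix or n.startswith(prefix + ".")]
    return {"total": len(names), "filtered": len(filtered)}


if __name__ == "__main__":
    # Runnable anywhere; substitute ("import runpod", "runpod") in the project.
    print(count_modules("import json", "json"))
```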
- paramiko_eagerly_loaded: false - Good for serverless workers
- ssh_cli_loaded: false - Good for SDK users
- These should only be true when CLI commands are invoked
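Whether a dependency is loaded eagerly can be checked by looking for it in sys.modules right after the import under test. A minimal sketch with a hypothetical helper `is_eagerly_loaded`; how the real test derives these flags is an assumption:

```python
import subprocess
import sys


def is_eagerly_loaded(statement: str, module: str) -> bool:
    """True if `module` is in sys.modules after `statement` in a fresh process."""
    code = f"{statement}\nimport sys\nprint({module!r} in sys.modules)"
    out = subprocess.run(
        [sys.executable, "-c", code],
        check=True,
        capture_output=True,
        text=True,
    )
    # Take the last line in case the import itself prints something.
    return out.stdout.strip().splitlines()[-1] == "True"


if __name__ == "__main__":
    # In the project this would be is_eagerly_loaded("import runpod", "paramiko").
    print(is_eagerly_loaded("import json", "paramiko"))
```

A result of False after a plain import, and True only after a CLI code path runs, is the pattern the flags above are meant to confirm.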
If you see >100ms variance between runs:
- System is under load
- Disk I/O contention
- Python bytecode cache issues
Solution: Run multiple times and use median values.
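The reason to prefer the median is that a single slow run shifts the mean but barely moves the median. A small sketch, treating "variance" as the max-minus-min spread of the samples (an interpretation, not a definition taken from the test) and using a hypothetical helper `summarize`:

```python
import statistics

# Rule of thumb from above: more than 100ms between runs suggests noise.
NOISE_THRESHOLD_MS = 100.0


def summarize(samples: list) -> dict:
    """Summarize timing samples (ms) and flag runs that look noisy."""
    spread = max(samples) - min(samples)
    return {
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        "spread": spread,
        "noisy": spread > NOISE_THRESHOLD_MS,
    }


if __name__ == "__main__":
    # One outlier (528.0) inflates the mean while the median stays near 380.
    print(summarize([376.0, 380.0, 381.0, 528.0]))
```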
# Check git status
git status
# Manually restore if script failed mid-execution
git checkout <original-branch>
git stash pop
Ensure dependencies are installed:
uv sync --group test
- Iterations: 10 per measurement (configurable in test)
- Process isolation: Each measurement uses fresh subprocess
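- Python cache: Each subprocess starts with an empty in-memory module cache (sys.modules); on-disk bytecode (__pycache__) is not cleared by subprocess creation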
- System state: Cannot control OS-level caching
For production performance testing, consider:
- Running on CI with consistent environment
- Multiple runs at different times
- Comparing trends over multiple commits