# OpenEvolve AlgoTune Benchmark Report: Gemini Flash 2.5 Experiment

## Executive Summary

This report documents the evaluation of Google's Gemini Flash 2.5 model using OpenEvolve to optimize code across 8 AlgoTune benchmark tasks. The experiment ran for 114.6 minutes with a 100% task completion rate, discovering significant algorithmic improvements in 2 of the 8 tasks, including a remarkable 189.94x speedup for 2D convolution.

## Experiment Configuration

### Model Settings
- **Model**: Google Gemini Flash 2.5 (`google/gemini-2.5-flash`)
- **Temperature**: 0.4 (optimal based on prior tuning)
- **Max Tokens**: 16,000
- **Evolution Strategy**: Diff-based evolution
- **API Provider**: OpenRouter

### Evolution Parameters
- **Iterations per task**: 100
- **Checkpoint interval**: Every 10 iterations
- **Population size**: 1,000 programs
- **Number of islands**: 4 (for diversity)
- **Migration interval**: Every 20 generations

### Evaluation Settings
- **Cascade evaluation**: Enabled with 3 stages
- **Stage 2 timeout**: 200 seconds
- **Number of trials**: 5 test cases per evaluation
- **Timing runs**: 3 timed runs per trial, plus 1 warmup per evaluation
- **Total executions per evaluation**: 16
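The budget arithmetic behind these settings can be sketched as follows (the variable names are illustrative, not actual OpenEvolve configuration keys):

```python
# Illustrative sketch of the Stage 2 evaluation budget (names are hypothetical).
trials = 5        # test cases per evaluation
timing_runs = 3   # timed runs per trial
warmup_runs = 1   # one warmup for the whole evaluation

total_executions = trials * timing_runs + warmup_runs  # 5 * 3 + 1 = 16

stage2_timeout = 200.0                                 # seconds
per_run_budget = stage2_timeout / total_executions     # 12.5 s per execution
```

In other words, each individual run must finish well under 12.5 seconds for the whole stage to fit inside the timeout, which is what motivated the data size reductions described in the next section.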

## Critical Issue and Resolution

### The Data Size Problem
Initially, all tasks were timing out during Stage 2 evaluation even though a single run took only ~60 seconds. Investigation revealed:

- **Root cause**: Each evaluation actually performs 16 executions (5 trials × 3 timing runs + 1 warmup)
- **Original calculation**: 60 seconds × 16 = 960 seconds, far exceeding the 200-second timeout
- **Solution**: Reduced the data_size parameters by a factor of roughly 16
### Adjusted Data Sizes
| Task | Original | Adjusted | Reduction Factor |
|------|----------|----------|------------------|
| affine_transform_2d | 2000 | 100 | 20x |
| convolve2d_full_fill | 20 | 5 | 4x |
| eigenvectors_complex | 400 | 25 | 16x |
| fft_cmplx_scipy_fftpack | 1500 | 95 | 15.8x |
| fft_convolution | 2000 | 125 | 16x |
| lu_factorization | 400 | 25 | 16x |
| polynomial_real | 8000 | 500 | 16x |
| psd_cone_projection | 600 | 35 | 17.1x |

## Results Overview

### Performance Summary
| Task | Speedup | Combined Score | Runtime (s) | Status |
|------|---------|----------------|-------------|--------|
| convolve2d_full_fill | **189.94x** 🚀 | 0.955 | 643.2 | ✅ |
| psd_cone_projection | **2.37x** 🔥 | 0.975 | 543.5 | ✅ |
| eigenvectors_complex | 1.074x | 0.974 | 1213.2 | ✅ |
| lu_factorization | 1.062x | 0.987 | 727.9 | ✅ |
| affine_transform_2d | 1.053x | 0.939 | 577.5 | ✅ |
| polynomial_real | 1.036x | 0.801 | 2181.3 | ✅ |
| fft_cmplx_scipy_fftpack | 1.017x | 0.984 | 386.5 | ✅ |
| fft_convolution | 1.014x | 0.987 | 605.6 | ✅ |

### Key Metrics
- **Total runtime**: 114.6 minutes
- **Task completion rate**: 100% (8/8 tasks)
- **Tasks with significant optimization**: 2/8 (25%)
- **Tasks with only minor improvements**: 6/8 (75%)
- **Average time per task**: 14.3 minutes

## Detailed Analysis of Optimizations

### 1. convolve2d_full_fill - 189.94x Speedup (Major Success)

**Original Implementation:**
```python
def solve(self, problem):
    a, b = problem
    result = signal.convolve2d(a, b, mode=self.mode, boundary=self.boundary)
    return result
```

**Evolved Implementation:**
```python
def solve(self, problem):
    a_in, b_in = problem
    # Ensure inputs are float64 and C-contiguous for optimal FFT performance
    a = a_in if a_in.flags['C_CONTIGUOUS'] and a_in.dtype == np.float64 else np.ascontiguousarray(a_in, dtype=np.float64)
    b = b_in if b_in.flags['C_CONTIGUOUS'] and b_in.dtype == np.float64 else np.ascontiguousarray(b_in, dtype=np.float64)
    result = signal.fftconvolve(a, b, mode=self.mode)
    return result
```
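As a sanity check on the algorithmic swap, the two SciPy routines can be compared directly. This sketch (illustrative, not taken from the experiment) verifies that `fftconvolve` reproduces `convolve2d`'s output for the `full`/`fill` configuration used by the task:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))  # "image"-sized input
b = rng.standard_normal((8, 8))    # smaller kernel

# Direct spatial convolution vs. FFT-based convolution.
direct = signal.convolve2d(a, b, mode="full", boundary="fill")
fft_based = signal.fftconvolve(a, b, mode="full")

# Both yield a (64+8-1) x (64+8-1) result, equal up to floating-point error.
```

The outputs agree to floating-point tolerance, so the 189.94x gain comes purely from the asymptotically cheaper algorithm, not from a change in what is computed.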

**Key Optimizations:**
- **Algorithmic change**: Switched from direct `convolve2d`, which costs O(n⁴) for two n×n inputs, to FFT-based `fftconvolve`, which costs O(n² log n)
- **Memory optimization**: Ensured C-contiguous memory layout for FFT efficiency
- **Type optimization**: Explicit float64 dtype for numerical stability

### 2. psd_cone_projection - 2.37x Speedup (Moderate Success)

**Original Implementation:**
```python
def solve(self, problem):
    A = problem["matrix"]
    # Standard eigendecomposition
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals = np.maximum(eigvals, 0)
    X = eigvecs @ np.diag(eigvals) @ eigvecs.T
    return {"projection": X}
```

**Evolved Implementation:**
```python
def solve(self, problem):
    A = problem["matrix"]
    # Use eigh for symmetric matrices: faster and numerically more stable
    eigvals, eigvecs = np.linalg.eigh(A)
    # Clip negative eigenvalues to zero
    eigvals = np.maximum(eigvals, 0)
    # Scale eigenvector columns directly instead of forming np.diag(eigvals)
    X = (eigvecs * eigvals) @ eigvecs.T
    return {"projection": X}
```
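The broadcasting trick in the evolved code can be verified independently. The following sketch (illustrative, not part of the benchmark) checks that column-wise scaling matches the explicit `np.diag` product and that the result is a valid PSD projection:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = (M + M.T) / 2  # symmetric input, as the task assumes

eigvals, eigvecs = np.linalg.eigh(A)
clipped = np.maximum(eigvals, 0)

# Broadcasting multiplies each eigenvector column j by clipped[j],
# skipping the construction of an n x n diagonal matrix entirely.
X_fast = (eigvecs * clipped) @ eigvecs.T
X_diag = eigvecs @ np.diag(clipped) @ eigvecs.T
```

`X_fast` and `X_diag` agree to floating-point tolerance, and all eigenvalues of the projection are non-negative, so the cheaper formulation is exact, not an approximation.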

**Key Optimizations:**
- **Specialized function**: Used `eigh` instead of `eig`, exploiting the symmetry of the input matrix
- **Optimized multiplication**: Replaced `eigvecs @ np.diag(eigvals) @ eigvecs.T` with `(eigvecs * eigvals) @ eigvecs.T`, avoiding construction of an n×n diagonal matrix
- **Better numerical stability**: `eigh` guarantees real eigenvalues for symmetric matrices

### 3. Minor Optimizations (1.01x - 1.07x Speedup)

**affine_transform_2d (1.053x):**
```python
# Original
image = problem["image"]
matrix = problem["matrix"]

# Evolved
image = np.asarray(problem["image"], dtype=float)
matrix = np.asarray(problem["matrix"], dtype=float)
```
- Added explicit type conversion up front, avoiding repeated runtime type checks and conversions downstream

**Other tasks** showed no visible code changes, suggesting:
- Speedups likely due to measurement variance
- Minor internal optimizations not visible in the source
- Statistical noise in timing measurements
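One way to separate real speedups from this kind of noise is to compare the measured ratio against the spread of the timing samples themselves. A hedged sketch with made-up numbers:

```python
import statistics

# Hypothetical timing samples (seconds) from repeated runs; not real data.
baseline_times = [1.002, 0.998, 1.005, 0.996, 1.001]
candidate_times = [0.965, 0.972, 0.958, 0.969, 0.961]

def cv(samples):
    """Coefficient of variation: relative spread of the timing samples."""
    return statistics.stdev(samples) / statistics.mean(samples)

speedup = statistics.mean(baseline_times) / statistics.mean(candidate_times)
noise = max(cv(baseline_times), cv(candidate_times))

# A ~1.04x "speedup" is only credible if it clearly exceeds the noise level;
# here the spread is under 1%, so a 3-4% improvement would be meaningful,
# while many of the ~1.01x results above would not be.
```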

## What Worked Well

### 1. Evolution Discovery Capabilities
- Successfully discovered the FFT-based convolution optimization (189.94x speedup)
- Found the specialized routine for symmetric matrices (2.37x speedup)
- Identified memory-layout optimizations

### 2. Configuration Optimizations
- Diff-based evolution worked better than full rewrites for Gemini
- Temperature 0.4 provided a good balance between exploration and exploitation
- Island-based evolution maintained population diversity

### 3. System Robustness
- 100% task completion rate after the data size adjustment
- No crashes or critical failures
- The checkpoint system allowed progress tracking

## What Didn't Work

### 1. Limited Optimization Discovery
- 6 of 8 tasks showed minimal improvements (<7%)
- Most baseline implementations were already near-optimal
- Evolution struggled to improve already-optimized code

### 2. Initial Configuration Issues
- The original data_size values caused timeouts
- Manual intervention was required to adjust parameters
- Cascade evaluation timing wasn't initially accounted for

### 3. Minor Perturbations vs. Real Optimizations
- Many "improvements" were just measurement noise
- Small type conversions counted as optimizations
- Real improvements were difficult to distinguish from variance

## Lessons Learned

### 1. Evaluation Complexity
- Account for the total execution count (trials × timing runs + warmup)
- Cascade evaluation adds significant overhead
- Timeout settings need careful calibration

### 2. Baseline Quality Matters
- Well-optimized baselines leave little room for improvement
- AlgoTune baselines already use efficient libraries (SciPy, NumPy)
- Major improvements are only possible through algorithmic changes

### 3. Evolution Effectiveness
- Works best when alternative algorithms exist (`convolve2d` → `fftconvolve`)
- Can find specialized functions (`eig` → `eigh`)
- Struggles with micro-optimizations

## Recommendations for Future Experiments

### 1. Task Selection
- Include tasks with known suboptimal baseline implementations
- Add problems where multiple algorithmic approaches exist
- Consider more complex optimization scenarios

### 2. Configuration Tuning
- Pre-calculate the total execution time when choosing data sizes
- Consider reducing trials/runs for faster iteration
- Adjust timeouts based on observed execution patterns

### 3. Model Comparison Setup
For comparing with other models (e.g., Claude, GPT-4):
- Use identical configuration parameters
- Run on the same hardware for a fair comparison
- Track both speedup and code-quality metrics
- Document any model-specific adjustments needed

## Conclusion

The Gemini Flash 2.5 experiment demonstrated OpenEvolve's ability to discover significant algorithmic improvements when they exist. The system achieved a 189.94x speedup on 2D convolution by automatically discovering FFT-based methods, and a 2.37x speedup on PSD cone projection through specialized matrix operations.

However, the experiment also revealed that for well-optimized baseline implementations, evolution produces only minimal gains. The 25% rate of meaningful optimizations suggests that careful task selection is crucial for demonstrating the effectiveness of evolutionary code optimization.

### Next Steps
1. Run the identical benchmark with alternative LLM models
2. Compare optimization discovery rates across models
3. Analyze code quality and correctness across models
4. Document model-specific strengths and weaknesses

---

**Experiment Details:**
- Date: August 14, 2025
- Duration: 114.6 minutes
- Hardware: macOS (Darwin 24.5.0)
- OpenEvolve Version: Current main branch
- API Provider: OpenRouter