Commit 546f8d8 (parent 23ca50f): Create GEMINI_FLASH_2.5_EXPERIMENT_REPORT.md (1 file changed, +239 −0)

# OpenEvolve AlgoTune Benchmark Report: Gemini Flash 2.5 Experiment

## Executive Summary

This report documents a comprehensive evaluation of Google's Gemini Flash 2.5 model, using OpenEvolve to optimize code across 8 AlgoTune benchmark tasks. The experiment ran for 114.6 minutes with a 100% success rate, discovering significant algorithmic improvements in 2 of the 8 tasks, including a remarkable 189.94x speedup for 2D convolution.

## Experiment Configuration

### Model Settings

- **Model**: Google Gemini Flash 2.5 (`google/gemini-2.5-flash`)
- **Temperature**: 0.4 (optimal based on prior tuning)
- **Max Tokens**: 16,000
- **Evolution Strategy**: Diff-based evolution
- **API Provider**: OpenRouter

### Evolution Parameters

- **Iterations per task**: 100
- **Checkpoint interval**: Every 10 iterations
- **Population size**: 1,000 programs
- **Number of islands**: 4 (for diversity)
- **Migration interval**: Every 20 generations

### Evaluation Settings

- **Cascade evaluation**: Enabled with 3 stages
- **Stage 2 timeout**: 200 seconds
- **Number of trials**: 5 test cases per evaluation
- **Timing runs**: 3 per trial, plus 1 warmup
- **Total executions per evaluation**: 16 (5 × 3 + 1)

## Critical Issue and Resolution

### The Data Size Problem

Initially, all tasks were timing out during Stage 2 evaluation even though individual runs took only ~60 seconds. Investigation revealed:

- **Root cause**: Each evaluation actually performs 16 executions (5 trials × 3 timing runs + 1 warmup)
- **Original calculation**: 60 seconds × 16 = 960 seconds > 200-second timeout
- **Solution**: Reduced data_size parameters by a factor of ~16
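
The sizing arithmetic above can be sketched as a small helper. This is a hypothetical illustration (the function name and parameters are not from OpenEvolve's codebase), and it assumes per-run time scales roughly linearly with data size:

```python
def max_data_size(measured_run_s, current_data_size, trials=5, timing_runs=3,
                  warmup=1, timeout_s=200, safety=0.8):
    """Estimate how far data_size must shrink so one evaluation fits the timeout.

    Assumes per-run time scales roughly linearly with data_size, which is a
    simplification; superlinearly scaling tasks need a larger reduction factor.
    """
    executions = trials * timing_runs + warmup          # 5 * 3 + 1 = 16
    total_s = measured_run_s * executions               # e.g. 60 s * 16 = 960 s
    reduction = total_s / (timeout_s * safety)          # overrun relative to budget
    return max(1, int(current_data_size / reduction))

# A single ~60 s run overruns a 200 s budget by ~6x once all 16 executions count:
print(max_data_size(measured_run_s=60, current_data_size=400))
```

Superlinear tasks explain why the actual reductions in the table below (around 16x) are larger than this linear estimate suggests.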

### Adjusted Data Sizes

| Task | Original | Adjusted | Reduction Factor |
|------|----------|----------|------------------|
| affine_transform_2d | 2000 | 100 | 20x |
| convolve2d_full_fill | 20 | 5 | 4x |
| eigenvectors_complex | 400 | 25 | 16x |
| fft_cmplx_scipy_fftpack | 1500 | 95 | 15.8x |
| fft_convolution | 2000 | 125 | 16x |
| lu_factorization | 400 | 25 | 16x |
| polynomial_real | 8000 | 500 | 16x |
| psd_cone_projection | 600 | 35 | 17.1x |

## Results Overview

### Performance Summary

| Task | Speedup | Combined Score | Runtime (s) | Status |
|------|---------|----------------|-------------|--------|
| convolve2d_full_fill | **189.94x** 🚀 | 0.955 | 643.2 | ✅ |
| psd_cone_projection | **2.37x** 🔥 | 0.975 | 543.5 | ✅ |
| eigenvectors_complex | 1.074x | 0.974 | 1213.2 | ✅ |
| lu_factorization | 1.062x | 0.987 | 727.9 | ✅ |
| affine_transform_2d | 1.053x | 0.939 | 577.5 | ✅ |
| polynomial_real | 1.036x | 0.801 | 2181.3 | ✅ |
| fft_cmplx_scipy_fftpack | 1.017x | 0.984 | 386.5 | ✅ |
| fft_convolution | 1.014x | 0.987 | 605.6 | ✅ |

### Key Metrics

- **Total runtime**: 114.6 minutes
- **Success rate**: 100% (8/8 tasks)
- **Tasks with significant optimization**: 2/8 (25%)
- **Tasks with minor improvements**: 6/8 (75%)
- **Average time per task**: 14.3 minutes

## Detailed Analysis of Optimizations

### 1. convolve2d_full_fill - 189.94x Speedup (Major Success)

**Original Implementation:**

```python
def solve(self, problem):
    a, b = problem
    result = signal.convolve2d(a, b, mode=self.mode, boundary=self.boundary)
    return result
```

**Evolved Implementation:**

```python
def solve(self, problem):
    a_in, b_in = problem
    # Ensure inputs are float64 and C-contiguous for optimal performance with FFT
    a = a_in if a_in.flags['C_CONTIGUOUS'] and a_in.dtype == np.float64 else np.ascontiguousarray(a_in, dtype=np.float64)
    b = b_in if b_in.flags['C_CONTIGUOUS'] and b_in.dtype == np.float64 else np.ascontiguousarray(b_in, dtype=np.float64)
    result = signal.fftconvolve(a, b, mode=self.mode)
    return result
```

**Key Optimizations:**

- **Algorithmic change**: Switched from `convolve2d` (O(n⁴)) to `fftconvolve` (O(n² log n))
- **Memory optimization**: Ensured C-contiguous memory layout for FFT efficiency
- **Type optimization**: Explicit float64 dtype for numerical stability
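
The equivalence of the two routines can be checked directly. This is a standalone sketch, independent of the report's evaluation harness:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((8, 8))

# Direct (spatial-domain) 2D convolution vs. FFT-based convolution
direct = signal.convolve2d(a, b, mode="full", boundary="fill")
fft = signal.fftconvolve(a, b, mode="full")

# Both produce the same "full" output to floating-point tolerance
print(np.allclose(direct, fft))  # True
```

Because the FFT route replaces a quadruple loop with transforms and an elementwise product, its advantage grows with kernel size, which is why the speedup here is so large.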

### 2. psd_cone_projection - 2.37x Speedup (Moderate Success)

**Original Implementation:**

```python
def solve(self, problem):
    A = problem["matrix"]
    # Standard eigendecomposition
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals = np.maximum(eigvals, 0)
    X = eigvecs @ np.diag(eigvals) @ eigvecs.T
    return {"projection": X}
```

**Evolved Implementation:**

```python
def solve(self, problem):
    A = problem["matrix"]
    # Use eigh for symmetric matrices for better performance and numerical stability
    eigvals, eigvecs = np.linalg.eigh(A)
    # Clip negative eigenvalues to zero
    eigvals = np.maximum(eigvals, 0)
    # Optimized matrix multiplication: multiply eigvecs with eigvals first
    X = (eigvecs * eigvals) @ eigvecs.T
    return {"projection": X}
```

**Key Optimizations:**

- **Specialized function**: Used `eigh` instead of `eig` for symmetric matrices
- **Optimized multiplication**: Replaced `eigvecs @ np.diag(eigvals) @ eigvecs.T` with `(eigvecs * eigvals) @ eigvecs.T`, which avoids materializing the diagonal matrix
- **Better numerical stability**: `eigh` guarantees real eigenvalues for symmetric matrices
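
The broadcasting trick can be verified in isolation. A standalone sketch (not from the benchmark code):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2  # symmetric input, as the PSD projection assumes

eigvals, eigvecs = np.linalg.eigh(A)
clipped = np.maximum(eigvals, 0)

# Broadcasting scales each eigenvector column by its eigenvalue,
# so no (n, n) diagonal matrix is ever built.
via_diag = eigvecs @ np.diag(clipped) @ eigvecs.T
via_broadcast = (eigvecs * clipped) @ eigvecs.T

print(np.allclose(via_diag, via_broadcast))  # True
# The result is PSD: all eigenvalues are (numerically) nonnegative
print(np.linalg.eigvalsh(via_broadcast).min() >= -1e-12)  # True
```

This swaps an O(n³) matrix product for an O(n²) columnwise scale, on top of `eigh` being roughly twice as fast as `eig` on symmetric inputs.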
130+
131+
### 3. Minor Optimizations (1.01x - 1.07x Speedup)
132+
133+
**affine_transform_2d (1.053x):**
134+
```python
135+
# Original
136+
image = problem["image"]
137+
matrix = problem["matrix"]
138+
139+
# Evolved
140+
image = np.asarray(problem["image"], dtype=float)
141+
matrix = np.asarray(problem["matrix"], dtype=float)
142+
```
143+
- Added explicit type conversion to avoid runtime type checking
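
Why this helps can be seen from `np.asarray`'s behavior: when the input is already an ndarray of the requested dtype, it returns the same object. A small standalone sketch:

```python
import numpy as np

lst = [[1, 2], [3, 4]]            # list input: asarray must build a new array
arr = np.asarray(lst, dtype=float)

# If the input is already a float64 ndarray, asarray is a no-op that returns
# the same object, so hoisting the conversion out of the hot path is cheap.
again = np.asarray(arr, dtype=float)
print(again is arr)  # True
```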

**Other tasks** showed no visible code changes, suggesting:

- Speedups likely due to measurement variance
- Minor internal optimizations not visible in source
- Statistical noise in timing measurements

## What Worked Well

### 1. Evolution Discovery Capabilities

- Successfully discovered the FFT-based convolution optimization (189x speedup)
- Found specialized functions for symmetric matrices (2.37x speedup)
- Identified memory layout optimizations

### 2. Configuration Optimizations

- Diff-based evolution worked better than full rewrites for Gemini
- Temperature 0.4 provided a good balance between exploration and exploitation
- Island-based evolution maintained diversity

### 3. System Robustness

- 100% task completion rate after the data size adjustment
- No crashes or critical failures
- The checkpoint system allowed progress tracking

## What Didn't Work

### 1. Limited Optimization Discovery

- 6 out of 8 tasks showed minimal improvements (<7%)
- Most baseline implementations were already near-optimal
- Evolution struggled to improve already-optimized code

### 2. Initial Configuration Issues

- Original data_size values caused timeouts
- Required manual intervention to adjust parameters
- Cascade evaluation timing wasn't initially accounted for

### 3. Minor Perturbations vs. Real Optimizations

- Many "improvements" were just measurement noise
- Small type conversions counted as optimizations
- Difficult to distinguish real improvements from variance

## Lessons Learned

### 1. Evaluation Complexity

- Must account for the total execution count (trials × timing runs + warmup)
- Cascade evaluation adds significant overhead
- Timeout settings need careful calibration

### 2. Baseline Quality Matters

- Well-optimized baselines leave little room for improvement
- AlgoTune baselines already use efficient libraries (scipy, numpy)
- Major improvements are only possible with algorithmic changes

### 3. Evolution Effectiveness

- Works best when alternative algorithms exist (convolve2d → fftconvolve)
- Can find specialized functions (eig → eigh)
- Struggles with micro-optimizations

## Recommendations for Future Experiments

### 1. Task Selection

- Include tasks with known suboptimal baseline implementations
- Add problems where multiple algorithmic approaches exist
- Consider more complex optimization scenarios

### 2. Configuration Tuning

- Pre-calculate total execution time when choosing data sizes
- Consider reducing trials/runs for faster iteration
- Adjust timeouts based on actual execution patterns

### 3. Model Comparison Setup

For comparing with other models (e.g., Claude, GPT-4):

- Use identical configuration parameters
- Run on the same hardware for a fair comparison
- Track both speedup and code quality metrics
- Document any model-specific adjustments needed

## Conclusion

The Gemini Flash 2.5 experiment demonstrated OpenEvolve's capability to discover significant algorithmic improvements when they exist. The system achieved a 189.94x speedup on 2D convolution by automatically discovering FFT-based methods, and a 2.37x speedup on PSD projection through specialized matrix operations.

However, the experiment also showed that evolution produces minimal improvements over well-optimized baseline implementations. The 25% rate of meaningful optimizations suggests that careful task selection is crucial for demonstrating the effectiveness of evolutionary code optimization.

### Next Steps

1. Run the identical benchmark with alternative LLM models
2. Compare optimization discovery rates across models
3. Analyze code quality and correctness across models
4. Document model-specific strengths and weaknesses

---

**Experiment Details:**

- Date: August 14, 2025
- Duration: 114.6 minutes
- Hardware: macOS (Darwin 24.5.0)
- OpenEvolve Version: Current main branch
- API Provider: OpenRouter