Commit db73436

Merge pull request #211 from codelion/fix-algotune-parsing-config

Fix algotune parsing config

2 parents 5352eb9 + 0bd81b5

43 files changed: +10,055 −152 lines

README.md

Lines changed: 61 additions & 0 deletions
@@ -91,6 +91,44 @@ OpenEvolve orchestrates a sophisticated evolutionary pipeline:

- Feature map clustering and archive management
- Comprehensive metadata and lineage tracking
### Island-Based Evolution with Worker Pinning

OpenEvolve implements an island-based evolutionary architecture that maintains multiple isolated populations to prevent premature convergence and preserve genetic diversity.

#### How Islands Work

- **Multiple Isolated Populations**: Each island maintains its own population of programs that evolve independently
- **Periodic Migration**: Top-performing programs periodically migrate between adjacent islands (ring topology) to share beneficial mutations
- **True Population Isolation**: Worker processes are deterministically pinned to specific islands so that parallel evolution cannot cross-contaminate populations

#### Worker-to-Island Pinning

To preserve island isolation during parallel execution, OpenEvolve pins each worker to an island automatically:

```python
# Workers are distributed across islands using modulo arithmetic:
#   island_id = worker_id % num_islands
#
# Example with 3 islands and 6 workers:
#   Workers 0, 3 → Island 0
#   Workers 1, 4 → Island 1
#   Workers 2, 5 → Island 2
```

**Benefits of Worker Pinning**:
- **Genetic Isolation**: Prevents accidental population mixing between islands during parallel sampling
- **Consistent Evolution**: Each island maintains its distinct evolutionary trajectory
- **Balanced Load**: Workers are evenly distributed across islands automatically
- **Migration Integrity**: Migration happens only at designated intervals, never through race conditions

**Automatic Distribution**: The system handles all edge cases automatically:
- **More workers than islands**: Multiple workers per island, with balanced distribution
- **Fewer workers than islands**: Some islands may lack dedicated workers but still participate in migration
- **Single island**: All workers sample from the same population (degrades to standard evolution)

This architecture lets each island develop its own evolutionary pressures and solutions, while periodic migration spreads successful innovations across the population without destroying diversity.
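The ring-topology migration described above can be pictured with a short sketch. This is a minimal illustration of the general technique, not OpenEvolve's actual implementation; the program dictionaries and the `score` field are hypothetical stand-ins:

```python
def migrate(islands, top_k=1):
    """Copy each island's top programs to its right-hand neighbor (ring topology)."""
    # Snapshot migrants first so one sweep can't cascade a program around the ring
    migrants = [sorted(pop, key=lambda p: p["score"], reverse=True)[:top_k]
                for pop in islands]
    for i, group in enumerate(migrants):
        neighbor = (i + 1) % len(islands)  # ring: island i sends to island i+1
        islands[neighbor].extend(dict(p) for p in group)

# Three islands of five programs each; "score" is a stand-in fitness value
islands = [[{"id": f"{i}-{j}", "score": j / 10} for j in range(5)]
           for i in range(3)]
migrate(islands)
print([len(pop) for pop in islands])  # each island gained one migrant: [6, 6, 6]
```

Because migrants are copied rather than moved, each island keeps its own champions while neighbors gain access to them at the next sampling step.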
## Getting Started

### Installation
@@ -377,6 +415,29 @@ database:
    correctness: 15  # 15 bins for correctness (from YOUR evaluator)
```

**CRITICAL: Return Raw Values, Not Bin Indices**: For custom feature dimensions, your evaluator must return **raw continuous values**, not pre-computed bin indices. OpenEvolve handles all scaling and binning internally.

```python
# ✅ CORRECT: return raw values
return {
    "combined_score": 0.85,
    "prompt_length": 1247,    # actual character count
    "execution_time": 0.234,  # raw time in seconds
}

# ❌ WRONG: don't return bin indices
return {
    "combined_score": 0.85,
    "prompt_length": 7,   # pre-computed bin index
    "execution_time": 3,  # pre-computed bin index
}
```

OpenEvolve automatically handles:
- Min-max scaling to the [0, 1] range
- Binning into the specified number of bins
- Adaptive rescaling as the value range expands during evolution
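The scale-then-bin step can be pictured with a small sketch. This illustrates the general technique, not OpenEvolve's internal code; the observed value range and bin count below are invented for the example:

```python
def to_bin(value, seen_min, seen_max, num_bins):
    """Min-max scale a raw metric to [0, 1], then map it to a bin index."""
    if seen_max == seen_min:
        return 0  # degenerate range: everything falls in the first bin
    scaled = (value - seen_min) / (seen_max - seen_min)
    # Clamp so a value at the max lands in the last bin, not one past it
    return min(int(scaled * num_bins), num_bins - 1)

# Raw prompt lengths observed so far span 100..2000 characters; 10 bins
print(to_bin(1247, 100, 2000, 10))  # raw value 1247 lands in bin 6
```

Because the min/max are tracked over the run, the same raw value can move to a different bin as the population explores a wider range, which is exactly the adaptive rescaling described above.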
**Important**: OpenEvolve will raise an error if a specified feature is not found in the evaluator's metrics. This ensures your configuration is correct; the error message lists the available metrics to help you fix it.

See the [Configuration Guide](configs/default_config.yaml) for a full list of options.

examples/README.md

Lines changed: 50 additions & 0 deletions

@@ -133,6 +133,56 @@ log_level: "INFO"

❌ **Wrong:** Multiple EVOLVE-BLOCK sections
✅ **Correct:** Exactly one EVOLVE-BLOCK section
## MAP-Elites Feature Dimensions Best Practices

When using custom feature dimensions, your evaluator must return **raw continuous values**, not pre-computed bin indices:

### ✅ Correct: Return Raw Values
```python
def evaluate(program_path: str) -> Dict:
    # Calculate actual measurements
    prompt_length = len(generated_prompt)  # actual character count
    execution_time = measure_runtime()     # time in seconds
    memory_usage = get_peak_memory()       # bytes used

    return {
        "combined_score": accuracy_score,
        "prompt_length": prompt_length,    # raw count, not a bin index
        "execution_time": execution_time,  # raw seconds, not a bin index
        "memory_usage": memory_usage,      # raw bytes, not a bin index
    }
```

### ❌ Wrong: Return Bin Indices
```python
def evaluate(program_path: str) -> Dict:
    prompt_length = len(generated_prompt)

    # DON'T DO THIS - pre-computing bins
    if prompt_length < 100:
        length_bin = 0
    elif prompt_length < 500:
        length_bin = 1
    # ... more binning logic

    return {
        "combined_score": accuracy_score,
        "prompt_length": length_bin,  # ❌ a bin index, not a raw value
    }
```

### Why This Matters
- OpenEvolve uses min-max scaling internally
- Bin indices get incorrectly scaled as if they were raw values
- Grid positions become unstable as new programs shift the min/max range
- This violates MAP-Elites principles and leads to poor evolution
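The instability is easy to demonstrate numerically: when an evaluator returns bin indices, min-max scaling treats them as raw values, so any new extreme rescales the whole axis and the same program jumps to a different grid cell. A toy illustration (the `grid_cell` helper is hypothetical, not OpenEvolve's API):

```python
def grid_cell(value, seen_values, num_bins=10):
    """Min-max scale against all values seen so far, then bin."""
    lo, hi = min(seen_values), max(seen_values)
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * num_bins), num_bins - 1)

# Returning raw character counts: the program at 1247 chars gets a stable cell
raw_seen = [100, 500, 1247, 2000]
print(grid_cell(1247, raw_seen))  # bin 6

# Returning bin indices instead: one new outlier "index" rescales everything
idx_seen = [0, 1, 7]
before = grid_cell(7, idx_seen)            # 9
after = grid_cell(7, idx_seen + [20])      # 3
print(before, after)  # the same program jumps from cell 9 to cell 3
```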
### Examples of Good Feature Dimensions
- **Counts**: Token count, line count, character count
- **Performance**: Execution time, memory usage, throughput
- **Quality**: Accuracy, precision, recall, F1 score
- **Complexity**: Cyclomatic complexity, nesting depth, function count
## Running Your Example

```bash
Lines changed: 239 additions & 0 deletions

@@ -0,0 +1,239 @@
# OpenEvolve AlgoTune Benchmark Report: Gemini Flash 2.5 Experiment

## Executive Summary

This report documents a comprehensive evaluation of Google's Gemini Flash 2.5 model, using OpenEvolve to optimize code across 8 AlgoTune benchmark tasks. The experiment ran for 114.6 minutes with a 100% success rate, discovering significant algorithmic improvements in 2 of the 8 tasks, including a remarkable 189.94x speedup for 2D convolution.

## Experiment Configuration

### Model Settings
- **Model**: Google Gemini Flash 2.5 (`google/gemini-2.5-flash`)
- **Temperature**: 0.4 (optimal based on prior tuning)
- **Max Tokens**: 16,000
- **Evolution Strategy**: Diff-based evolution
- **API Provider**: OpenRouter

### Evolution Parameters
- **Iterations per task**: 100
- **Checkpoint interval**: Every 10 iterations
- **Population size**: 1,000 programs
- **Number of islands**: 4 (for diversity)
- **Migration interval**: Every 20 generations

### Evaluation Settings
- **Cascade evaluation**: Enabled, with 3 stages
- **Stage 2 timeout**: 200 seconds
- **Number of trials**: 5 test cases per evaluation
- **Timing runs**: 3 per trial, plus 1 warmup
- **Total executions per evaluation**: 16 (5 trials × 3 timing runs + 1 warmup)
## Critical Issue and Resolution

### The Data Size Problem
Initially, all tasks were timing out during Stage 2 evaluation despite individual runs taking only ~60 seconds. Investigation revealed:

- **Root cause**: Each evaluation actually performs 16 executions (5 trials × 3 timing runs + 1 warmup)
- **Original calculation**: 60 seconds × 16 = 960 seconds > 200-second timeout
- **Solution**: Reduced the data_size parameters by a factor of ~16

### Adjusted Data Sizes
| Task | Original | Adjusted | Reduction Factor |
|------|----------|----------|------------------|
| affine_transform_2d | 2000 | 100 | 20x |
| convolve2d_full_fill | 20 | 5 | 4x |
| eigenvectors_complex | 400 | 25 | 16x |
| fft_cmplx_scipy_fftpack | 1500 | 95 | 15.8x |
| fft_convolution | 2000 | 125 | 16x |
| lu_factorization | 400 | 25 | 16x |
| polynomial_real | 8000 | 500 | 16x |
| psd_cone_projection | 600 | 35 | 17.1x |
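The timeout arithmetic above generalizes into a quick pre-flight check before launching a run. A minimal sketch mirroring this experiment's settings (`evaluation_budget` is a hypothetical helper, not part of OpenEvolve):

```python
def evaluation_budget(per_run_seconds, trials=5, timing_runs=3, warmup_runs=1):
    """Total wall-clock seconds one cascade-stage evaluation will consume."""
    executions = trials * timing_runs + warmup_runs  # 5 * 3 + 1 = 16
    return executions * per_run_seconds

# At ~60 s per run, the original data sizes blow the 200 s stage timeout
print(evaluation_budget(60))    # 960 -> times out
# After shrinking data sizes ~16x, each run takes roughly 3.75 s
print(evaluation_budget(3.75))  # 60.0 -> fits comfortably
```

Running this check against the configured timeout before an experiment would have caught the data-size problem without any wasted iterations.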
## Results Overview

### Performance Summary
| Task | Speedup | Combined Score | Runtime (s) | Status |
|------|---------|----------------|-------------|--------|
| convolve2d_full_fill | **189.94x** 🚀 | 0.955 | 643.2 | ✅ |
| psd_cone_projection | **2.37x** 🔥 | 0.975 | 543.5 | ✅ |
| eigenvectors_complex | 1.074x | 0.974 | 1213.2 | ✅ |
| lu_factorization | 1.062x | 0.987 | 727.9 | ✅ |
| affine_transform_2d | 1.053x | 0.939 | 577.5 | ✅ |
| polynomial_real | 1.036x | 0.801 | 2181.3 | ✅ |
| fft_cmplx_scipy_fftpack | 1.017x | 0.984 | 386.5 | ✅ |
| fft_convolution | 1.014x | 0.987 | 605.6 | ✅ |

### Key Metrics
- **Total runtime**: 114.6 minutes
- **Success rate**: 100% (8/8 tasks)
- **Tasks with significant optimization**: 2/8 (25%)
- **Tasks with minor improvements**: 6/8 (75%)
- **Average time per task**: 14.3 minutes
## Detailed Analysis of Optimizations

### 1. convolve2d_full_fill - 189.94x Speedup (Major Success)

**Original Implementation:**
```python
def solve(self, problem):
    a, b = problem
    result = signal.convolve2d(a, b, mode=self.mode, boundary=self.boundary)
    return result
```

**Evolved Implementation:**
```python
def solve(self, problem):
    a_in, b_in = problem
    # Ensure inputs are float64 and C-contiguous for optimal FFT performance
    a = a_in if a_in.flags['C_CONTIGUOUS'] and a_in.dtype == np.float64 else np.ascontiguousarray(a_in, dtype=np.float64)
    b = b_in if b_in.flags['C_CONTIGUOUS'] and b_in.dtype == np.float64 else np.ascontiguousarray(b_in, dtype=np.float64)
    result = signal.fftconvolve(a, b, mode=self.mode)
    return result
```

**Key Optimizations:**
- **Algorithmic change**: Switched from `convolve2d` (O(n⁴)) to `fftconvolve` (O(n² log n))
- **Memory optimization**: Ensured a C-contiguous memory layout for FFT efficiency
- **Type optimization**: Explicit float64 dtype for numerical stability
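The correctness of this swap rests on the convolution theorem: a full 2D convolution equals an inverse FFT of the product of zero-padded FFTs. A toy check of that equivalence, assuming numpy is available (this is an illustration, not the benchmark harness or scipy's implementation):

```python
import numpy as np

def direct_conv2d_full(a, b):
    """Naive O(n^4) full 2D convolution: shift-and-add every kernel placement."""
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

def fft_conv2d_full(a, b):
    """O(n^2 log n) full convolution via zero-padded 2-D FFTs."""
    shape = (a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1)
    return np.real(np.fft.ifft2(np.fft.fft2(a, shape) * np.fft.fft2(b, shape)))

rng = np.random.default_rng(0)
a, b = rng.random((8, 8)), rng.random((3, 3))
print(np.allclose(direct_conv2d_full(a, b), fft_conv2d_full(a, b)))  # True
```

The two methods agree to floating-point tolerance; the asymptotic gap between them is what the evolved code exploited at benchmark scale.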
### 2. psd_cone_projection - 2.37x Speedup (Moderate Success)

**Original Implementation:**
```python
def solve(self, problem):
    A = problem["matrix"]
    # Standard eigendecomposition
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals = np.maximum(eigvals, 0)
    X = eigvecs @ np.diag(eigvals) @ eigvecs.T
    return {"projection": X}
```

**Evolved Implementation:**
```python
def solve(self, problem):
    A = problem["matrix"]
    # Use eigh for symmetric matrices: better performance and numerical stability
    eigvals, eigvecs = np.linalg.eigh(A)
    # Clip negative eigenvalues to zero
    eigvals = np.maximum(eigvals, 0)
    # Optimized multiplication: scale the eigvecs columns by eigvals first
    X = (eigvecs * eigvals) @ eigvecs.T
    return {"projection": X}
```

**Key Optimizations:**
- **Specialized function**: Used `eigh` instead of `eig` for symmetric matrices
- **Optimized multiplication**: Replaced `eigvecs @ np.diag(eigvals) @ eigvecs.T` with `(eigvecs * eigvals) @ eigvecs.T`, avoiding an explicit diagonal matrix
- **Better numerical stability**: `eigh` guarantees real eigenvalues for symmetric matrices
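The rewritten multiplication relies on a standard broadcasting identity: scaling the columns of `eigvecs` by `eigvals` is exactly right-multiplication by `diag(eigvals)`, without materializing an n×n diagonal matrix. A quick numpy check, independent of the benchmark code:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 5))
A = (M + M.T) / 2  # symmetric input, as the PSD projection assumes

eigvals, eigvecs = np.linalg.eigh(A)
eigvals = np.maximum(eigvals, 0)  # clip negatives to project onto the PSD cone

slow = eigvecs @ np.diag(eigvals) @ eigvecs.T  # builds an explicit n x n diagonal
fast = (eigvecs * eigvals) @ eigvecs.T         # broadcasts eigvals over columns
print(np.allclose(slow, fast))  # True
```

The broadcast form saves one n×n matrix product and the allocation of the diagonal matrix, which is where the measured speedup plausibly comes from.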
### 3. Minor Optimizations (1.01x - 1.07x Speedup)

**affine_transform_2d (1.053x):**
```python
# Original
image = problem["image"]
matrix = problem["matrix"]

# Evolved
image = np.asarray(problem["image"], dtype=float)
matrix = np.asarray(problem["matrix"], dtype=float)
```
- Added explicit type conversion to avoid repeated runtime type checks

**Other tasks** showed no visible code changes, suggesting:
- Speedups likely due to measurement variance
- Minor internal optimizations not visible in the source
- Statistical noise in timing measurements
## What Worked Well

### 1. Evolution Discovery Capabilities
- Successfully discovered the FFT-based convolution optimization (189x speedup)
- Found specialized functions for symmetric matrices (2.37x speedup)
- Identified memory layout optimizations

### 2. Configuration Optimizations
- Diff-based evolution worked better than full rewrites for Gemini
- Temperature 0.4 provided a good balance between exploration and exploitation
- Island-based evolution maintained diversity

### 3. System Robustness
- 100% task completion rate after the data size adjustment
- No crashes or critical failures
- The checkpoint system allowed progress tracking

## What Didn't Work

### 1. Limited Optimization Discovery
- 6 out of 8 tasks showed minimal improvements (<7%)
- Most baseline implementations were already near-optimal
- Evolution struggled to improve already-optimized code

### 2. Initial Configuration Issues
- Original data_size values caused timeouts
- Required manual intervention to adjust parameters
- Cascade evaluation timing wasn't initially accounted for

### 3. Minor Perturbations vs. Real Optimizations
- Many "improvements" were just measurement noise
- Small type conversions counted as optimizations
- Difficult to distinguish real improvements from variance
## Lessons Learned

### 1. Evaluation Complexity
- Must account for the total execution count (trials × timing runs + warmup)
- Cascade evaluation adds significant overhead
- Timeout settings need careful calibration

### 2. Baseline Quality Matters
- Well-optimized baselines leave little room for improvement
- AlgoTune baselines already use efficient libraries (scipy, numpy)
- Major improvements are only possible with algorithmic changes

### 3. Evolution Effectiveness
- Works best when alternative algorithms exist (convolve2d → fftconvolve)
- Can find specialized functions (eig → eigh)
- Struggles with micro-optimizations
## Recommendations for Future Experiments

### 1. Task Selection
- Include tasks with known suboptimal baseline implementations
- Add problems where multiple algorithmic approaches exist
- Consider more complex optimization scenarios

### 2. Configuration Tuning
- Pre-calculate total execution time when sizing data
- Consider reducing trials/runs for faster iteration
- Adjust timeouts based on actual execution patterns

### 3. Model Comparison Setup
For comparing with other models (e.g., Claude, GPT-4):
- Use identical configuration parameters
- Run on the same hardware for a fair comparison
- Track both speedup and code quality metrics
- Document any model-specific adjustments needed
## Conclusion

The Gemini Flash 2.5 experiment demonstrated OpenEvolve's capability to discover significant algorithmic improvements when they exist. The system achieved a 189.94x speedup on 2D convolution by automatically discovering FFT-based methods, and a 2.37x speedup on PSD projection through specialized matrix operations.

However, the experiment also showed that for well-optimized baseline implementations, evolution produces minimal gains. The 25% rate of meaningful optimizations suggests that careful task selection is crucial for demonstrating the effectiveness of evolutionary code optimization.

### Next Steps
1. Run the identical benchmark with alternative LLM models
2. Compare optimization discovery rates across models
3. Analyze code quality and correctness across models
4. Document model-specific strengths and weaknesses

---

**Experiment Details:**
- Date: August 14, 2025
- Duration: 114.6 minutes
- Hardware: macOS (Darwin 24.5.0)
- OpenEvolve Version: Current main branch
- API Provider: OpenRouter
