Commit 102d419

fix
1 parent 490ced2 commit 102d419

5 files changed: +152 additions, -58 deletions

README.md
Lines changed: 23 additions & 0 deletions

@@ -377,6 +377,29 @@ database:
     correctness: 15 # 15 bins for correctness (from YOUR evaluator)
 ```

+**CRITICAL: Return Raw Values, Not Bin Indices**: For custom feature dimensions, your evaluator must return **raw continuous values**, not pre-computed bin indices. OpenEvolve handles all scaling and binning internally.
+
+```python
+# ✅ CORRECT: Return raw values
+return {
+    "combined_score": 0.85,
+    "prompt_length": 1247,   # Actual character count
+    "execution_time": 0.234  # Raw time in seconds
+}
+
+# ❌ WRONG: Don't return bin indices
+return {
+    "combined_score": 0.85,
+    "prompt_length": 7,   # Pre-computed bin index
+    "execution_time": 3   # Pre-computed bin index
+}
+```
+
+OpenEvolve automatically handles:
+- Min-max scaling to [0,1] range
+- Binning into the specified number of bins
+- Adaptive scaling as the value range expands during evolution
+
 **Important**: OpenEvolve will raise an error if a specified feature is not found in the evaluator's metrics. This ensures your configuration is correct. The error message will show available metrics to help you fix the configuration.

 See the [Configuration Guide](configs/default_config.yaml) for a full list of options.
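To make the "automatically handles" steps above concrete, here is a minimal sketch of min-max scaling followed by binning. The `value_to_bin` helper and the example numbers are illustrative assumptions, not OpenEvolve's actual API or internal code.

```python
# Illustrative sketch only: how min-max scaling plus binning can turn a raw
# metric value into a MAP-Elites bin index. Not OpenEvolve's internal code.

def value_to_bin(raw_value: float, observed_min: float, observed_max: float, num_bins: int) -> int:
    """Scale raw_value to [0, 1] against the observed range, then pick a bin."""
    if observed_max <= observed_min:
        return 0  # Degenerate range: everything lands in the first bin
    scaled = (raw_value - observed_min) / (observed_max - observed_min)
    scaled = max(0.0, min(1.0, scaled))  # Clamp in case the range has since expanded
    return min(num_bins - 1, int(scaled * num_bins))

# A raw prompt length of 1247 characters, with observed lengths of 80-2600 so far,
# falls into bin 4 of a 10-bin dimension.
print(value_to_bin(1247, observed_min=80, observed_max=2600, num_bins=10))  # -> 4
```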

examples/README.md
Lines changed: 50 additions & 0 deletions

@@ -133,6 +133,56 @@ log_level: "INFO"
 ❌ **Wrong:** Multiple EVOLVE-BLOCK sections
 ✅ **Correct:** Exactly one EVOLVE-BLOCK section

+## MAP-Elites Feature Dimensions Best Practices
+
+When using custom feature dimensions, your evaluator must return **raw continuous values**, not pre-computed bin indices:
+
+### ✅ Correct: Return Raw Values
+```python
+def evaluate(program_path: str) -> Dict:
+    # Calculate actual measurements
+    prompt_length = len(generated_prompt)  # Actual character count
+    execution_time = measure_runtime()     # Time in seconds
+    memory_usage = get_peak_memory()       # Bytes used
+
+    return {
+        "combined_score": accuracy_score,
+        "prompt_length": prompt_length,    # Raw count, not bin index
+        "execution_time": execution_time,  # Raw seconds, not bin index
+        "memory_usage": memory_usage       # Raw bytes, not bin index
+    }
+```
+
+### ❌ Wrong: Return Bin Indices
+```python
+def evaluate(program_path: str) -> Dict:
+    prompt_length = len(generated_prompt)
+
+    # DON'T DO THIS - pre-computing bins
+    if prompt_length < 100:
+        length_bin = 0
+    elif prompt_length < 500:
+        length_bin = 1
+    # ... more binning logic
+
+    return {
+        "combined_score": accuracy_score,
+        "prompt_length": length_bin,  # ❌ This is a bin index, not raw value
+    }
+```
+
+### Why This Matters
+- OpenEvolve uses min-max scaling internally
+- Bin indices get incorrectly scaled as if they were raw values
+- Grid positions become unstable as new programs change the min/max range
+- This violates MAP-Elites principles and leads to poor evolution
+
+### Examples of Good Feature Dimensions
+- **Counts**: Token count, line count, character count
+- **Performance**: Execution time, memory usage, throughput
+- **Quality**: Accuracy, precision, recall, F1 score
+- **Complexity**: Cyclomatic complexity, nesting depth, function count
+
 ## Running Your Example

 ```bash
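The "Why This Matters" points above can be demonstrated with a small sketch. The `rescale` helper is a stand-in assumption for the database's internal scaling and binning, not OpenEvolve code; it shows how pre-computed bin indices get re-scaled as if they were raw values, so a program's grid cell drifts whenever the observed min/max changes.

```python
# Stand-in for the database's min-max scaling + binning; illustrative only.
def rescale(values, num_bins=10):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    return [min(num_bins - 1, int((v - lo) / (hi - lo) * num_bins)) for v in values]

# Evaluator returns pre-computed bin indices 0, 1, 7 (intending cells 0, 1, 7):
print(rescale([0, 1, 7]))     # [0, 1, 9] -- "bin 7" is re-binned into cell 9
# A new program widens the observed range, and the same program changes cell:
print(rescale([0, 1, 7, 9]))  # [0, 1, 7, 9] -- it now sits in cell 7 instead
```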

examples/llm_prompt_optimization/evaluator.py
Lines changed: 58 additions & 57 deletions

@@ -61,38 +61,28 @@

 def calculate_prompt_features(prompt):
     """
-    Calculate custom features for MAP-Elites binning
+    Calculate custom features for MAP-Elites
+
+    IMPORTANT: Returns raw continuous values, not bin indices.
+    The database handles all scaling and binning automatically.

     Returns:
-        tuple: (prompt_length, reasoning_strategy) - both in range 0-9
+        tuple: (prompt_length, reasoning_sophistication_score)
+        - prompt_length: Actual character count
+        - reasoning_sophistication_score: Continuous score 0.0-1.0
     """
-    # Feature 1: Prompt length bin (0-9)
-    length = len(prompt)
-    if length < 100:
-        prompt_length = 0  # Minimal
-    elif length < 200:
-        prompt_length = 1  # Very short
-    elif length < 400:
-        prompt_length = 2  # Short
-    elif length < 600:
-        prompt_length = 3  # Medium-short
-    elif length < 900:
-        prompt_length = 4  # Medium
-    elif length < 1200:
-        prompt_length = 5  # Medium-long
-    elif length < 1600:
-        prompt_length = 6  # Long
-    elif length < 2000:
-        prompt_length = 7  # Very long
-    elif length < 2500:
-        prompt_length = 8  # Extensive
-    else:
-        prompt_length = 9  # Very extensive
+    # Feature 1: Prompt length (raw character count)
+    prompt_length = len(prompt)

-    # Feature 2: Reasoning strategy (0-9)
+    # Feature 2: Reasoning sophistication score (continuous 0.0-1.0)
     prompt_lower = prompt.lower()
+    sophistication_score = 0.0
+
+    # Base scoring
+    if len(prompt) >= 100:
+        sophistication_score += 0.1  # Has substantial content

-    # Check for few-shot examples
+    # Check for few-shot examples (high sophistication)
     has_example = (
         "example" in prompt_lower
         or prompt.count("####") >= 4

@@ -107,33 +97,40 @@ def calculate_prompt_features(prompt):
         or bool(re.search(r"(first|then|next|finally)", prompt_lower))
     )

-    # Assign reasoning strategy bins
+    # Check for directive language
+    has_directive = "solve" in prompt_lower or "calculate" in prompt_lower
+
+    # Check for strict language
+    has_strict = "must" in prompt_lower or "exactly" in prompt_lower
+
+    # Calculate sophistication score
     if has_example:
-        # Few-shot examples (bins 7-9)
+        sophistication_score += 0.6  # Few-shot examples are sophisticated
         if has_cot:
-            reasoning_strategy = 9  # Few-shot + CoT (most sophisticated)
-        elif length > 1500:
-            reasoning_strategy = 8  # Extensive few-shot
+            sophistication_score += 0.3  # Few-shot + CoT is most sophisticated
+        elif len(prompt) > 1500:
+            sophistication_score += 0.2  # Extensive few-shot
         else:
-            reasoning_strategy = 7  # Basic few-shot
+            sophistication_score += 0.1  # Basic few-shot
     elif has_cot:
-        # Chain-of-thought (bins 4-6)
-        if "must" in prompt_lower or "exactly" in prompt_lower:
-            reasoning_strategy = 6  # Strict CoT
-        elif length > 500:
-            reasoning_strategy = 5  # Detailed CoT
+        sophistication_score += 0.4  # Chain-of-thought
+        if has_strict:
+            sophistication_score += 0.2  # Strict CoT
+        elif len(prompt) > 500:
+            sophistication_score += 0.15  # Detailed CoT
         else:
-            reasoning_strategy = 4  # Basic CoT
+            sophistication_score += 0.1  # Basic CoT
     else:
-        # Basic prompts (bins 0-3)
-        if length < 100:
-            reasoning_strategy = 0  # Minimal
-        elif "solve" in prompt_lower or "calculate" in prompt_lower:
-            reasoning_strategy = 2  # Direct instruction
+        # Basic prompts
+        if has_directive:
+            sophistication_score += 0.2  # Direct instruction
         else:
-            reasoning_strategy = 1  # Simple prompt
+            sophistication_score += 0.1  # Simple prompt
+
+    # Ensure score is within 0.0-1.0 range
+    sophistication_score = min(1.0, max(0.0, sophistication_score))

-    return prompt_length, reasoning_strategy
+    return prompt_length, sophistication_score


 def load_prompt_config(prompt_path):

@@ -492,13 +489,15 @@ def evaluate_stage1(prompt_path):
         print("-" * 80)

         # Calculate custom features
-        prompt_length, reasoning_strategy = calculate_prompt_features(prompt)
-        print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}")
+        prompt_length, reasoning_sophistication = calculate_prompt_features(prompt)
+        print(
+            f"Prompt features - Length: {prompt_length} chars, Reasoning sophistication: {reasoning_sophistication:.3f}"
+        )

         return {
             "combined_score": accuracy,
             "prompt_length": prompt_length,
-            "reasoning_strategy": reasoning_strategy,
+            "reasoning_strategy": reasoning_sophistication,
         }

     except Exception as e:

@@ -511,15 +510,15 @@ def evaluate_stage1(prompt_path):
             # Try to calculate features from the failed prompt
             with open(prompt_path, "r") as f:
                 failed_prompt = f.read().strip()
-            prompt_length, reasoning_strategy = calculate_prompt_features(failed_prompt)
+            prompt_length, reasoning_sophistication = calculate_prompt_features(failed_prompt)
         except:
             # Fallback values if prompt can't be read
-            prompt_length, reasoning_strategy = 0, 0
+            prompt_length, reasoning_sophistication = 0, 0.0

         return {
             "combined_score": 0.0,
             "prompt_length": prompt_length,
-            "reasoning_strategy": reasoning_strategy,
+            "reasoning_strategy": reasoning_sophistication,
             "error": str(e),
         }


@@ -560,13 +559,15 @@ def evaluate_stage2(prompt_path):
         print("-" * 80)

         # Calculate custom features
-        prompt_length, reasoning_strategy = calculate_prompt_features(prompt)
-        print(f"Prompt features - Length bin: {prompt_length}, Reasoning bin: {reasoning_strategy}")
+        prompt_length, reasoning_sophistication = calculate_prompt_features(prompt)
+        print(
+            f"Prompt features - Length: {prompt_length} chars, Reasoning sophistication: {reasoning_sophistication:.3f}"
+        )

         return {
             "combined_score": accuracy,
             "prompt_length": prompt_length,
-            "reasoning_strategy": reasoning_strategy,
+            "reasoning_strategy": reasoning_sophistication,
         }

     except Exception as e:

@@ -579,15 +580,15 @@ def evaluate_stage2(prompt_path):
             # Try to calculate features from the failed prompt
             with open(prompt_path, "r") as f:
                 failed_prompt = f.read().strip()
-            prompt_length, reasoning_strategy = calculate_prompt_features(failed_prompt)
+            prompt_length, reasoning_sophistication = calculate_prompt_features(failed_prompt)
         except:
             # Fallback values if prompt can't be read
-            prompt_length, reasoning_strategy = 0, 0
+            prompt_length, reasoning_sophistication = 0, 0.0

         return {
             "combined_score": 0.0,
             "prompt_length": prompt_length,
-            "reasoning_strategy": reasoning_strategy,
+            "reasoning_strategy": reasoning_sophistication,
             "error": str(e),
         }
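A quick usage sketch of the updated feature function shown in the diff above. The import path and the sample prompt are assumptions for illustration, and the commented values are approximate.

```python
# Assumes calculate_prompt_features as defined in the diff above; the import
# path and sample prompt are hypothetical.
from evaluator import calculate_prompt_features

prompt = (
    "Let's think step by step. First, identify the numbers in the problem, "
    "then calculate the answer. You must show your reasoning exactly."
)
length, sophistication = calculate_prompt_features(prompt)
print(length)          # Raw character count (~134), not a 0-9 length bin
print(sophistication)  # Continuous score in [0.0, 1.0] (roughly 0.7 here), not a strategy bin
```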

openevolve/config.py
Lines changed: 13 additions & 1 deletion

@@ -186,7 +186,19 @@ class DatabaseConfig:

     # Feature map dimensions for MAP-Elites
     # Default to complexity and diversity for better exploration
-    feature_dimensions: List[str] = field(default_factory=lambda: ["complexity", "diversity"])
+    # CRITICAL: For custom dimensions, evaluators must return RAW VALUES, not bin indices
+    # Built-in: "complexity", "diversity", "score" (always available)
+    # Custom: Any metric from your evaluator (must be continuous values)
+    feature_dimensions: List[str] = field(
+        default_factory=lambda: ["complexity", "diversity"],
+        metadata={
+            "help": "List of feature dimensions for MAP-Elites grid. "
+            "Built-in dimensions: 'complexity', 'diversity', 'score'. "
+            "Custom dimensions: Must match metric names from evaluator. "
+            "IMPORTANT: Evaluators must return raw continuous values for custom dimensions, "
+            "NOT pre-computed bin indices. OpenEvolve handles all scaling and binning internally."
+        }
+    )
     feature_bins: Union[int, Dict[str, int]] = 10  # Can be int (all dims) or dict (per-dim)
     diversity_reference_size: int = 20  # Size of reference set for diversity calculation
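A short sketch of how the fields visible in this hunk might be set from code. It assumes DatabaseConfig's remaining fields all have defaults, and the custom dimension names ("prompt_length", "execution_time") are hypothetical metric names that an evaluator would return as raw values.

```python
# Sketch under assumptions: only the fields shown in this hunk are used, and
# the custom dimension names are hypothetical evaluator metric names.
from openevolve.config import DatabaseConfig

db_config = DatabaseConfig(
    # Custom dimensions must match metric names returned by the evaluator
    feature_dimensions=["complexity", "prompt_length", "execution_time"],
    # feature_bins may be a single int for all dimensions or a per-dimension dict
    feature_bins={"complexity": 10, "prompt_length": 10, "execution_time": 5},
)
print(db_config.feature_dimensions)
```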

openevolve/evaluation_result.py
Lines changed: 8 additions & 0 deletions

@@ -14,6 +14,14 @@ class EvaluationResult:

     This maintains backward compatibility with the existing dict[str, float] contract
     while adding a side-channel for arbitrary artifacts (text or binary data).
+
+    IMPORTANT: For custom MAP-Elites features, metrics values must be raw continuous
+    scores (e.g., actual counts, percentages, continuous measurements), NOT pre-computed
+    bin indices. The database handles all binning internally using min-max scaling.
+
+    Examples:
+        ✅ Correct: {"combined_score": 0.85, "prompt_length": 1247, "execution_time": 0.234}
+        ❌ Wrong: {"combined_score": 0.85, "prompt_length": 7, "execution_time": 3}
     """

     metrics: Dict[str, float]  # mandatory - existing contract
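For completeness, a minimal construction sketch based only on the `metrics` field visible in this excerpt; any other EvaluationResult fields (such as the artifacts side-channel mentioned in the docstring) are assumed here to be optional.

```python
# Minimal sketch: pass raw continuous metric values, never pre-computed bin indices.
from openevolve.evaluation_result import EvaluationResult

result = EvaluationResult(
    metrics={
        "combined_score": 0.85,
        "prompt_length": 1247.0,   # Raw character count
        "execution_time": 0.234,   # Raw seconds
    }
)
print(result.metrics["prompt_length"])
```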
