
Commit accd01c

Refactor prompt-dataset config matching and add emotion benchmark
Updated the evaluator to automatically match prompt files with their corresponding dataset configuration using a naming convention. Added emotion classification benchmark files (`emotion_prompt.txt`, `emotion_prompt_dataset.yaml`) and a wrapper script (`run_evolution.sh`) for easier execution. Deprecated and removed old example files, and improved documentation in the README to reflect the new workflow and dataset handling.
1 parent b7a9d82 commit accd01c

File tree

11 files changed (+150, -76 lines)
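The naming convention described in the commit message reduces to a one-line filename transformation. A minimal sketch of that mapping (it mirrors the logic added to `evaluator.py` further down; the helper name `dataset_config_for` is illustrative, not part of the codebase):

```python
import os

def dataset_config_for(prompt_file):
    """Illustrative helper: map a prompt file to its dataset config per the naming convention."""
    if not prompt_file:
        return "dataset_config.yaml"  # fallback used when OPENEVOLVE_PROMPT is not set
    basename = os.path.basename(prompt_file)
    # xxx_prompt.txt -> xxx_prompt_dataset.yaml; any other .txt -> xxx_dataset.yaml
    return basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml')

print(dataset_config_for("emotion_prompt.txt"))   # emotion_prompt_dataset.yaml
print(dataset_config_for("initial_prompt.txt"))   # initial_prompt_dataset.yaml
```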

examples/llm_prompt_optimization/README.md

Lines changed: 44 additions & 22 deletions
@@ -10,7 +10,7 @@ OpenEvolve automatically:
 - Uses cascading evaluation for efficiency
 - Finds optimal prompts for your specific task and model

-The system uses a clean YAML format for configuration, making it easy to set up prompt optimization for any dataset.
+**Key Feature**: The evaluator automatically matches prompt files with dataset configurations using a naming convention (`xxx_prompt.txt` → `xxx_prompt_dataset.yaml`), making it easy to manage multiple benchmark tasks.

 ## 🚀 Quick Start

@@ -36,52 +36,74 @@ llm:

 ### 3. Set Up Your Dataset and Prompt

-Configure your dataset in `dataset.yaml`:
+This example uses a naming convention to match prompts with their dataset configurations:
+- For a prompt file `xxx_prompt.txt`, create a matching `xxx_prompt_dataset.yaml`
+- For example: `emotion_prompt.txt` uses `emotion_prompt_dataset.yaml`
+
+Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`):

 ```yaml
 # HuggingFace dataset configuration
-dataset_name: "stanfordnlp/imdb"   # Any HuggingFace dataset
+dataset_name: "dair-ai/emotion"    # Any HuggingFace dataset
 input_field: "text"                # Field containing input data
 target_field: "label"              # Field containing ground truth
 split: "test"                      # Dataset split to use

 # Evaluation samples
-max_samples: 50                    # Number of samples to evaluate
+max_samples: 200                   # Number of samples to evaluate
 ```

-Create your initial prompt in `initial_prompt.txt`:
+Create your initial prompt file (e.g., `emotion_prompt.txt`):

 ```
-Your initial prompt here with {input_text} as placeholder
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
 ```

 ### 4. Run OpenEvolve

+Use the provided `run_evolution.sh` script to ensure the correct dataset is used:
+
 ```bash
-python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
+# For emotion classification benchmark
+./run_evolution.sh emotion_prompt.txt --iterations 50
+
+# For IMDB sentiment analysis
+./run_evolution.sh initial_prompt.txt --iterations 50
+
+# With custom iterations and checkpoint
+./run_evolution.sh emotion_prompt.txt --iterations 100 --checkpoint-interval 20
 ```

-The system will:
-- Evolve the prompt in `initial_prompt.txt`
-- Use dataset configuration from `dataset.yaml`
-- Test evolved prompts against the HuggingFace dataset
+The script automatically:
+- Sets the `OPENEVOLVE_PROMPT` environment variable so the evaluator knows which dataset to use
+- Passes all additional arguments to OpenEvolve
+- Ensures the correct `_dataset.yaml` file is matched with your prompt
+
+**Note**: If you prefer to run OpenEvolve directly, set the environment variable first:
+```bash
+export OPENEVOLVE_PROMPT=emotion_prompt.txt
+python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.yaml --iterations 50
+```

 ## 📊 Supported Datasets

-This optimizer works with any HuggingFace dataset. Example configurations are provided in the `examples/` directory:
+This optimizer works with any HuggingFace dataset. Included examples:

-- **AG News**: `ag_news_dataset.yaml` + `ag_news_prompt.txt`
-- **Emotion**: `emotion_dataset.yaml` + `emotion_prompt.txt`
+- **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification)
+- **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy)

-To use an example:
-```bash
-# Copy the example files
-cp examples/ag_news_dataset.yaml dataset.yaml
-cp examples/ag_news_prompt.txt initial_prompt.txt
+### Creating New Tasks

-# Run optimization
-python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
-```
+To add a new dataset:
+1. Create `yourtask_prompt.txt` with the initial prompt
+2. Create `yourtask_prompt_dataset.yaml` with the dataset configuration
+3. Run: `./run_evolution.sh yourtask_prompt.txt --iterations 50`
+
+**Note**: If you call OpenEvolve directly without the wrapper script, the evaluator will look for a default `dataset_config.yaml` file.

 ### Common Dataset Configurations:
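The `run_evolution.sh` wrapper itself is not shown in this excerpt. Based on the README's description (export `OPENEVOLVE_PROMPT`, then forward the remaining arguments to `openevolve-run.py`), a rough Python equivalent might look like the sketch below; the actual script is a shell wrapper and may differ in detail:

```python
import os
import subprocess
import sys

def run_evolution(prompt_file, *extra_args):
    """Rough equivalent of run_evolution.sh: set OPENEVOLVE_PROMPT, then launch OpenEvolve."""
    env = dict(os.environ, OPENEVOLVE_PROMPT=prompt_file)
    cmd = [
        sys.executable, "../../openevolve-run.py",
        prompt_file, "evaluator.py",
        "--config", "config.yaml",
        *extra_args,  # e.g. --iterations 50 --checkpoint-interval 20
    ]
    return subprocess.run(cmd, env=env).returncode

if __name__ == "__main__":
    # Usage mirrors the wrapper: python run_sketch.py emotion_prompt.txt --iterations 50
    sys.exit(run_evolution(sys.argv[1], *sys.argv[2:]))
```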

examples/llm_prompt_optimization/dataset_config.yaml

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# Default dataset configuration (fallback when not using run_evolution.sh)
+# This is used when OpenEvolve is called directly without setting OPENEVOLVE_PROMPT
+dataset_name: "stanfordnlp/imdb"
+input_field: "text"
+target_field: "label"  # 0 or 1
+split: "test"
+
+# Evaluation samples
+max_samples: 50
examples/llm_prompt_optimization/emotion_prompt.txt

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
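The `{input_text}` placeholder in this prompt file is presumably filled per dataset sample before the request is sent to the task model; the exact substitution code is not part of this excerpt, but a minimal sketch under that assumption:

```python
# Minimal sketch: fill the {input_text} placeholder for one hypothetical dataset row.
with open("emotion_prompt.txt") as f:
    template = f.read().strip()

row = {"text": "i feel a little on edge today", "label": 4}  # hypothetical sample; 4 = fear
print(template.format(input_text=row["text"]))
```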
examples/llm_prompt_optimization/emotion_prompt_dataset.yaml

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
+# HuggingFace dataset configuration for emotion classification
+# This is a standard benchmark used by DSPy and others
+dataset_name: "dair-ai/emotion"
+input_field: "text"
+target_field: "label"  # 0-5: sadness, joy, love, anger, fear, surprise
+split: "test"
+
+# Evaluation samples
+max_samples: 200  # Larger sample for 6-class problem
+
+# Labels mapping for reference
+label_names:
+  0: "sadness"
+  1: "joy"
+  2: "love"
+  3: "anger"
+  4: "fear"
+  5: "surprise"

examples/llm_prompt_optimization/evaluator.py

Lines changed: 57 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -36,18 +36,32 @@
3636
test_model = OpenAI(base_url=api_base)
3737
print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}")
3838

39+
# Determine which dataset to use based on the OPENEVOLVE_PROMPT environment variable
40+
import sys
41+
prompt_file = os.environ.get('OPENEVOLVE_PROMPT')
42+
if not prompt_file:
43+
# Default to a generic dataset config if not using the wrapper script
44+
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
45+
DATASET_CONFIG_PATH = os.path.join(evaluator_dir, 'dataset_config.yaml')
46+
print("Warning: OPENEVOLVE_PROMPT not set. Using default dataset_config.yaml")
47+
else:
48+
basename = os.path.basename(prompt_file)
49+
dataset_filename = basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml')
50+
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
51+
DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename)
52+
print(f"Dataset configuration: {dataset_filename}")
53+
3954
def load_prompt_config(prompt_path):
40-
"""Load the prompt from text file and dataset config from dataset.yaml."""
55+
"""Load the prompt from text file and dataset config from matching _dataset.yaml file."""
4156
# Load prompt from text file
4257
with open(prompt_path, 'r') as f:
4358
prompt = f.read().strip()
4459

45-
# Always load dataset configuration from the examples directory
46-
# This ensures it works even when OpenEvolve copies files to temp directories
47-
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
48-
config_path = os.path.join(evaluator_dir, 'dataset.yaml')
60+
# Load the configuration (already determined from environment variable)
61+
if not os.path.exists(DATASET_CONFIG_PATH):
62+
raise FileNotFoundError(f"Dataset configuration not found: {DATASET_CONFIG_PATH}")
4963

50-
with open(config_path, 'r') as f:
64+
with open(DATASET_CONFIG_PATH, 'r') as f:
5165
config = yaml.safe_load(f)
5266

5367
return config, prompt
@@ -75,6 +89,9 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
7589
input_field = config['input_field']
7690
target_field = config['target_field']
7791

92+
# Check if this is emotion classification (0-5) or sentiment (0-1)
93+
is_emotion = 'emotion' in config.get('dataset_name', '').lower()
94+
7895
# Sample from dataset
7996
samples = dataset.select(range(min(num_samples, len(dataset))))
8097

@@ -97,7 +114,7 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
97114
model=TASK_MODEL_NAME,
98115
messages=messages,
99116
temperature=0.1, # Low temperature for consistent classification
100-
max_tokens=10 # We only need a short response
117+
max_tokens=20 # Allow slightly more tokens for emotion labels
101118
)
102119
break
103120
except Exception as e:
@@ -133,19 +150,41 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
133150

134151
# Extract prediction from output
135152
try:
136-
# Look for a number (0 or 1) in the output
137-
numbers = re.findall(r'\b[01]\b', output_text)
138-
if numbers:
139-
prediction = int(numbers[-1]) # Use the last number found
153+
if is_emotion:
154+
# For emotion classification (0-5)
155+
numbers = re.findall(r'\b[0-5]\b', output_text)
156+
if numbers:
157+
prediction = int(numbers[-1]) # Use the last number found
158+
else:
159+
# Try to infer from emotion keywords
160+
output_lower = output_text.lower()
161+
emotion_map = {
162+
'sadness': 0, 'sad': 0,
163+
'joy': 1, 'happy': 1, 'happiness': 1,
164+
'love': 2,
165+
'anger': 3, 'angry': 3,
166+
'fear': 4, 'afraid': 4, 'scared': 4,
167+
'surprise': 5, 'surprised': 5
168+
}
169+
prediction = -1
170+
for emotion, label in emotion_map.items():
171+
if emotion in output_lower:
172+
prediction = label
173+
break
140174
else:
141-
# Try to infer from keywords
142-
output_lower = output_text.lower()
143-
if 'positive' in output_lower:
144-
prediction = 1
145-
elif 'negative' in output_lower:
146-
prediction = 0
175+
# For sentiment classification (0-1)
176+
numbers = re.findall(r'\b[01]\b', output_text)
177+
if numbers:
178+
prediction = int(numbers[-1]) # Use the last number found
147179
else:
148-
prediction = -1 # Invalid prediction
180+
# Try to infer from keywords
181+
output_lower = output_text.lower()
182+
if 'positive' in output_lower:
183+
prediction = 1
184+
elif 'negative' in output_lower:
185+
prediction = 0
186+
else:
187+
prediction = -1 # Invalid prediction
149188

150189
if prediction == expected:
151190
correct += 1
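For reference, the label-extraction branch added above can be exercised on its own. The sketch below restates that logic as a standalone function and runs it on a few illustrative model outputs (the sample strings are made up, not taken from real runs):

```python
import re

def extract_prediction(output_text, is_emotion):
    """Standalone restatement of the label-extraction logic in the diff above, for quick testing."""
    output_lower = output_text.lower()
    if is_emotion:
        # Emotion classification: prefer an explicit digit 0-5, else fall back to keywords.
        numbers = re.findall(r'\b[0-5]\b', output_text)
        if numbers:
            return int(numbers[-1])
        emotion_map = {
            'sadness': 0, 'sad': 0,
            'joy': 1, 'happy': 1, 'happiness': 1,
            'love': 2,
            'anger': 3, 'angry': 3,
            'fear': 4, 'afraid': 4, 'scared': 4,
            'surprise': 5, 'surprised': 5,
        }
        for emotion, label in emotion_map.items():
            if emotion in output_lower:
                return label
        return -1
    # Sentiment classification: prefer an explicit 0/1, else fall back to positive/negative keywords.
    numbers = re.findall(r'\b[01]\b', output_text)
    if numbers:
        return int(numbers[-1])
    if 'positive' in output_lower:
        return 1
    if 'negative' in output_lower:
        return 0
    return -1

# Illustrative outputs and how they parse:
assert extract_prediction("Emotion: 3 (anger)", is_emotion=True) == 3
assert extract_prediction("The text expresses joy.", is_emotion=True) == 1
assert extract_prediction("Sentiment: positive", is_emotion=False) == 1
```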

examples/llm_prompt_optimization/examples/ag_news_dataset.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/ag_news_prompt.txt

Lines changed: 0 additions & 9 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/emotion_dataset.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/emotion_prompt.txt

Lines changed: 0 additions & 11 deletions
This file was deleted.
