
Commit accd01c

Refactor prompt-dataset config matching and add emotion benchmark
Updated the evaluator to automatically match prompt files with their corresponding dataset configuration using a naming convention. Added emotion classification benchmark files (`emotion_prompt.txt`, `emotion_prompt_dataset.yaml`) and a wrapper script (`run_evolution.sh`) for easier execution. Deprecated and removed old example files, and improved documentation in the README to reflect the new workflow and dataset handling.
1 parent b7a9d82 commit accd01c

File tree

11 files changed (+150, -76 lines)
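The naming convention described in the commit message reduces to a one-line filename transformation. A minimal sketch of that mapping (it mirrors the logic added to `evaluator.py` further down; the helper name `dataset_config_for` is illustrative, not part of the codebase):

```python
import os

def dataset_config_for(prompt_file):
    """Illustrative helper: map a prompt file to its dataset config per the naming convention."""
    if not prompt_file:
        return "dataset_config.yaml"  # fallback used when OPENEVOLVE_PROMPT is not set
    basename = os.path.basename(prompt_file)
    # xxx_prompt.txt -> xxx_prompt_dataset.yaml; any other .txt -> xxx_dataset.yaml
    return basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml')

print(dataset_config_for("emotion_prompt.txt"))   # emotion_prompt_dataset.yaml
print(dataset_config_for("initial_prompt.txt"))   # initial_prompt_dataset.yaml
```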

examples/llm_prompt_optimization/README.md

Lines changed: 44 additions & 22 deletions
@@ -10,7 +10,7 @@ OpenEvolve automatically:
 - Uses cascading evaluation for efficiency
 - Finds optimal prompts for your specific task and model

-The system uses a clean YAML format for configuration, making it easy to set up prompt optimization for any dataset.
+**Key Feature**: The evaluator automatically matches prompt files with dataset configurations using a naming convention (`xxx_prompt.txt` → `xxx_prompt_dataset.yaml`), making it easy to manage multiple benchmark tasks.

 ## 🚀 Quick Start

@@ -36,52 +36,74 @@ llm:

 ### 3. Set Up Your Dataset and Prompt

-Configure your dataset in `dataset.yaml`:
+This example uses a naming convention to match prompts with their dataset configurations:
+- For a prompt file `xxx_prompt.txt`, create a matching `xxx_prompt_dataset.yaml`
+- For example: `emotion_prompt.txt` uses `emotion_prompt_dataset.yaml`
+
+Create your dataset configuration file (e.g., `emotion_prompt_dataset.yaml`):

 ```yaml
 # HuggingFace dataset configuration
-dataset_name: "stanfordnlp/imdb"   # Any HuggingFace dataset
+dataset_name: "dair-ai/emotion"    # Any HuggingFace dataset
 input_field: "text"                # Field containing input data
 target_field: "label"              # Field containing ground truth
 split: "test"                      # Dataset split to use

 # Evaluation samples
-max_samples: 50                    # Number of samples to evaluate
+max_samples: 200                   # Number of samples to evaluate
 ```

-Create your initial prompt in `initial_prompt.txt`:
+Create your initial prompt file (e.g., `emotion_prompt.txt`):

 ```
-Your initial prompt here with {input_text} as placeholder
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
 ```

 ### 4. Run OpenEvolve

+Use the provided `run_evolution.sh` script to ensure the correct dataset is used:
+
 ```bash
-python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
+# For emotion classification benchmark
+./run_evolution.sh emotion_prompt.txt --iterations 50
+
+# For IMDB sentiment analysis
+./run_evolution.sh initial_prompt.txt --iterations 50
+
+# With custom iterations and checkpoint
+./run_evolution.sh emotion_prompt.txt --iterations 100 --checkpoint-interval 20
 ```

-The system will:
-- Evolve the prompt in `initial_prompt.txt`
-- Use dataset configuration from `dataset.yaml`
-- Test evolved prompts against the HuggingFace dataset
+The script automatically:
+- Sets the `OPENEVOLVE_PROMPT` environment variable so the evaluator knows which dataset to use
+- Passes all additional arguments to OpenEvolve
+- Ensures the correct `_dataset.yaml` file is matched with your prompt
+
+**Note**: If you prefer to run OpenEvolve directly, set the environment variable first:
+```bash
+export OPENEVOLVE_PROMPT=emotion_prompt.txt
+python ../../openevolve-run.py emotion_prompt.txt evaluator.py --config config.yaml --iterations 50
+```

 ## 📊 Supported Datasets

-This optimizer works with any HuggingFace dataset. Example configurations are provided in the `examples/` directory:
+This optimizer works with any HuggingFace dataset. Included examples:

-- **AG News**: `ag_news_dataset.yaml` + `ag_news_prompt.txt`
-- **Emotion**: `emotion_dataset.yaml` + `emotion_prompt.txt`
+- **IMDB Sentiment**: `initial_prompt.txt` + `initial_prompt_dataset.yaml` (binary classification)
+- **Emotion**: `emotion_prompt.txt` + `emotion_prompt_dataset.yaml` (6-class, benchmark against DSPy)

-To use an example:
-```bash
-# Copy the example files
-cp examples/ag_news_dataset.yaml dataset.yaml
-cp examples/ag_news_prompt.txt initial_prompt.txt
+### Creating New Tasks

-# Run optimization
-python ../../openevolve-run.py initial_prompt.txt evaluator.py --config config.yaml --iterations 100
-```
+To add a new dataset:
+1. Create `yourtask_prompt.txt` with the initial prompt
+2. Create `yourtask_prompt_dataset.yaml` with the dataset configuration
+3. Run: `./run_evolution.sh yourtask_prompt.txt --iterations 50`
+
+**Note**: If you call OpenEvolve directly without the wrapper script, the evaluator will look for a default `dataset_config.yaml` file.

 ### Common Dataset Configurations:
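The `run_evolution.sh` wrapper itself is not shown in this excerpt. Based on the README's description (export `OPENEVOLVE_PROMPT`, then forward the remaining arguments to `openevolve-run.py`), a rough Python equivalent might look like the sketch below; the actual script is a shell wrapper and may differ in detail:

```python
import os
import subprocess
import sys

def run_evolution(prompt_file, *extra_args):
    """Rough equivalent of run_evolution.sh: set OPENEVOLVE_PROMPT, then launch OpenEvolve."""
    env = dict(os.environ, OPENEVOLVE_PROMPT=prompt_file)
    cmd = [
        sys.executable, "../../openevolve-run.py",
        prompt_file, "evaluator.py",
        "--config", "config.yaml",
        *extra_args,  # e.g. --iterations 50 --checkpoint-interval 20
    ]
    return subprocess.run(cmd, env=env).returncode

if __name__ == "__main__":
    # Usage mirrors the wrapper: python run_sketch.py emotion_prompt.txt --iterations 50
    sys.exit(run_evolution(sys.argv[1], *sys.argv[2:]))
```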

examples/llm_prompt_optimization/dataset_config.yaml

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# Default dataset configuration (fallback when not using run_evolution.sh)
+# This is used when OpenEvolve is called directly without setting OPENEVOLVE_PROMPT
+dataset_name: "stanfordnlp/imdb"
+input_field: "text"
+target_field: "label"  # 0 or 1
+split: "test"
+
+# Evaluation samples
+max_samples: 50
examples/llm_prompt_optimization/emotion_prompt.txt

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+Classify the emotion expressed in the following text.
+
+Text: "{input_text}"
+
+Emotion (0-5):
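The `{input_text}` placeholder in this prompt file is presumably filled per dataset sample before the request is sent to the task model; the exact substitution code is not part of this excerpt, but a minimal sketch under that assumption:

```python
# Minimal sketch: fill the {input_text} placeholder for one hypothetical dataset row.
with open("emotion_prompt.txt") as f:
    template = f.read().strip()

row = {"text": "i feel a little on edge today", "label": 4}  # hypothetical sample; 4 = fear
print(template.format(input_text=row["text"]))
```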
examples/llm_prompt_optimization/emotion_prompt_dataset.yaml

Lines changed: 18 additions & 0 deletions

@@ -0,0 +1,18 @@
+# HuggingFace dataset configuration for emotion classification
+# This is a standard benchmark used by DSPy and others
+dataset_name: "dair-ai/emotion"
+input_field: "text"
+target_field: "label"  # 0-5: sadness, joy, love, anger, fear, surprise
+split: "test"
+
+# Evaluation samples
+max_samples: 200  # Larger sample for 6-class problem
+
+# Labels mapping for reference
+label_names:
+  0: "sadness"
+  1: "joy"
+  2: "love"
+  3: "anger"
+  4: "fear"
+  5: "surprise"

examples/llm_prompt_optimization/evaluator.py

Lines changed: 57 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -36,18 +36,32 @@
3636
test_model = OpenAI(base_url=api_base)
3737
print(f"Initialized OpenAI client with model: {TASK_MODEL_NAME}")
3838

39+
# Determine which dataset to use based on the OPENEVOLVE_PROMPT environment variable
40+
import sys
41+
prompt_file = os.environ.get('OPENEVOLVE_PROMPT')
42+
if not prompt_file:
43+
# Default to a generic dataset config if not using the wrapper script
44+
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
45+
DATASET_CONFIG_PATH = os.path.join(evaluator_dir, 'dataset_config.yaml')
46+
print("Warning: OPENEVOLVE_PROMPT not set. Using default dataset_config.yaml")
47+
else:
48+
basename = os.path.basename(prompt_file)
49+
dataset_filename = basename.replace('_prompt.txt', '_prompt_dataset.yaml').replace('.txt', '_dataset.yaml')
50+
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
51+
DATASET_CONFIG_PATH = os.path.join(evaluator_dir, dataset_filename)
52+
print(f"Dataset configuration: {dataset_filename}")
53+
3954
def load_prompt_config(prompt_path):
40-
"""Load the prompt from text file and dataset config from dataset.yaml."""
55+
"""Load the prompt from text file and dataset config from matching _dataset.yaml file."""
4156
# Load prompt from text file
4257
with open(prompt_path, 'r') as f:
4358
prompt = f.read().strip()
4459

45-
# Always load dataset configuration from the examples directory
46-
# This ensures it works even when OpenEvolve copies files to temp directories
47-
evaluator_dir = os.path.dirname(os.path.abspath(__file__))
48-
config_path = os.path.join(evaluator_dir, 'dataset.yaml')
60+
# Load the configuration (already determined from environment variable)
61+
if not os.path.exists(DATASET_CONFIG_PATH):
62+
raise FileNotFoundError(f"Dataset configuration not found: {DATASET_CONFIG_PATH}")
4963

50-
with open(config_path, 'r') as f:
64+
with open(DATASET_CONFIG_PATH, 'r') as f:
5165
config = yaml.safe_load(f)
5266

5367
return config, prompt
@@ -75,6 +89,9 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
7589
input_field = config['input_field']
7690
target_field = config['target_field']
7791

92+
# Check if this is emotion classification (0-5) or sentiment (0-1)
93+
is_emotion = 'emotion' in config.get('dataset_name', '').lower()
94+
7895
# Sample from dataset
7996
samples = dataset.select(range(min(num_samples, len(dataset))))
8097

@@ -97,7 +114,7 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
97114
model=TASK_MODEL_NAME,
98115
messages=messages,
99116
temperature=0.1, # Low temperature for consistent classification
100-
max_tokens=10 # We only need a short response
117+
max_tokens=20 # Allow slightly more tokens for emotion labels
101118
)
102119
break
103120
except Exception as e:
@@ -133,19 +150,41 @@ def evaluate_prompt(prompt, dataset, config, num_samples):
133150

134151
# Extract prediction from output
135152
try:
136-
# Look for a number (0 or 1) in the output
137-
numbers = re.findall(r'\b[01]\b', output_text)
138-
if numbers:
139-
prediction = int(numbers[-1]) # Use the last number found
153+
if is_emotion:
154+
# For emotion classification (0-5)
155+
numbers = re.findall(r'\b[0-5]\b', output_text)
156+
if numbers:
157+
prediction = int(numbers[-1]) # Use the last number found
158+
else:
159+
# Try to infer from emotion keywords
160+
output_lower = output_text.lower()
161+
emotion_map = {
162+
'sadness': 0, 'sad': 0,
163+
'joy': 1, 'happy': 1, 'happiness': 1,
164+
'love': 2,
165+
'anger': 3, 'angry': 3,
166+
'fear': 4, 'afraid': 4, 'scared': 4,
167+
'surprise': 5, 'surprised': 5
168+
}
169+
prediction = -1
170+
for emotion, label in emotion_map.items():
171+
if emotion in output_lower:
172+
prediction = label
173+
break
140174
else:
141-
# Try to infer from keywords
142-
output_lower = output_text.lower()
143-
if 'positive' in output_lower:
144-
prediction = 1
145-
elif 'negative' in output_lower:
146-
prediction = 0
175+
# For sentiment classification (0-1)
176+
numbers = re.findall(r'\b[01]\b', output_text)
177+
if numbers:
178+
prediction = int(numbers[-1]) # Use the last number found
147179
else:
148-
prediction = -1 # Invalid prediction
180+
# Try to infer from keywords
181+
output_lower = output_text.lower()
182+
if 'positive' in output_lower:
183+
prediction = 1
184+
elif 'negative' in output_lower:
185+
prediction = 0
186+
else:
187+
prediction = -1 # Invalid prediction
149188

150189
if prediction == expected:
151190
correct += 1
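For reference, the label-extraction branch added above can be exercised on its own. The sketch below restates that logic as a standalone function and runs it on a few illustrative model outputs (the sample strings are made up, not taken from real runs):

```python
import re

def extract_prediction(output_text, is_emotion):
    """Standalone restatement of the label-extraction logic in the diff above, for quick testing."""
    output_lower = output_text.lower()
    if is_emotion:
        # Emotion classification: prefer an explicit digit 0-5, else fall back to keywords.
        numbers = re.findall(r'\b[0-5]\b', output_text)
        if numbers:
            return int(numbers[-1])
        emotion_map = {
            'sadness': 0, 'sad': 0,
            'joy': 1, 'happy': 1, 'happiness': 1,
            'love': 2,
            'anger': 3, 'angry': 3,
            'fear': 4, 'afraid': 4, 'scared': 4,
            'surprise': 5, 'surprised': 5,
        }
        for emotion, label in emotion_map.items():
            if emotion in output_lower:
                return label
        return -1
    # Sentiment classification: prefer an explicit 0/1, else fall back to positive/negative keywords.
    numbers = re.findall(r'\b[01]\b', output_text)
    if numbers:
        return int(numbers[-1])
    if 'positive' in output_lower:
        return 1
    if 'negative' in output_lower:
        return 0
    return -1

# Illustrative outputs and how they parse:
assert extract_prediction("Emotion: 3 (anger)", is_emotion=True) == 3
assert extract_prediction("The text expresses joy.", is_emotion=True) == 1
assert extract_prediction("Sentiment: positive", is_emotion=False) == 1
```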

examples/llm_prompt_optimization/examples/ag_news_dataset.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/ag_news_prompt.txt

Lines changed: 0 additions & 9 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/emotion_dataset.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

examples/llm_prompt_optimization/examples/emotion_prompt.txt

Lines changed: 0 additions & 11 deletions
This file was deleted.
