
Commit 64195f0

Merge pull request algorithmicsuperintelligence#118 from islammesabah/llm_prompt_optimazation_example
Add example for prompt optimization
2 parents bca23e8 + 0e458a7 commit 64195f0

8 files changed, +984 -0 lines changed
Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
# Evolving Better Prompts with OpenEvolve 🧠✨

This example shows how to use **OpenEvolve** to automatically optimize prompts for **Large Language Models (LLMs)**. Whether you're working on classification, summarization, generation, or code tasks, OpenEvolve helps you find high-performing prompts using **evolutionary search**. This example uses synthetic data for a sentiment analysis task, but you can adapt it to your own datasets and tasks.

---
## 🎯 What Is Prompt Optimization?

Prompt engineering is key to getting reliable outputs from LLMs—but finding the right prompt manually can be slow and inconsistent.

OpenEvolve automates this by:

* Generating and evolving prompt variations
* Testing them against your task and metrics
* Selecting the best prompts across generations

You start with a simple prompt and let OpenEvolve evolve it into something smarter and more effective.

---
## 🚀 Getting Started

### 1. Install Dependencies

```bash
cd examples/llm_prompt_optimazation
pip install -r requirements.txt
```
### 2. Add Your Models

1. Update your `config.yaml`:

```yaml
llm:
  primary_model: "llm_name"
  api_base: "llm_server_url"
  api_key: "your_api_key_here"
```
2. Update your task-model in `evaluator.py`:

```python
TASK_MODEL_NAME = "task_llm_name"
TASK_MODEL_URL = "task_llm_server_url"
TASK_MODEL_API_KEY = "your_api_key_here"
SAMPLE_SIZE = 25  # Number of samples to use for evaluation
MAX_RETRIES = 3  # Number of retries for LLM calls
```
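If your task model sits behind an OpenAI-compatible endpoint (which the URL-style settings above suggest), the evaluator can reuse these constants roughly as in the sketch below. This is a hedged illustration, not the actual code in `evaluator.py`; the `call_task_model` helper name is made up.

```python
# Illustrative only — assumes an OpenAI-compatible server and the constants defined above.
from openai import OpenAI

client = OpenAI(base_url=TASK_MODEL_URL, api_key=TASK_MODEL_API_KEY)


def call_task_model(prompt: str) -> str:
    """Send a prompt to the task model and return its raw text reply, retrying on errors."""
    for attempt in range(MAX_RETRIES):
        try:
            response = client.chat.completions.create(
                model=TASK_MODEL_NAME,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,  # deterministic outputs make scoring more stable
            )
            return response.choices[0].message.content.strip()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
```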
### 3. Run OpenEvolve

```bash
sh run.sh
```

---
## 🔧 How to Adapt This Template

### 1. Replace the Dataset

Edit `data.json` to match your use case:

```json
[
  {
    "id": 1,
    "input": "Your input here",
    "expected_output": "Target output"
  }
]
```
### 2. Customize the Evaluator

In `evaluator.py`, define how to evaluate a prompt:

* Load your data
* Call the LLM using the prompt
* Measure output quality (accuracy, score, etc.)

A rough sketch of what this can look like follows.
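It is illustrative only, not the actual `evaluator.py` shipped with this example: it assumes OpenEvolve calls an `evaluate(program_path)` function that returns a dict of metrics, and it reuses the constants and the `call_task_model` helper sketched above.

```python
# Illustrative evaluator sketch — names, metrics, and exact-match scoring are assumptions.
import json
import random


def evaluate(program_path: str) -> dict:
    """Score an evolved prompt file against a random sample of data.json."""
    with open(program_path) as f:
        prompt_template = f.read()  # the evolved prompt, {input_text} placeholder included

    with open("data.json") as f:
        dataset = json.load(f)

    sample = random.sample(dataset, min(SAMPLE_SIZE, len(dataset)))
    correct = 0
    for example in sample:
        prompt = prompt_template.format(input_text=example["input"])
        prediction = call_task_model(prompt)  # helper sketched in step 2 above
        if prediction.strip() == str(example["expected_output"]).strip():
            correct += 1

    accuracy = correct / len(sample)
    # Metric names are illustrative; adapt them to whatever your config expects.
    return {"accuracy": accuracy, "combined_score": accuracy}
```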
### 3. Write Your Initial Prompt

Create a basic starting prompt in `initial_prompt.txt`:

```
# EVOLVE-BLOCK-START
Your task prompt using {input_text} as a placeholder.
# EVOLVE-BLOCK-END
```

This is the part OpenEvolve will improve over time.
It helps to name your task in the `initial_prompt.txt` header so the model understands the context.
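For reference, the `{input_text}` placeholder is filled with Python's `str.format`, the same interface described in the config's system message; a quick illustration (the sample sentence is made up):

```python
# Quick illustration of the placeholder interface; the sample sentence is made up.
with open("initial_prompt.txt") as f:
    prompt_template = f.read()

prompt = prompt_template.format(input_text="The battery died after one day.")
print(prompt)  # the prompt text with the sentence substituted for {input_text}
```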
---
## ⚙️ Key Config Options (`config.yaml`)

```yaml
max_iterations: 15  # evolution budget (top-level setting, as in this example's config.yaml)

llm:
  primary_model: "gpt-4o"     # or your preferred model
  secondary_model: "gpt-3.5"  # optional for diversity
  temperature: 0.9
  max_tokens: 2048

database:
  population_size: 40
  elite_selection_ratio: 0.25

evaluator:
  timeout: 45
  parallel_evaluations: 3
  use_llm_feedback: true
```

---
## 📈 Example Output

OpenEvolve evolves prompts like this:

**Initial Prompt:**

```
Please analyze the sentiment of the following sentence and provide a sentiment score:

"{input_text}"

Rate the sentiment on a scale from 0.0 to 10.0.

Score:
```

**Evolved Prompt:**

```
Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines:
- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair)
- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content)
- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope)

"{input_text}"

Rate the sentiment on a scale from 0.0 to 10.0:
- 0.0-2.9: Strongly negative (e.g., "This product is terrible")
- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today")
- 7.0-10.0: Strongly positive (e.g., "This is amazing!")

Provide only the numeric score (e.g., "8.5") without any additional text:

Score:
```

**Result**: Improved accuracy and output consistency.

---
## 🔍 Where to Use This

OpenEvolve can be adapted to many tasks:

* **Text Classification**: Spam detection, intent recognition
* **Content Generation**: Social media posts, product descriptions
* **Question Answering & Summarization**
* **Code Tasks**: Review, generation, completion
* **Structured Output**: JSON, table filling, data extraction

---
## ✅ Best Practices

* Start with a basic but relevant prompt
* Use good-quality data and clear evaluation metrics
* Run multiple evolutions for better results
* Validate on held-out data before deployment

A short sketch of carving out a held-out split is shown below.
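It assumes the `data.json` format shown earlier; the split file names are made up for illustration.

```python
# Hedged sketch: carve a held-out set from data.json before running the evolution.
# The output file names are illustrative, not part of this example.
import json
import random

with open("data.json") as f:
    dataset = json.load(f)

random.seed(42)  # reproducible split
random.shuffle(dataset)
cut = int(0.8 * len(dataset))
evolution_set, held_out = dataset[:cut], dataset[cut:]

with open("data_evolution.json", "w") as f:
    json.dump(evolution_set, f, indent=2)
with open("data_heldout.json", "w") as f:
    json.dump(held_out, f, indent=2)
```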
---
**Ready to discover better prompts?**
Use this template to evolve prompts for any LLM task—automatically.
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
"""Sentiment analysis prompt example for OpenEvolve"""

# EVOLVE-BLOCK-START
Please analyze the sentiment of the following sentence and provide a sentiment score using the following guidelines:
- 0.0-2.9: Strongly negative sentiment (e.g., expresses anger, sadness, or despair)
- 3.0-6.9: Neutral or mixed sentiment (e.g., factual statements, ambiguous content)
- 7.0-10.0: Strongly positive sentiment (e.g., expresses joy, satisfaction, or hope)

"{input_text}"

Rate the sentiment on a scale from 0.0 to 10.0:
- 0.0-2.9: Strongly negative (e.g., "This product is terrible")
- 3.0-6.9: Neutral/mixed (e.g., "The sky is blue today")
- 7.0-10.0: Strongly positive (e.g., "This is amazing!")

Provide only the numeric score (e.g., "8.5") without any additional text:

Score:
# EVOLVE-BLOCK-END
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Configuration for prompt optimization
max_iterations: 30
checkpoint_interval: 10
log_level: "INFO"

# LLM configuration
llm:
  primary_model: "qwen3-32b-fp8"
  api_base: "http://localhost:1234/v1"
  api_key: "your_api_key_here"
  temperature: 0.9
  top_p: 0.95
  max_tokens: 2048

# Prompt configuration
prompt:
  system_message: |
    You are an expert prompt engineer. Your task is to revise an existing prompt designed for large language models (LLMs), without being explicitly told what the task is.

    Your improvements should:

    * Infer the intended task and expected output format based on the structure and language of the original prompt.
    * Clarify vague instructions, eliminate ambiguity, and improve overall interpretability for the LLM.
    * Strengthen alignment between the prompt and the desired task outcome, ensuring more consistent and accurate responses.
    * Improve robustness against edge cases or unclear input phrasing.
    * If helpful, include formatting instructions, boundary conditions, or illustrative examples that reinforce the LLM's expected behavior.
    * Avoid adding unnecessary verbosity or assumptions not grounded in the original prompt.

    You will receive a prompt that uses the following structure:

    ```python
    prompt.format(input_text=some_text)
    ```

    The revised prompt should maintain the same input interface but be more effective, reliable, and production-ready for LLM use.

    Return only the improved prompt text. Do not include explanations or additional comments. Your output should be a clean, high-quality replacement that enhances clarity, consistency, and LLM performance.

  num_top_programs: 8
  use_template_stochasticity: true

# Database configuration
database:
  population_size: 40
  archive_size: 20
  num_islands: 3
  elite_selection_ratio: 0.25
  exploitation_ratio: 0.65

# Evaluator configuration
evaluator:
  timeout: 45
  use_llm_feedback: true

# Evolution settings
diff_based_evolution: true
allow_full_rewrites: true
diversity_threshold: 0.1
