Commit 1ed8867

Add web scraper evolution example using optillm

Introduces a new example in examples/web_scraper_optillm demonstrating web scraper evolution with optillm and OpenEvolve. Includes a detailed README, configuration for optillm with readurls and Mixture of Agents, an evaluator for robust function extraction, an initial BeautifulSoup-based scraper, and required dependencies.

1 parent 42acdc3 commit 1ed8867
5 files changed, +843 −0
**README.md** (247 additions)
# Web Scraper Evolution with optillm

This example demonstrates how to use [optillm](https://github.com/codelion/optillm) with OpenEvolve to leverage test-time compute techniques for improved code evolution accuracy. We'll evolve a web scraper that extracts structured data from documentation pages, showcasing two key optillm features:

1. **readurls plugin**: Automatically fetches webpage content when URLs are mentioned in prompts
2. **Inference optimization**: Uses techniques like Mixture of Agents (MoA) to improve response accuracy
## Why optillm?

Traditional LLM usage in code evolution has limitations:

- LLMs may not have knowledge of the latest library documentation
- Single LLM calls can produce inconsistent or incorrect code
- There is no way to dynamically fetch relevant documentation during evolution

optillm addresses these problems with:

- **Dynamic documentation fetching**: The readurls plugin automatically fetches and includes webpage content when URLs are detected in prompts
- **Test-time compute**: Techniques like MoA generate multiple responses and synthesize the best solution
- **Flexible routing**: Requests can be routed to different models based on requirements
## Problem Description

We're evolving a web scraper that extracts API documentation from Python library documentation pages. The scraper needs to:

1. Parse HTML documentation pages
2. Extract function signatures, descriptions, and parameters
3. Structure the data in a consistent format
4. Handle various documentation formats

This is an ideal problem for optillm because:

- The LLM benefits from seeing actual documentation HTML structure
- Accuracy is crucial for correct parsing
- Different documentation sites have different formats
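For concreteness, the structured output for one documented function might look like the record below. This is an illustrative sketch only; the actual schema is defined by this example's evaluator, and the field names here are assumptions:

```python
# Illustrative (assumed) record format for one extracted function.
# The real schema is whatever the example's evaluator expects.
def make_function_record(name, signature, description, parameters):
    """Bundle one documented function into a plain dict."""
    return {
        "name": name,                # e.g. "json.dumps"
        "signature": signature,      # full signature string from the docs
        "description": description,  # first paragraph of the doc entry
        "parameters": parameters,    # list of {"name": ..., "description": ...}
    }

record = make_function_record(
    "json.dumps",
    "json.dumps(obj, *, skipkeys=False)",
    "Serialize obj to a JSON formatted str.",
    [{"name": "obj", "description": "Object to serialize."}],
)
```

Keeping the output a plain dict of strings and lists makes it easy for the evaluator to compare against expected values.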
## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   OpenEvolve    │────▶│     optillm     │────▶│    Local LLM    │
│                 │     │  (proxy:8000)   │     │  (Qwen3-0.6B)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                               │
                               ├── readurls plugin
                               │   (fetches web content)
                               │
                               └── MoA optimization
                                   (improves accuracy)
```
## Setup Instructions

### 1. Install and Configure optillm

```bash
# Clone optillm
git clone https://github.com/codelion/optillm.git
cd optillm

# Install dependencies
pip install -r requirements.txt

# Start the optillm proxy with its local inference server (in a separate terminal)
export OPTILLM_API_KEY=optillm
python optillm.py --port 8000
```
optillm will now be running on `http://localhost:8000` with its built-in local inference server.

**Note for non-Mac users**: This example uses `Qwen/Qwen3-0.6B-MLX-bf16`, which is optimized for Apple Silicon (M1/M2/M3 chips). If you're not using a Mac, you should:

1. **For NVIDIA GPUs**: Use a CUDA-compatible model like:
   - `Qwen/Qwen2.5-32B-Instruct` (best quality, high VRAM)
   - `Qwen/Qwen2.5-14B-Instruct` (good balance)
   - `meta-llama/Llama-3.1-8B-Instruct` (efficient option)
   - `Qwen/Qwen2.5-7B-Instruct` (lower VRAM)

2. **For CPU-only**: Use a smaller model like:
   - `Qwen/Qwen2.5-7B-Instruct` (7B parameters)
   - `meta-llama/Llama-3.2-3B-Instruct` (3B parameters)
   - `Qwen/Qwen2.5-3B-Instruct` (3B parameters)

3. **Update the config**: Replace the model names in `config.yaml` with your chosen model:
   ```yaml
   models:
     - name: "readurls-your-chosen-model"
       weight: 0.6
     - name: "moa&readurls-your-chosen-model"
       weight: 0.4
   ```
### 2. Install Web Scraping Dependencies

```bash
# Install required Python packages for the example
pip install -r examples/web_scraper_optillm/requirements.txt
```
### 3. Run the Evolution

```bash
# From the openevolve root directory
export OPENAI_API_KEY=optillm
python openevolve-run.py examples/web_scraper_optillm/initial_program.py \
  examples/web_scraper_optillm/evaluator.py \
  --config examples/web_scraper_optillm/config.yaml \
  --iterations 100
```

The configuration demonstrates both optillm capabilities:

- **Primary model (90%)**: `readurls-Qwen/Qwen3-0.6B-MLX-bf16` - fetches URLs mentioned in prompts
- **Secondary model (10%)**: `moa&readurls-Qwen/Qwen3-0.6B-MLX-bf16` - uses Mixture of Agents for improved accuracy
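Under the hood, OpenEvolve talks to optillm through the standard OpenAI-compatible chat completions API; the plugin and technique names are simply encoded in the model string. A minimal sketch of such a request payload (field names follow the chat completions API; the prompt text is illustrative):

```python
# Sketch of an OpenAI-compatible chat completions payload routed through optillm.
# The "readurls-" prefix tells optillm to run the readurls plugin before inference;
# "moa&readurls-" would additionally apply Mixture of Agents.
payload = {
    "model": "readurls-Qwen/Qwen3-0.6B-MLX-bf16",
    "messages": [
        {"role": "system", "content": "You are an expert Python developer."},
        {
            "role": "user",
            "content": "Improve the scraper. See https://docs.python.org/3/library/json.html",
        },
    ],
    "temperature": 0.6,   # matches this example's config.yaml
    "max_tokens": 16000,
}
```

Any OpenAI-compatible client (for example, the `openai` Python package pointed at `base_url="http://localhost:8000/v1"` with `api_key="optillm"`) can send this payload.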
## How It Works

### 1. readurls Plugin

When the evolution prompt contains URLs (e.g., "Parse the documentation at https://docs.python.org/3/library/json.html"), the readurls plugin:

1. Detects the URL in the prompt
2. Fetches the webpage content
3. Extracts text and table data
4. Appends it to the prompt as context

This ensures the LLM has access to the latest documentation structure when generating code.
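The detection step can be sketched with a simple regex. This is an illustration of the idea only, not optillm's actual implementation:

```python
import re

# Rough URL matcher: scheme, then any run of characters that are not
# whitespace, quotes, or a closing parenthesis. Illustrative only.
URL_RE = re.compile(r"https?://[^\s\"')]+")

def detect_urls(prompt: str) -> list[str]:
    """Return all URLs mentioned in a prompt, in order of appearance."""
    return URL_RE.findall(prompt)

urls = detect_urls(
    "Parse the documentation at https://docs.python.org/3/library/json.html"
)
# A readurls-style plugin would then fetch each URL and append the page text
# to the prompt before the request reaches the model.
```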
### 2. Mixture of Agents (MoA)

The MoA technique improves accuracy by:

1. Generating 3 different solutions to the problem
2. Having each "agent" critique all solutions
3. Synthesizing a final, improved solution based on the critiques

This is particularly valuable for complex parsing logic where multiple approaches might be valid.
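The three steps can be sketched as a generate, critique, and synthesize loop. This is a toy illustration with a pluggable `llm` callable, not optillm's implementation:

```python
def mixture_of_agents(llm, prompt, n_candidates=3):
    """Toy MoA loop: generate candidates, critique them, synthesize an answer.

    `llm` is any callable mapping a prompt string to a response string.
    """
    # 1. Generate several independent candidate solutions.
    candidates = [
        llm(f"Solve:\n{prompt}\n(attempt {i + 1})") for i in range(n_candidates)
    ]

    # 2. Each "agent" critiques the full set of candidates.
    joined = "\n---\n".join(candidates)
    critiques = [
        llm(f"Critique these solutions:\n{joined}") for _ in range(n_candidates)
    ]

    # 3. Synthesize a final answer from the candidates plus critiques.
    return llm(
        f"Given solutions:\n{joined}\n"
        f"and critiques:\n" + "\n".join(critiques) + "\nWrite the best final solution."
    )
```

The extra calls are why MoA costs more latency per request; the trade-off is discussed in the Troubleshooting section of this README.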
### 3. Evolution Process

1. **Initial Program**: A basic BeautifulSoup scraper that extracts simple text
2. **Evaluator**: Tests the scraper against real documentation pages, checking:
   - Correct extraction of function names
   - Accurate parameter parsing
   - Proper handling of edge cases
3. **Evolution**: The LLM improves the scraper by:
   - Fetching actual documentation HTML (via readurls)
   - Generating multiple parsing strategies (via MoA)
   - Learning from evaluation feedback
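At its core, an evaluator of this kind compares the scraper's extracted function names against a known-good set. A minimal sketch of such a scoring function (the real `evaluator.py` is more thorough; the names here are illustrative):

```python
def score_extraction(extracted_names, expected_names):
    """F1-style score: how well do extracted names match the expected set?"""
    extracted, expected = set(extracted_names), set(expected_names)
    if not extracted or not expected:
        return 0.0
    overlap = len(extracted & expected)
    precision = overlap / len(extracted)
    recall = overlap / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two of three extracted names are correct, and one expected name is missed.
score = score_extraction(
    ["json.dumps", "json.loads", "bogus"],
    ["json.dumps", "json.loads", "json.dump"],
)
```

An F1-style score rewards both completeness and precision, so the evolved scraper can't game the metric by extracting everything or by extracting only the easiest entries.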
## Example Evolution Trajectory

**Generation 1** (basic scraper):
```python
# Simple text extraction
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
```

**Generation 10** (with readurls context):
```python
# Targets specific documentation structures
functions = soup.find_all('dl', class_='function')
for func in functions:
    name = func.find('dt').get('id')
    desc = func.find('dd').text
```

**Generation 50** (with MoA refinement):
```python
# Robust parsing with error handling
def extract_function_docs(soup):
    # Multiple strategies for different doc formats
    strategies = [
        lambda: soup.select('dl.function dt'),
        lambda: soup.select('.sig-name'),
        lambda: soup.find_all('code', class_='descname')
    ]

    for strategy in strategies:
        try:
            results = strategy()
            if results:
                # parse_results is a helper defined elsewhere in the evolved program
                return parse_results(results)
        except Exception:
            continue
```
182+
## Monitoring Progress
183+
184+
Watch the evolution progress and see how optillm enhances the process:
185+
186+
```bash
187+
# View optillm logs (in the terminal running optillm)
188+
# You'll see:
189+
# - URLs being fetched by readurls
190+
# - Multiple completions generated by MoA
191+
# - Final synthesized responses
192+
193+
# View OpenEvolve logs
194+
tail -f examples/web_scraper_optillm/openevolve_output/evolution.log
195+
```
196+
197+
## Results

After evolution, you should see:

1. **Improved accuracy**: The scraper correctly handles various documentation formats
2. **Better error handling**: Robust parsing that doesn't break on edge cases
3. **Optimized performance**: Efficient extraction strategies

Compare the checkpoints to see the evolution:

```bash
# Initial vs evolved program
diff examples/web_scraper_optillm/openevolve_output/checkpoints/checkpoint_10/best_program.py \
     examples/web_scraper_optillm/openevolve_output/checkpoints/checkpoint_100/best_program.py
```
## Key Insights

1. **Documentation access matters**: The readurls plugin significantly improves the LLM's ability to generate correct parsing code by providing actual HTML structure

2. **Test-time compute works**: MoA's multiple-generation-and-critique approach produces more robust solutions than single-shot generation

3. **Local models benefit**: Even small local models like the Qwen3-0.6B used here improve markedly when enhanced with optillm techniques, and larger quantized models (e.g. Qwen2.5-32B with 4-bit quantization) can provide excellent results while staying memory efficient
## Customization

You can experiment with different optillm features by modifying `config.yaml`:

1. **Different plugins**: Try the `executecode` plugin for runtime validation
2. **Other techniques**: Experiment with `cot_reflection`, `rstar`, or `bon`
3. **Model combinations**: Adjust weights or try different technique combinations

Example custom configuration:

```yaml
llm:
  models:
    - name: "cot_reflection&readurls-Qwen/Qwen3-0.6B-MLX-bf16"
      weight: 0.7
    - name: "moa&executecode-Qwen/Qwen3-0.6B-MLX-bf16"
      weight: 0.3
```
## Troubleshooting

1. **optillm not responding**: Ensure it's running on port 8000 with `OPTILLM_API_KEY=optillm`
2. **Model not found**: Make sure optillm's local inference server is working (check the optillm logs)
3. **Slow evolution**: MoA generates multiple completions, so it is slower but more accurate
## Further Reading

- [optillm Documentation](https://github.com/codelion/optillm)
- [OpenEvolve Configuration Guide](../../configs/default_config.yaml)
- [Mixture of Agents Paper](https://arxiv.org/abs/2406.04692)
**examples/web_scraper_optillm/config.yaml** (77 additions)
```yaml
# optillm configuration demonstrating readurls plugin and Mixture of Agents (MoA)
# This config shows both capabilities in a single configuration

# Evolution settings
max_iterations: 100
checkpoint_interval: 10
parallel_evaluations: 1

# LLM configuration - using optillm proxy with different techniques
llm:
  # Point to optillm proxy instead of direct LLM
  api_base: "http://localhost:8000/v1"

  # Demonstrate both optillm capabilities in one config
  models:
    # Primary model: readurls plugin for URL fetching
    - name: "readurls-Qwen/Qwen3-1.7B-MLX-bf16"
      weight: 0.9

    # Secondary model: MoA + readurls for improved accuracy
    - name: "moa&readurls-Qwen/Qwen3-1.7B-MLX-bf16"
      weight: 0.1

  # Generation settings optimized for both techniques
  temperature: 0.6
  max_tokens: 16000  # Higher for MoA's multiple generations and critiques
  top_p: 0.95

  # Request parameters optimized for local models
  timeout: 600  # Extended timeout for local model generation (10 minutes)
  retries: 3
  retry_delay: 5

# Database configuration
database:
  population_size: 50
  num_islands: 3
  migration_interval: 10
  feature_dimensions:
    - "score"
    - "complexity"

# Evaluation settings
evaluator:
  timeout: 300  # Extended timeout for local model evaluation (5 minutes)
  max_retries: 3

# Prompt configuration
prompt:
  # Enhanced system message that leverages both readurls and MoA
  system_message: |
    You are an expert Python developer tasked with evolving a web scraper for API documentation.

    Your goal is to improve the scraper's ability to extract function signatures, parameters, and descriptions
    from HTML documentation pages. The scraper should be robust and handle various documentation formats.

    Key considerations:
    1. Parse HTML efficiently using BeautifulSoup
    2. Extract function names, signatures, and descriptions accurately
    3. Handle different documentation structures (Python docs, library docs, etc.)
    4. Provide meaningful error handling
    5. Return structured data in the expected format

    When analyzing documentation structures, refer to actual documentation pages like:
    - https://docs.python.org/3/library/json.html
    - https://requests.readthedocs.io/en/latest/api/
    - https://www.crummy.com/software/BeautifulSoup/bs4/doc/

    Focus on improving the EVOLVE-BLOCK sections to make the scraper more accurate and robust.
    Consider multiple parsing strategies and implement the most effective approach.

  # Include more examples for better context
  num_top_programs: 3
  num_diverse_programs: 2

# General settings
log_level: "INFO"
```
