Commit 4b099e3

Merge pull request #49 from jvm123/feat-lm-eval
Feature: Benchmarks with EleutherAI lm-evaluation-harness
2 parents 3fc9465 + e93890e commit 4b099e3

18 files changed: +452 -10 lines

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -1,3 +1,7 @@
+results/
+examples/lm_eval/prompts/system_message.txt
+examples/lm_eval/prompts/evaluator_system_message.txt
+
 # Python
 __pycache__/
 *.py[cod]
@@ -48,4 +52,4 @@ htmlcov/
 
 # For SR
 secrets.yaml
-problems
+problems

Makefile

Lines changed: 6 additions & 1 deletion
@@ -48,4 +48,9 @@ docker-build:
 # Run the Docker container with the example
 .PHONY: docker-run
 docker-run:
-	docker run --rm -v $(PROJECT_DIR):/app $(DOCKER_IMAGE) examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
+	docker run --rm -v $(PROJECT_DIR):/app --network="host" $(DOCKER_IMAGE) examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
+
+# Run the lm-eval benchmark
+.PHONY: lm-eval
+lm-eval:
+	$(PYTHON) scripts/lm_eval/lm-eval.py
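
The new target simply invokes the benchmark script through `$(PYTHON)`. The adapter added under `examples/lm_eval/` in this commit can also be called directly with explicit options, mirroring the usage shown in `examples/lm_eval/README.md` below (a minimal smoke-test invocation, not part of the commit):

```bash
make lm-eval

# or call the adapter directly, e.g. a small smoke test on gsm8k:
python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 3
```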

README.md

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ cat checkpoints/checkpoint_*/best_program_info.json | grep -A 10 metrics
 You can also install and execute via Docker:
 ```bash
 docker build -t openevolve .
-docker run --rm -v $(pwd):/app openevolve examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
+docker run --rm -v $(pwd):/app --network="host" openevolve examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
 ```
 
 ## Configuration
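
A plausible reason for the added `--network="host"` flag (an assumption, not stated in the diff) is to let the container reach an LLM endpoint running on the host, such as the local Ollama server (`http://localhost:11434/v1/`) configured in `examples/lm_eval/config.yml`. A rough sketch of that setup, assuming Ollama is installed on the host; the model tag comes from the example config:

```bash
# Pull the model referenced in examples/lm_eval/config.yml and start the server
# (Ollama listens on localhost:11434 by default).
ollama pull gemma3:12b-it-qat
ollama serve &

# With host networking, the container can reach the host's localhost:11434.
docker run --rm -v $(pwd):/app --network="host" openevolve \
  examples/function_minimization/initial_program.py \
  examples/function_minimization/evaluator.py \
  --config examples/function_minimization/config.yaml --iterations 1000
```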

examples/lm_eval/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# lm-eval.py

`lm-eval.py` provides a basic benchmark capability for LLM feedback-based evolutionary task solving. The benchmark framework is [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

*Limitation:* Only generation-only tasks such as gsm8k are supported, because tasks that require loglikelihood probabilities do not map well onto agent-based solvers.

## Usage

```bash
$ python3 examples/lm_eval/lm-eval.py -h
usage: lm-eval.py [-h] [--config CONFIG] [--init_file INIT_FILE] [--evaluator_file EVALUATOR_FILE] [--iterations ITERATIONS] [--limit LIMIT] [--tasks TASKS]
                  [--output_path OUTPUT_PATH]

OpenEvolve <-> lm-evaluation-harness adapter.

options:
  -h, --help            show this help message and exit
  --config CONFIG       config file
  --init_file INIT_FILE
                        initial content file
  --evaluator_file EVALUATOR_FILE
                        evaluator file
  --iterations ITERATIONS
                        number of iterations
  --limit LIMIT         limit the number of examples per task that are executed
  --tasks TASKS         list of tasks to evaluate
  --output_path OUTPUT_PATH
                        output path for results
```

Early runs **were meant to** show that more evolution iterations improve task performance; I suspect the prompting is not ideal yet:
```
$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 1
[..]
Headline metrics:
  gsm8k exact_match,strict-match 80.000%
[..]

$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 3
[..]
Headline metrics:
  gsm8k exact_match,strict-match 90.000%
[..]

$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 10
[..]
Headline metrics:
  gsm8k exact_match,strict-match 80.000%
[..]

$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 15
[..]
Headline metrics:
  gsm8k exact_match,strict-match 70.000%
[..]
```

## Warning

- Be aware that this is an early implementation; no extensive benchmarks have been run so far. With a limit of 10 examples per task and around 10 iterations, the benchmark is meaningless as it stands.
- Use the `--limit` parameter only for tests, not for metric generation.
- Do not cite the metrics produced by this script without reviewing the generated solutions first.

## References

```bibtex
@misc{eval-harness,
  author    = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title     = {The Language Model Evaluation Harness},
  month     = 07,
  year      = 2024,
  publisher = {Zenodo},
  version   = {v0.4.3},
  doi       = {10.5281/zenodo.12608602},
  url       = {https://zenodo.org/records/12608602}
}
```
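
For reference, the script (shown further below) saves the full harness output as JSON under `--output_path`. A minimal sketch of pulling the headline metric back out of such a file, mirroring the summary loop in `lm-eval.py`; the file name here is only illustrative:

```python
import json

# Illustrative path; actual files are named <output_path>/<timestamp>_iter<N>.json
with open("results/20250101_120000_iter3.json") as f:
    results = json.load(f)

for task, metrics in results["results"].items():
    # take the first numeric entry as the headline metric, like lm-eval.py does
    for name, value in metrics.items():
        if isinstance(value, (int, float)):
            print(f"{task:<15} {name:<12} {value:.3%}")
            break
```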

examples/lm_eval/config.yml

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
max_iterations: 1
checkpoint_interval: 10
log_level: "INFO"

# LLM configuration
llm:
  primary_model: "gemma3:12b-it-qat"
  #primary_model: "gpt-4o"
  primary_model_weight: 0.8
  secondary_model: "gemma3:12b-it-qat"
  #secondary_model: "gpt-4.1"
  secondary_model_weight: 0.2
  # api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
  # api_base: "https://api.openai.com/v1/"
  api_base: "http://localhost:11434/v1/"
  api_key: "ollama"
  temperature: 0.7
  top_p: 0.95
  max_tokens: 4096

# Prompt configuration
prompt:
  num_top_programs: 3
  use_template_stochasticity: true
  # System prompt is created dynamically during the benchmark in file system_message.txt!
  template_dir: "examples/lm_eval/prompts"

# Database configuration
database:
  population_size: 50
  archive_size: 20
  num_islands: 3
  elite_selection_ratio: 0.2
  exploitation_ratio: 0.7

# Evaluator configuration
evaluator:
  timeout: 60
  cascade_evaluation: false
  cascade_thresholds: [0.5, 0.75]
  parallel_evaluations: 4
  use_llm_feedback: true
  llm_feedback_weight: 1.0


# Evolution settings
diff_based_evolution: false
allow_full_rewrites: true
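
The commented-out lines above hint at how a hosted model could be swapped in for the local Ollama endpoint. A sketch of that variant (the key handling is an assumption; never commit a real key):

```yaml
llm:
  primary_model: "gpt-4o"
  primary_model_weight: 0.8
  secondary_model: "gpt-4.1"
  secondary_model_weight: 0.2
  api_base: "https://api.openai.com/v1/"
  api_key: "sk-..."        # placeholder; supply your own key
  temperature: 0.7
  top_p: 0.95
  max_tokens: 4096
```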

examples/lm_eval/evaluator_stub.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
def evaluate_stage1(file_path):
    return {"not_implemented": 0.0}


def evaluate(file_path):
    return evaluate_stage1(file_path)
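
The stub returns only a placeholder score, so with `use_llm_feedback: true` in `config.yml` the actual scoring presumably comes from LLM feedback alone. A custom evaluator with the same interface could add a simple programmatic signal, for example (illustrative sketch only; the metric name is made up):

```python
def evaluate_stage1(file_path):
    # Illustrative: reward answers that are non-empty and no longer contain the stub text.
    with open(file_path) as f:
        text = f.read().strip()
    has_answer = bool(text) and "insert the answer" not in text.lower()
    return {"has_answer": 1.0 if has_answer else 0.0}


def evaluate(file_path):
    return evaluate_stage1(file_path)
```
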
examples/lm_eval/initial_content_stub.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
insert the answer to the task here!

examples/lm_eval/lm-eval.py

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
"""
OpenEvolve <-> lm-evaluation-harness adapter

Implements generation only, no loglikelihood. Tasks such as GSM8K / BoolQ / MMLU-Math /
AQUA-RAT and most code suites should work fine because they grade on the generated
answer string.
"""

from __future__ import annotations
import subprocess, tempfile, json, os, argparse, math, pathlib
from pathlib import Path
from typing import List, Dict, Tuple, Any, Iterable

import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.evaluator import evaluate
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from datetime import datetime

# cd to the parent parent directory of this file
os.chdir(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

PIPELINE_CMD = ["python3", "openevolve-run.py"]


@register_model("openevolve")
class OpenEvolve(LM):
    def __init__(
        self,
        init_file: str = "initial_content_stub.txt",
        evaluator_file: str = "evaluator_stub.py",
        config_file: str = "config.yml",
        iterations: int = 5,
        extra_param: List[str] = [],
        **kwargs,
    ):
        super().__init__()
        self.init_file = init_file
        self.evaluator_file = evaluator_file
        self.iterations = iterations
        self.extra_param = extra_param
        self.config_file = config_file

        # folder must match prompt:template_dir in config.yml!
        self.prompt_path = "examples/lm_eval/prompts/system_message.txt"
        self.evaluator_prompt_path = "examples/lm_eval/prompts/evaluator_system_message.txt"
        self.best_path = "examples/lm_eval/openevolve_output/best/best_program.txt"
        self.base_system_message = "You are an expert task solver, with a lot of commonsense, math, language and coding knowledge.\n\nConsider this task:\n```{prompt}´´´"

    def generate(self, prompts: List[str], max_gen_toks: int = None, stop=None, **kwargs):
        outs = []
        for prompt in prompts:
            # Task prompt becomes the system message. User prompt is the evolutionary logic.
            # We create temporary prompt files with the system message
            with Path(self.prompt_path).open("w") as f:
                f.write(self.base_system_message.format(prompt=prompt))

            with Path(self.evaluator_prompt_path).open("w") as f:
                f.write(self.base_system_message.format(prompt=prompt))

            cmd = (
                PIPELINE_CMD
                + ["--config", self.config_file]
                + ["--iterations", str(self.iterations)]
                + self.extra_param
                + [self.init_file, self.evaluator_file]
            )
            print(f"Running command: {' '.join(cmd)}")
            try:
                res = subprocess.run(cmd, capture_output=True, text=True, check=True)
                text = res.stdout.strip()
                print(f"Process output: {text}")
            except subprocess.CalledProcessError as e:
                print(f"Command failed with return code {e.returncode}")
                print(f"stderr: {e.stderr}")
                text = ""

            print(f"# Prompt: {prompt}")
            with Path(self.best_path).open("r") as f:
                best = f.read().strip()
            print(f"# Answer: {best}")

            # honour stop tokens
            if stop:
                for s in stop:
                    idx = best.find(s)
                    if idx != -1:
                        best = best[:idx]
                        break
            outs.append(best)
        return outs

    # for tasks that ask for log likelihood, indicate that it is unsupported
    def loglikelihood(self, requests: Iterable[Tuple[str, str]], **kw):
        # return [(-math.inf, False) for _ in requests]
        raise NotImplementedError

    def loglikelihood_rolling(self, requests: Iterable[str], **kw):
        # return [(-math.inf, False) for _ in requests]
        raise NotImplementedError

    def generate_until(self, requests: Iterable[Any], **kw) -> List[str]:
        ctxs, stops = [], []

        for req in requests:
            # ---------------- old: plain tuple ----------------
            if isinstance(req, tuple):
                ctx, until = req

            # -------------- new: Instance object --------------
            else:
                ctx = req.args[0]  # first positional arg
                until = []
                # if a second positional arg exists and is list-like,
                # treat it as the stop sequence
                if len(req.args) > 1 and isinstance(req.args[1], (list, tuple)):
                    until = list(req.args[1])

            ctxs.append(ctx)
            stops.append(until)

        # 2) run your real generator once per context
        gens = self.generate(ctxs, stop=None)

        # 3) post-trim at the first stop sequence
        cleaned = []
        for g, until in zip(gens, stops):
            for s in until:
                idx = g.find(s)
                if idx != -1:
                    g = g[:idx]
                    break
            cleaned.append(g)
        return cleaned


if __name__ == "__main__":
    # cli arguments for primary model, secondary model, iterations, config and tasks
    p = argparse.ArgumentParser(
        description="OpenEvolve <-> lm-evaluation-harness adapter.",
    )
    p.add_argument("--config", default="examples/lm_eval/config.yml", help="config file")
    p.add_argument(
        "--init_file",
        default="examples/lm_eval/initial_content_stub.txt",
        help="initial content file",
    )
    p.add_argument(
        "--evaluator_file", default="examples/lm_eval/evaluator_stub.py", help="evaluator file"
    )
    p.add_argument("--iterations", default=5, type=int, help="number of iterations")
    p.add_argument(
        "--limit",
        default=None,
        type=int,
        help="limit the number of examples per task that are executed",
    )
    # p.add_argument("--tasks", default="boolq,gsm8k,mmlu", help="comma-list of tasks to evaluate")
    p.add_argument("--tasks", default="gsm8k", help="list of tasks to evaluate")
    p.add_argument("--output_path", default="results", help="output path for results")
    args = p.parse_args()

    lm_obj = OpenEvolve(
        init_file=args.init_file,
        evaluator_file=args.evaluator_file,
        iterations=args.iterations,
        config_file=args.config,
    )

    task_dict = lm_eval.tasks.get_task_dict(args.tasks.split(","))

    results = evaluate(
        lm=lm_obj,
        task_dict=task_dict,
        limit=args.limit,
    )

    # write out the results
    pathlib.Path(
        args.output_path,
    ).mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    results_path = pathlib.Path(
        os.path.join(
            args.output_path,
            f"{timestamp}_iter{args.iterations}.json",
        )
    )

    with results_path.open("w") as f:
        json.dump(results, f, indent=2)

    # print result summary
    short = {}
    for task, metrics in results["results"].items():
        # pick the first value that is a real number
        for key, val in metrics.items():
            if isinstance(val, (int, float)):
                short[task] = (key, val)  # store *both* name & value
                break

    print(f"Full results written to {results_path}\n")
    print("Headline metrics:")
    for task, (name, value) in short.items():
        print(f"  {task:<15} {name:<12} {value:.3%}")

    print("\nNote: Never cite the overall average when some components were skipped!")
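
To make the control flow concrete, a single harness request roughly boils down to the following shell steps, assembled from the defaults above (`<task prompt>` is a placeholder and the exact system-message text is abbreviated; this sketch is not part of the commit):

```bash
# 1) The task prompt is written into the system-message templates that
#    prompt.template_dir in config.yml points at.
printf 'You are an expert task solver, ...\n\nConsider this task:\n<task prompt>' \
  > examples/lm_eval/prompts/system_message.txt
cp examples/lm_eval/prompts/system_message.txt \
   examples/lm_eval/prompts/evaluator_system_message.txt

# 2) One evolution run is launched per request ...
python3 openevolve-run.py --config examples/lm_eval/config.yml --iterations 5 \
  examples/lm_eval/initial_content_stub.txt examples/lm_eval/evaluator_stub.py

# 3) ... and the best evolved program is read back as the model's "answer".
cat examples/lm_eval/openevolve_output/best/best_program.txt
```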
