6 changes: 5 additions & 1 deletion .gitignore
@@ -1,3 +1,7 @@
results/
examples/lm_eval/prompts/system_message.txt
examples/lm_eval/prompts/evaluator_system_message.txt

# Python
__pycache__/
*.py[cod]
@@ -48,4 +52,4 @@ htmlcov/

# For SR
secrets.yaml
problems
problems
7 changes: 6 additions & 1 deletion Makefile
@@ -48,4 +48,9 @@ docker-build:
# Run the Docker container with the example
.PHONY: docker-run
docker-run:
docker run --rm -v $(PROJECT_DIR):/app $(DOCKER_IMAGE) examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
docker run --rm -v $(PROJECT_DIR):/app --network="host" $(DOCKER_IMAGE) examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000

# Run the lm-eval benchmark
.PHONY: lm-eval
lm-eval:
$(PYTHON) examples/lm_eval/lm-eval.py
2 changes: 1 addition & 1 deletion README.md
@@ -133,7 +133,7 @@ cat checkpoints/checkpoint_*/best_program_info.json | grep -A 10 metrics
You can also install and execute via Docker:
```bash
docker build -t openevolve .
docker run --rm -v $(pwd):/app openevolve examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
docker run --rm -v $(pwd):/app --network="host" openevolve examples/function_minimization/initial_program.py examples/function_minimization/evaluator.py --config examples/function_minimization/config.yaml --iterations 1000
```

## Configuration
78 changes: 78 additions & 0 deletions examples/lm_eval/README.md
@@ -0,0 +1,78 @@
# lm-eval.py

`lm-eval.py` provides basic benchmark capability for LLM feedback-based evolutionary task solving. The benchmark framework is [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

*Limitation:* Only generation-only tasks such as gsm8k are supported, because tasks that require loglikelihood probabilities do not map well onto an agent-based approach.
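
For orientation, the adapter in `lm-eval.py` essentially behaves like the condensed sketch below (details and error handling omitted): each generation request triggers a full OpenEvolve run and the best evolved answer is returned, while loglikelihood requests are rejected.

```python
# Condensed sketch of the adapter defined in lm-eval.py (not the full implementation).
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("openevolve")
class OpenEvolve(LM):
    def generate_until(self, requests, **kw):
        # each request launches an OpenEvolve evolution; the best program/answer
        # found is returned as the model's generation
        ...

    def loglikelihood(self, requests, **kw):
        # loglikelihood-based scoring (e.g. multiple-choice tasks) is unsupported
        raise NotImplementedError
```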

## Usage

```bash
$ python3 examples/lm_eval/lm-eval.py -h
usage: lm-eval.py [-h] [--config CONFIG] [--init_file INIT_FILE] [--evaluator_file EVALUATOR_FILE] [--iterations ITERATIONS] [--limit LIMIT] [--tasks TASKS]
[--output_path OUTPUT_PATH]

OpenEvolve <-> lm-evaluation-harness adapter.

options:
-h, --help show this help message and exit
--config CONFIG config file
--init_file INIT_FILE
initial content file
--evaluator_file EVALUATOR_FILE
evaluator file
--iterations ITERATIONS
number of iterations
--limit LIMIT limit the number of examples per task that are executed
--tasks TASKS list of tasks to evaluate
--output_path OUTPUT_PATH
output path for results
```

Early examples that **were meant to** show that more evolution iterations improve task performance -- so far they do not, and I suspect the prompting is not yet ideal:
```
$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 1
[..]
Headline metrics:
gsm8k exact_match,strict-match 80.000%
[..]


$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 3
[..]
Headline metrics:
gsm8k exact_match,strict-match 90.000%
[..]

$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 10
[..]
Headline metrics:
gsm8k exact_match,strict-match 80.000%
[..]

$ python3 examples/lm_eval/lm-eval.py --tasks gsm8k --limit 10 --iterations 15
[..]
Headline metrics:
gsm8k exact_match,strict-match 70.000%
[..]
```

## Warning

- Be aware that this is an early implementation; no extensive benchmarks have been run so far. With a limit of 10 examples per task and only a handful of iterations, the numbers above are not meaningful as they stand.
- Use the `--limit` parameter only for quick tests, not for generating metrics.
- Do not cite the metrics produced by this script without reviewing the evolved solutions first.

## References

```bibtex
@misc{eval-harness,
author = {Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
title = {The Language Model Evaluation Harness},
month = 07,
year = 2024,
publisher = {Zenodo},
version = {v0.4.3},
doi = {10.5281/zenodo.12608602},
url = {https://zenodo.org/records/12608602}
}
```
48 changes: 48 additions & 0 deletions examples/lm_eval/config.yml
@@ -0,0 +1,48 @@
max_iterations: 1
checkpoint_interval: 10
log_level: "INFO"

# LLM configuration
llm:
  primary_model: "gemma3:12b-it-qat"
  #primary_model: "gpt-4o"
  primary_model_weight: 0.8
  secondary_model: "gemma3:12b-it-qat"
  #secondary_model: "gpt-4.1"
  secondary_model_weight: 0.2
  # api_base: "https://generativelanguage.googleapis.com/v1beta/openai/"
  # api_base: "https://api.openai.com/v1/"
  api_base: "http://localhost:11434/v1/"
  api_key: "ollama"
  temperature: 0.7
  top_p: 0.95
  max_tokens: 4096

# Prompt configuration
prompt:
  num_top_programs: 3
  use_template_stochasticity: true
  # System prompt is created dynamically during the benchmark in file system_message.txt!
  template_dir: "examples/lm_eval/prompts"

# Database configuration
database:
  population_size: 50
  archive_size: 20
  num_islands: 3
  elite_selection_ratio: 0.2
  exploitation_ratio: 0.7

# Evaluator configuration
evaluator:
  timeout: 60
  cascade_evaluation: false
  cascade_thresholds: [0.5, 0.75]
  parallel_evaluations: 4
  use_llm_feedback: true
  llm_feedback_weight: 1.0

# Evolution settings
diff_based_evolution: false
allow_full_rewrites: true
6 changes: 6 additions & 0 deletions examples/lm_eval/evaluator_stub.py
@@ -0,0 +1,6 @@
# Stub evaluator: it returns a fixed placeholder metric, so the scoring signal
# comes from LLM feedback (see use_llm_feedback in config.yml).
def evaluate_stage1(file_path):
    return {"not_implemented": 0.0}


def evaluate(file_path):
    return evaluate_stage1(file_path)
1 change: 1 addition & 0 deletions examples/lm_eval/initial_content_stub.txt
@@ -0,0 +1 @@
insert the answer to the task here!
209 changes: 209 additions & 0 deletions examples/lm_eval/lm-eval.py
@@ -0,0 +1,209 @@
"""
OpenEvolve <-> lm-evaluation-harness adapter

Implements generation only, no loglikelihood. Tasks such as GSM8K / BoolQ / MMLU-Math /
AQUA-RAT and most code suites should work fine because they grade on the generated
answer string.
"""

from __future__ import annotations
import subprocess, tempfile, json, os, argparse, math, pathlib
from pathlib import Path
from typing import List, Dict, Tuple, Any, Iterable

import lm_eval
from lm_eval.tasks import TaskManager
from lm_eval.evaluator import evaluate
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model
from datetime import datetime

# cd to the repository root (two levels above this file's directory)
os.chdir(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))

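# Each benchmark prompt is answered by launching a full OpenEvolve run with this command (see OpenEvolve.generate).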
PIPELINE_CMD = ["python3", "openevolve-run.py"]


@register_model("openevolve")
class OpenEvolve(LM):
    def __init__(
        self,
        init_file: str = "initial_content_stub.txt",
        evaluator_file: str = "evaluator_stub.py",
        config_file: str = "config.yml",
        iterations: int = 5,
        extra_param: List[str] | None = None,
        **kwargs,
    ):
        super().__init__()
        self.init_file = init_file
        self.evaluator_file = evaluator_file
        self.iterations = iterations
        self.extra_param = extra_param or []
        self.config_file = config_file

        # folder must match prompt:template_dir in config.yml!
        self.prompt_path = "examples/lm_eval/prompts/system_message.txt"
        self.evaluator_prompt_path = "examples/lm_eval/prompts/evaluator_system_message.txt"
        # OpenEvolve writes its best evolved answer here; generate() reads it back
        self.best_path = "examples/lm_eval/openevolve_output/best/best_program.txt"
        self.base_system_message = "You are an expert task solver, with a lot of common sense, math, language and coding knowledge.\n\nConsider this task:\n```{prompt}```"

    def generate(self, prompts: List[str], max_gen_toks: int | None = None, stop=None, **kwargs):
        outs = []
        for prompt in prompts:
            # Task prompt becomes the system message. User prompt is the evolutionary logic.
            # We create temporary prompt files with the system message
            with Path(self.prompt_path).open("w") as f:
                f.write(self.base_system_message.format(prompt=prompt))

            with Path(self.evaluator_prompt_path).open("w") as f:
                f.write(self.base_system_message.format(prompt=prompt))

            cmd = (
                PIPELINE_CMD
                + ["--config", self.config_file]
                + ["--iterations", str(self.iterations)]
                + self.extra_param
                + [self.init_file, self.evaluator_file]
            )
            print(f"Running command: {' '.join(cmd)}")
            try:
                res = subprocess.run(cmd, capture_output=True, text=True, check=True)
                text = res.stdout.strip()
                print(f"Process output: {text}")
            except subprocess.CalledProcessError as e:
                print(f"Command failed with return code {e.returncode}")
                print(f"stderr: {e.stderr}")
                text = ""

            print(f"# Prompt: {prompt}")
            with Path(self.best_path).open("r") as f:
                best = f.read().strip()
            print(f"# Answer: {best}")

            # honour stop tokens
            if stop:
                for s in stop:
                    idx = best.find(s)
                    if idx != -1:
                        best = best[:idx]
                        break
            outs.append(best)
        return outs

    # for tasks that ask for log likelihood, indicate that it is unsupported
    def loglikelihood(self, requests: Iterable[Tuple[str, str]], **kw):
        # return [(-math.inf, False) for _ in requests]
        raise NotImplementedError

    def loglikelihood_rolling(self, requests: Iterable[str], **kw):
        # return [(-math.inf, False) for _ in requests]
        raise NotImplementedError

    def generate_until(self, requests: Iterable[Any], **kw) -> List[str]:
        # 1) collect the context string and stop sequences of each request
        ctxs, stops = [], []

        for req in requests:
            # ---------------- old: plain tuple ----------------
            if isinstance(req, tuple):
                ctx, until = req

            # -------------- new: Instance object --------------
            else:
                ctx = req.args[0]  # first positional arg
                until = []
                # if a second positional arg exists and is list-like,
                # treat it as the stop sequence
                if len(req.args) > 1 and isinstance(req.args[1], (list, tuple)):
                    until = list(req.args[1])

            ctxs.append(ctx)
            stops.append(until)

        # 2) run the generator once per context
        gens = self.generate(ctxs, stop=None)

        # 3) post-trim at the first stop sequence
        cleaned = []
        for g, until in zip(gens, stops):
            for s in until:
                idx = g.find(s)
                if idx != -1:
                    g = g[:idx]
                    break
            cleaned.append(g)
        return cleaned


if __name__ == "__main__":
    # cli arguments for primary model, secondary model, iterations, config and tasks
    p = argparse.ArgumentParser(
        description="OpenEvolve <-> lm-evaluation-harness adapter.",
    )
    p.add_argument("--config", default="examples/lm_eval/config.yml", help="config file")
    p.add_argument(
        "--init_file",
        default="examples/lm_eval/initial_content_stub.txt",
        help="initial content file",
    )
    p.add_argument(
        "--evaluator_file", default="examples/lm_eval/evaluator_stub.py", help="evaluator file"
    )
    p.add_argument("--iterations", default=5, type=int, help="number of iterations")
    p.add_argument(
        "--limit",
        default=None,
        type=int,
        help="limit the number of examples per task that are executed",
    )
    # p.add_argument("--tasks", default="boolq,gsm8k,mmlu", help="comma-list of tasks to evaluate")
    p.add_argument("--tasks", default="gsm8k", help="list of tasks to evaluate")
    p.add_argument("--output_path", default="results", help="output path for results")
    args = p.parse_args()

    lm_obj = OpenEvolve(
        init_file=args.init_file,
        evaluator_file=args.evaluator_file,
        iterations=args.iterations,
        config_file=args.config,
    )

    task_dict = lm_eval.tasks.get_task_dict(args.tasks.split(","))

    results = evaluate(
        lm=lm_obj,
        task_dict=task_dict,
        limit=args.limit,
    )

    # write out the results
    pathlib.Path(args.output_path).mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    results_path = pathlib.Path(
        os.path.join(
            args.output_path,
            f"{timestamp}_iter{args.iterations}.json",
        )
    )

    with results_path.open("w") as f:
        json.dump(results, f, indent=2)

    # print result summary
    short = {}
    for task, metrics in results["results"].items():
        # pick the first value that is a real number
        for key, val in metrics.items():
            if isinstance(val, (int, float)):
                short[task] = (key, val)  # store *both* name & value
                break

    print(f"Full results written to {results_path}\n")
    print("Headline metrics:")
    for task, (name, value) in short.items():
        print(f" {task:<15} {name:<12} {value:.3%}")

    print("\nNote: Never cite the overall average when some components were skipped!")