# AIME 2024 Prompt-Engineering Example

This example shows how **Weco** can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files, and finishes in a few hours on a laptop.

## Files in this folder

| File | Purpose |
| --- | --- |
| `optimize.py` | Holds the prompt template, the mutable `EXTRA_INSTRUCTIONS` string, and the LLM call. Weco edits **only** this file during the search. |
| `eval_aime.py` | Downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel, prints progress logs, and finally prints an `accuracy:` line that Weco reads. |
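The repository itself is the source of truth for `optimize.py`; as a rough sketch, the file might be structured like this, with Weco mutating only the `EXTRA_INSTRUCTIONS` string. The template wording, model name, and OpenAI-style client below are illustrative assumptions, not the example's exact code:

```python
# optimize.py -- structural sketch; the template text and LLM client
# are illustrative assumptions, not the example's exact code.

# Weco edits ONLY this string during the search.
EXTRA_INSTRUCTIONS = "Think step by step, then give only the final integer answer."

PROMPT_TEMPLATE = (
    "You are an expert competition mathematician.\n"
    "Problem: {problem}\n"
    "{extra}\n"
    "Answer with a single integer between 0 and 999."
)

def build_prompt(problem: str) -> str:
    """Assemble the full prompt from the fixed template plus the mutable suffix."""
    return PROMPT_TEMPLATE.format(problem=problem, extra=EXTRA_INSTRUCTIONS)

def solve(problem: str) -> str:
    """Send the prompt to an LLM and return its raw text reply.
    The client call is a placeholder; swap in your provider's SDK."""
    from openai import OpenAI  # hypothetical provider choice
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": build_prompt(problem)}],
    )
    return resp.choices[0].message.content
```

Keeping the template fixed and exposing only `EXTRA_INSTRUCTIONS` gives the search a small, well-defined surface to mutate.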
## Quick start

1. **Clone the repository and enter the folder.**

   ```bash
   git clone https://github.com/your-fork/weco-examples.git
   cd weco-examples/aime-2024
   ```

2. **Run Weco.** The command below edits `EXTRA_INSTRUCTIONS` in `optimize.py`, invokes `eval_aime.py` on every iteration, reads the printed accuracy, and keeps the best variants.

   ```bash
   weco --source optimize.py \
        --eval-command "python eval_aime.py" \
        --metric accuracy \
        --maximize true \
        --steps 40 \
        --model gemini-2.5-pro-exp-03-25
   ```
During each evaluation round you will see log lines similar to the following.

```text
[setup] loading 20 problems from AIME 2024 …
[progress] 5/20 completed, elapsed 7.3 s
[progress] 10/20 completed, elapsed 14.6 s
[progress] 15/20 completed, elapsed 21.8 s
[progress] 20/20 completed, elapsed 28.9 s
accuracy: 0.0500
```
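The final `accuracy:` line is the contract between the evaluator and the optimizer. Conceptually, extracting the metric from a log like the one above amounts to something like the following (an illustration of the idea, not Weco's actual parser):

```python
import re

def extract_metric(log: str, metric: str = "accuracy") -> float:
    """Pull the last `metric: value` line out of an evaluation log."""
    matches = re.findall(rf"^{metric}:\s*([0-9.]+)\s*$", log, flags=re.MULTILINE)
    if not matches:
        raise ValueError(f"no '{metric}:' line found in log")
    return float(matches[-1])  # last occurrence wins
```

Anything else the script prints is ignored, which is why the progress logs are free-form.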
Weco then mutates the suffix, tries again, and gradually pushes the accuracy higher. On a modern laptop you can usually double the baseline score within thirty to forty iterations.
## How it works

* `eval_aime.py` slices the **Maxwell-Jia/AIME_2024** dataset to twenty problems for fast feedback. You can change the slice in one line.
* The script sends model calls in parallel via `ThreadPoolExecutor`, so network latency is hidden.
* Every five completed items, the script logs progress and elapsed time.
* The final line `accuracy: value` is the only part Weco needs for guidance.
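Putting those pieces together, the evaluation loop can be sketched as follows. The grading rule (last integer in the reply compared to the answer key), worker count, and demo data are assumptions for illustration, not the example's exact code:

```python
# eval_aime.py -- structural sketch; grading rule and worker count are
# illustrative assumptions, not the example's exact code.
import re
import time
from concurrent.futures import ThreadPoolExecutor

def grade(reply: str, truth: str) -> bool:
    """Hypothetical grader: compare the last integer in the reply to the key."""
    found = re.findall(r"-?\d+", reply)
    return bool(found) and int(found[-1]) == int(truth)

def run(problems, answers, solve_fn, workers=8):
    """Solve all problems in parallel, log every 5 completions,
    then print the accuracy line Weco parses."""
    start, done, correct = time.time(), 0, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for reply, truth in zip(pool.map(solve_fn, problems), answers):
            done += 1
            correct += grade(reply, truth)
            if done % 5 == 0:
                print(f"[progress] {done}/{len(problems)} completed, "
                      f"elapsed {time.time() - start:.1f} s")
    acc = correct / len(problems)
    print(f"accuracy: {acc:.4f}")  # the only line Weco reads
    return acc

if __name__ == "__main__":
    # In the real script the problems would come from a slice of the
    # Maxwell-Jia/AIME_2024 dataset rather than this stub data.
    demo = ["What is 1+1?", "What is 2+2?"]
    keys = ["2", "5"]
    run(demo, keys, lambda p: "The answer is 2.", workers=2)
```

Because `ThreadPoolExecutor.map` preserves input order, replies can be zipped directly against the answer key.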
