# AIME 2024 Prompt-Engineering Example

This example shows how **Weco** can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files, and finishes in a few hours on a laptop.

## Files in this folder

| File | Purpose |
| --- | --- |
| `optimize.py` | Holds the prompt template, the mutable `EXTRA_INSTRUCTIONS` string, and the LLM call. Weco edits **only** this file during the search. |
| `eval_aime.py` | Downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel, prints progress logs, and finally prints an `accuracy:` line that Weco reads. |
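
The exact contents of `optimize.py` are not reproduced here; the following is a minimal sketch of its likely shape. Only `EXTRA_INSTRUCTIONS` and `solve` are named in this README — `PROMPT_TEMPLATE` and the stubbed `call_llm` helper are assumptions for illustration.

```python
# optimize.py -- sketch only; names other than EXTRA_INSTRUCTIONS and solve
# are illustrative assumptions, not the real file.

PROMPT_TEMPLATE = (
    "You are an expert competition mathematician. "
    "Solve the following AIME problem and give the final integer answer "
    "on the last line as 'Answer: <n>'.\n\n{extra}\n\nProblem:\n{problem}"
)

# Weco mutates only this string during the search.
EXTRA_INSTRUCTIONS = "Think step by step and double-check your arithmetic."


def call_llm(prompt: str) -> str:
    """Placeholder for the real model call (e.g. a Gemini or OpenAI client)."""
    raise NotImplementedError


def solve(problem: str) -> str:
    """Build the full prompt and return the model's raw answer text."""
    prompt = PROMPT_TEMPLATE.format(extra=EXTRA_INSTRUCTIONS, problem=problem)
    return call_llm(prompt)
```

Because only the suffix string changes between iterations, each Weco step is a cheap text edit rather than a code rewrite, which keeps the search fast and easy to diff.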

## Quick start

1. **Clone the repository and enter the folder.**
   ```bash
   git clone https://github.com/your-fork/weco-examples.git
   cd weco-examples/aime-2024
   ```
2. **Run Weco.** The command below edits `EXTRA_INSTRUCTIONS` in `optimize.py`, invokes `eval_aime.py` on every iteration, reads the printed accuracy, and keeps the best variants.
   ```bash
   weco --source optimize.py \
        --eval-command "python eval_aime.py" \
        --metric accuracy \
        --maximize true \
        --steps 40 \
        --model gemini-2.5-pro-exp-03-25
   ```

During each evaluation round you will see log lines similar to the following.

```text
[setup] loading 20 problems from AIME 2024 …
[progress] 5/20 completed, elapsed 7.3 s
[progress] 10/20 completed, elapsed 14.6 s
[progress] 15/20 completed, elapsed 21.8 s
[progress] 20/20 completed, elapsed 28.9 s
accuracy: 0.0500
```

Weco then mutates the suffix, tries again, and gradually pushes the accuracy higher. On a modern laptop you can usually double the baseline score within thirty to forty iterations.

## How it works

* `eval_aime.py` slices the **Maxwell-Jia/AIME_2024** dataset to twenty problems for fast feedback. You can change the slice in one line.
* The script sends model calls in parallel via `ThreadPoolExecutor`, so network latency is hidden.
* Every five completed items, the script logs progress and elapsed time.
* The final line `accuracy: value` is the only part Weco needs for guidance.
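
The evaluation loop described above can be sketched as follows. This is a self-contained approximation, not the real script: the actual `eval_aime.py` loads **Maxwell-Jia/AIME_2024** (e.g. via the `datasets` library) and grades `optimize.solve` outputs, whereas this sketch substitutes a toy problem list and a dummy grader so it runs without network access.

```python
# Structural sketch of eval_aime.py; the toy PROBLEMS list and the dummy
# grader stand in for the real dataset slice and model-answer checking.
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Real version: a 20-problem slice of the Maxwell-Jia/AIME_2024 dataset.
PROBLEMS = [{"problem": f"toy problem {i}", "answer": i} for i in range(20)]


def solve_and_grade(item) -> bool:
    # Real version: parse optimize.solve(item["problem"]) and compare
    # the extracted integer against item["answer"].
    return item["answer"] % 4 == 0  # dummy stand-in for correctness


def main() -> float:
    start = time.time()
    correct = 0
    done = 0
    # Parallel submission hides per-call network latency.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(solve_and_grade, p) for p in PROBLEMS]
        for fut in as_completed(futures):
            correct += fut.result()
            done += 1
            if done % 5 == 0:  # log every five completed items
                print(f"[progress] {done}/{len(PROBLEMS)} completed, "
                      f"elapsed {time.time() - start:.1f} s")
    accuracy = correct / len(PROBLEMS)
    print(f"accuracy: {accuracy:.4f}")  # the only line Weco parses
    return accuracy


if __name__ == "__main__":
    main()
```

Keeping the metric on the final stdout line in a fixed `accuracy: <float>` format is what lets Weco score each candidate prompt without any extra integration code.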