Commit 1f071ba: Add RE-Bench Plot
1 parent cce117c

File tree: 1 file changed (+12, -0 lines)

README.md

Lines changed: 12 additions & 0 deletions
@@ -198,7 +198,19 @@ weco --source examples/spaceship-titanic/optimize.py \

---

### Performance & Expectations

Weco, powered by the AIDE algorithm, optimizes code iteratively based on your evaluation results. Achieving significant improvements, especially on complex research-level tasks, often requires substantial exploration time.

The following plot from the independent [Research Engineering Benchmark (RE-Bench)](https://metr.org/AI_R_D_Evaluation_Report.pdf) report shows the performance of AIDE (the algorithm behind Weco) on challenging ML research engineering tasks across different time budgets.

<p align="center">
  <img src="https://github.com/user-attachments/assets/ff0e471d-2f50-4e2d-b718-874862f533df" alt="RE-Bench Performance Across Time" width="60%"/>
</p>

*(Source: METR RE-Bench, Figure 5. AIDE (o1-preview) vs. Human Expert Percentiles)*

As the plot shows, AIDE makes strong performance gains over time, surpassing the lower human-expert percentiles within hours and continuing to improve. This highlights the potential of evaluation-driven optimization, but it also means that reaching performance comparable to human experts on difficult benchmarks can take considerable time (tens of hours in this benchmark, corresponding to many `--steps` in the Weco CLI). Factor this in when setting the number of `--steps` for your optimization runs.
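One rough way to size `--steps` is to work backward from a wall-clock budget. The sketch below assumes a hypothetical per-step evaluation time of two minutes; the step count and timing are illustrative, while the `--source` and `--steps` flags are the ones shown in this README.

```shell
# Hypothetical sizing: with ~2 minutes of evaluation per optimization
# step, a 10-hour budget allows roughly 300 steps.
HOURS=10
MINUTES_PER_STEP=2
STEPS=$(( HOURS * 60 / MINUTES_PER_STEP ))
echo "$STEPS"   # 300

# Then launch Weco with that budget (step count is illustrative):
# weco --source examples/spaceship-titanic/optimize.py --steps "$STEPS"
```

The right per-step time depends entirely on how long your evaluation script takes, so measure a few steps before committing to a long run.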

---

### Important Note on Evaluation