Commit 1f071ba: Add RE-Bench Plot
1 parent cce117c

File tree: 1 file changed (+12, -0 lines)

README.md

Lines changed: 12 additions & 0 deletions
@@ -198,7 +198,19 @@ weco --source examples/spaceship-titanic/optimize.py \

---

### Performance & Expectations

Weco, powered by the AIDE algorithm, optimizes code iteratively based on your evaluation results. Achieving significant improvements, especially on complex research-level tasks, often requires substantial exploration time.

The following plot from the independent [Research Engineering Benchmark (RE-Bench)](https://metr.org/AI_R_D_Evaluation_Report.pdf) report shows the performance of AIDE (the algorithm behind Weco) on challenging ML research engineering tasks across different time budgets.

<p align="center">
  <img src="https://github.com/user-attachments/assets/ff0e471d-2f50-4e2d-b718-874862f533df" alt="RE-Bench Performance Across Time" width="60%"/>
</p>

*(Source: METR RE-Bench, Figure 5. AIDE (o1-preview) vs. Human Expert Percentiles)*

As the plot shows, AIDE makes strong performance gains over time, surpassing the lower human-expert percentiles within hours and continuing to improve. This highlights the potential of evaluation-driven optimization, but it also means that reaching performance comparable to human experts on difficult benchmarks can take considerable time (tens of hours in this benchmark, corresponding to many `--steps` in the Weco CLI). Factor this in when setting the number of `--steps` for your optimization runs.
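One rough way to size `--steps` is to work backward from a wall-clock budget. The sketch below assumes a hypothetical per-step evaluation time of two minutes; the step count and timing are illustrative, while the `--source` and `--steps` flags are the ones shown in this README.

```shell
# Hypothetical sizing: with ~2 minutes of evaluation per optimization
# step, a 10-hour budget allows roughly 300 steps.
HOURS=10
MINUTES_PER_STEP=2
STEPS=$(( HOURS * 60 / MINUTES_PER_STEP ))
echo "$STEPS"   # 300

# Then launch Weco with that budget (step count is illustrative):
# weco --source examples/spaceship-titanic/optimize.py --steps "$STEPS"
```

The right per-step time depends entirely on how long your evaluation script takes, so measure a few steps before committing to a long run.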

---

### Important Note on Evaluation