Commit b0f18f5

add blog to exp-bench
1 parent 96a0ed3 commit b0f18f5

File tree

1 file changed: 1 addition, 1 deletion


source/_data/SymbioticLab.bib

Lines changed: 1 addition & 1 deletion
@@ -2096,7 +2096,7 @@ @Article{expbench:arxiv25
   publist_confkey = {arXiv:2502.16069},
   publist_link = {paper || https://arxiv.org/abs/2505.24785},
   publist_link = {code || https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench},
-  publist_link = {blog || https://www.just-curieous.com},
+  publist_link = {blog || https://www.just-curieous.com/machine-learning/research/2025-06-11-exp-bench-can-ai-conduct-ai-research-experiments.html},
   publist_topic = {Systems + AI},
   publist_abstract = {
   Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high-fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
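The changed line follows the `label || url` convention used by the other `publist_link` fields in this entry. A minimal sketch of how such a value could be split into its label and URL parts (this is a hypothetical helper for illustration, not the actual parsing logic of the site's publist plugin, which may differ):

```python
def parse_publist_link(value: str) -> tuple[str, str]:
    # Split a "label || url" value on the first "||" separator,
    # as seen in the publist_link fields of the .bib entry above.
    label, _, url = value.partition("||")
    return label.strip(), url.strip()

label, url = parse_publist_link(
    "blog || https://www.just-curieous.com/machine-learning/research/"
    "2025-06-11-exp-bench-can-ai-conduct-ai-research-experiments.html"
)
# label is "blog"; url is the full blog post URL added by this commit.
```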

0 commit comments