This repo contains the dataset and code for the paper "PaperBench: Evaluating AI's Ability to Replicate AI Research".

## Leaderboard

### PaperBench Results

| Agent | Score (%) | # runs | Date |
|-----------------------------------------|------------|--------|------------|
| IterativeAgent o1-high (36h limit) | 26.0 ± 0.3 | 3 | 2025-04-02 |
| IterativeAgent o1-high (24h limit) | 24.4 ± 0.7 | 3 | 2025-04-02 |
| BasicAgent claude-3.5-sonnet | 21.0 ± 0.8 | 3 | 2025-04-02 |
| IterativeAgent claude-3.5-sonnet | 16.1 ± 0.1 | 3 | 2025-04-02 |
| BasicAgent o1-high | 13.2 ± 0.3 | 3 | 2025-04-02 |
| IterativeAgent o3-mini-high | 8.5 ± 0.8 | 3 | 2025-04-02 |
| BasicAgent deepseek-r1 | 6.0 ± 0.3 | 3 | 2025-04-02 |
| BasicAgent gpt-4o | 4.1 ± 0.1 | 3 | 2025-04-02 |
| BasicAgent gemini-2.0-flash | 3.2 ± 0.2 | 3 | 2025-04-02 |
| BasicAgent o3-mini-high | 2.6 ± 0.2 | 3 | 2025-04-02 |

### PaperBench Code-Dev Results

| Agent | Score (%) | # runs | Date |
|-----------------------------|------------|--------|------------|
| IterativeAgent o1-high | 43.4 ± 0.8 | 3 | 2025-04-02 |
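
The score column reports an aggregate over the listed runs. Assuming the ± value denotes the standard error of the mean across runs (an assumption; check the paper for the exact definition), the aggregation can be sketched as follows, with made-up per-run scores purely for illustration:

```python
# Sketch of leaderboard-style aggregation: mean ± standard error over runs.
# The per-run scores below are hypothetical, not actual PaperBench results.
from math import sqrt
from statistics import mean, stdev

run_scores = [25.7, 26.0, 26.3]  # hypothetical replication scores (%) from 3 runs

m = mean(run_scores)
sem = stdev(run_scores) / sqrt(len(run_scores))  # standard error of the mean
print(f"{m:.1f} ± {sem:.1f}")  # formatted like the Score (%) column
```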

## Introduction

PaperBench evaluates AI agents on replicating 20 Spotlight and Oral papers from ICML 2024 from scratch.

Each sample of PaperBench includes a research paper and a rubric that defines the requirements for a successful replication.

You will need to build the images for each agent that you want to run. We provide:

- [paperbench/agents/dummy/Dockerfile](paperbench/agents/dummy/Dockerfile): An agent that creates a dummy submission, useful for testing the eval end-to-end.
- [paperbench/agents/aisi-basic-agent/Dockerfile](paperbench/agents/aisi-basic-agent/Dockerfile): Simple ReAct-style agents with tools available to them.

For convenience, we've provided a [script](paperbench/scripts/build-docker-images.sh) that builds all the above images:

```bash
bash paperbench/scripts/build-docker-images.sh
```

## PaperBench Code-Dev

**PaperBench Code-Dev** is a lighter-weight variant of PaperBench. Unlike the full PaperBench pipeline -- which involves executing the agent's submission in a separate reproduction step -- PaperBench Code-Dev skips the reproduction step and only grades the agent's submission on the **Code Development** requirements. This means:

- The Judge only checks **Code Development** requirements (e.g., "Is there an implementation of method X?"). It skips Execution requirements, which check that the code runs correctly, and Result Match requirements, which check that the paper's empirical results have been replicated.
- You **don't need a GPU to run the reproduction step** where the agent's submission is executed. This often reduces cost and runtime significantly.
- There is **less need to make a GPU available to the agent** when it is creating its submission. A GPU helps the agent run intensive experiments that verify its code is correct, but since the agent is only graded on **Code Development** requirements, it can get away with less end-to-end testing of its code.

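The distinction above can be pictured as a filter over rubric requirements. This is only a toy sketch -- the `Requirement` class and the category names below are hypothetical illustrations, not the repo's actual rubric API:

```python
# Toy sketch of Code-Dev grading as a filter over rubric requirements.
# The Requirement class and category strings are hypothetical, chosen to
# mirror the three requirement types described above; they are not the
# actual PaperBench API.
from dataclasses import dataclass

@dataclass
class Requirement:
    description: str
    category: str  # "code_development", "execution", or "result_match"

rubric = [
    Requirement("There is an implementation of method X", "code_development"),
    Requirement("The training script runs without error", "execution"),
    Requirement("The paper's empirical results are reproduced", "result_match"),
]

# Code-Dev grades only the Code Development requirements,
# skipping Execution and Result Match entirely:
code_dev_requirements = [r for r in rubric if r.category == "code_development"]
for r in code_dev_requirements:
    print(r.description)
```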
We think PaperBench Code-Dev offers a convenient, lower-cost, but less rigorous way of assessing paper replication. It doesn't require GPUs and typically cuts grading costs (we've seen around an 85% reduction in o3-mini SimpleJudge costs for the average submission), making it an accessible alternative for assessing models' abilities to replicate papers.

To run the Code-Dev variant, simply include the following flag: