
Commit 47b94f3 (parent f47a276)

[PB] Support Generic Computer Interface (#93)

19 files changed: 729 additions & 344 deletions

project/common/nanoeval_alcatraz/nanoeval_alcatraz/alcatraz_computer_interface.py

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ async def disable_internet(self) -> None:
         res = await self.send_shell_command("hostname")
         cid_prefix = res.output.decode().strip()
 
-        await self.cluster.add_container_network_block_via_ip_tables(cid_prefix)
+        await self._cluster.add_container_network_block_via_ip_tables(cid_prefix)
 
         # Verify
         logger.info("Post-setup network access disabled")

project/paperbench/README.md

Lines changed: 18 additions & 17 deletions
@@ -136,21 +136,16 @@ To see all configurable options for PaperBench, run:
 uv run python -m paperbench.nano.entrypoint --help
 ```
 
-To use our default solver (`BasicAgentSolver`), you must set:
-- `paperbench.solver=paperbench.agents.basicagent.solver:BasicAgentSolver`, specifying to use the `BasicAgentSolver` solver.
-- `paperbench.solver.cluster_config` to `alcatraz.clusters.local:LocalConfig`, to use LocalCluster for managing the agent's container.
-- `paperbench.solver.cluster_config.image` to the Docker image that corresponds base environment mentioned above.
-- `paperbench.solver.is_nvidia_gpu_env=True` if you have a local available NVIDIA gpu and would like the agent to have access to it.
-- `runner.recorder` to `nanoeval.json_recorder:json_recorder` to save the results of the nanoeval run to a JSON file.
+To use our default solver (`BasicAgentSolver`), you must set `paperbench.solver=paperbench.agents.basicagent.solver:BasicAgentSolver`, specifying to use the `BasicAgentSolver` solver.
 
 To test that things are set up correctly, run the dummy agent with the dummy judge (`paperbench.judge.scaffold=dummy`) on the debug split. Note this dummy agent is different to the [default nanoeval "dummy" agent](../common/nanoeval/nanoeval/solvers/computer_tasks/solver.py); our dummy agent tests several aspects of the infrastructure and makes a toy submission.
 
 ```bash
 uv run python -m paperbench.nano.entrypoint \
     paperbench.paper_split=debug \
     paperbench.solver=paperbench.agents.dummy.solver:PaperBenchDummySolver \
-    paperbench.solver.cluster_config=alcatraz.clusters.local:LocalConfig \
-    paperbench.solver.cluster_config.image=pb-env:latest \
+    paperbench.solver.computer_runtime.env=alcatraz.clusters.local:LocalConfig \
+    paperbench.solver.computer_runtime.env.pull_from_registry=false \
     paperbench.judge.scaffold=dummy \
     runner.recorder=nanoeval.json_recorder:json_recorder
 ```
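The dotted `key=value` overrides in these commands map onto nested configuration fields. As a rough illustration of the idea (a toy parser, not chz's actual implementation — the helper name here is hypothetical), dotted overrides fold into a nested config like so:

```python
def apply_overrides(overrides: list[str]) -> dict:
    """Fold dotted key=value strings into a nested dict (toy sketch)."""
    config: dict = {}
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        # Minimal value coercion: booleans and ints, else keep the string.
        if raw_value in ("true", "false"):
            value: object = raw_value == "true"
        elif raw_value.isdigit():
            value = int(raw_value)
        else:
            value = raw_value
        node = config
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

cfg = apply_overrides([
    "paperbench.paper_split=debug",
    "paperbench.solver.computer_runtime.env.pull_from_registry=false",
    "runner.concurrency=20",
])
print(cfg["runner"]["concurrency"])  # 20
print(cfg["paperbench"]["solver"]["computer_runtime"]["env"]["pull_from_registry"])  # False
```

The real CLI does considerably more (type checking against the dataclass fields, instantiating classes from `module:Name` paths), but the nesting behavior is the part worth internalizing when composing commands.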
@@ -164,8 +159,8 @@ cheap models.
 ```bash
 uv run python -m paperbench.nano.entrypoint \
     paperbench.solver=paperbench.agents.basicagent.solver:BasicAgentSolver \
-    paperbench.solver.cluster_config=alcatraz.clusters.local:LocalConfig \
-    paperbench.solver.cluster_config.image=pb-env:latest \
+    paperbench.solver.computer_runtime.env=alcatraz.clusters.local:LocalConfig \
+    paperbench.solver.computer_runtime.env.pull_from_registry=false \
     paperbench.paper_split=dev \
     paperbench.judge.completer_config=preparedness_turn_completer.oai_completions_turn_completer:OpenAICompletionsTurnCompleter.Config \
     paperbench.judge.completer_config.model='gpt-4.1-mini' \
@@ -174,9 +169,9 @@ uv run python -m paperbench.nano.entrypoint \
     runner.recorder=nanoeval.json_recorder:json_recorder
 ```
 
-**Note** that we set the default concurrency to 5, you may want to adjust this based on your setup.
-To set the concurrency, simply specify `runner.concurrency=<int>` in the command.
-**Note** You may use a custom solver: set `paperbench.solver` to point to your custom solver class. This solver should be a `PythonCodingSolver` class.
+- **Note** that we set the default concurrency to 5; you may want to adjust this based on your setup. To set the concurrency, simply specify `runner.concurrency=<int>` in the command.
+- **Note** By default, our solvers use the `AlcatrazComputerRuntime` for the `ComputerRuntime`. You may choose to use your own custom computer runtime. Simply point to it via `paperbench.solver.computer_runtime`, and then configure it appropriately. We recommend looking at the `BasePBSolver` implementation for guidance.
+- **Note** You may use a custom solver: set `paperbench.solver` to point to your custom solver class. This solver should be a `PythonCodingSolver` class. We recommend inheriting from `BasePBSolver` for convenience.
 
 ### Canonical command
 
@@ -194,14 +189,20 @@ uv run python -m paperbench.nano.entrypoint \
     paperbench.solver.completer_config.reasoning.effort='high' \
     paperbench.solver.completer_config.reasoning.summary='auto' \
     paperbench.solver.time_limit=86400 \
-    paperbench.solver.cluster_config=alcatraz.clusters.local:LocalConfig \
-    paperbench.solver.cluster_config.image=pb-env:latest \
+    paperbench.solver.computer_runtime.env=alcatraz.clusters.local:LocalConfig \
+    paperbench.solver.computer_runtime.env.pull_from_registry=false \
+    paperbench.solver.computer_runtime.env.is_nvidia_gpu_env=true \
+    paperbench.reproduction.computer_runtime.env=alcatraz.clusters.local:LocalConfig \
+    paperbench.reproduction.computer_runtime.env.pull_from_registry=false \
+    paperbench.reproduction.computer_runtime.env.is_nvidia_gpu_env=true \
     paperbench.reproduction.timeout=86400 \
     runner.max_retries=0 \
     runner.concurrency=20 \
     runner.recorder=nanoeval.json_recorder:json_recorder
 ```
 
+If you want GPUs when running locally, set `is_nvidia_gpu_env=true` on each component's runtime as shown above. This assumes your Docker host already has the NVIDIA driver and container toolkit installed so the `nvidia` runtime is available. This does not apply if you are using a different `ComputerRuntime`; by default we use the `AlcatrazComputerRuntime`, for which GPUs are configured as described above.
+
 ### I have submissions and just want to run grading
 
 If you have already run your agent or you are a human and have submissions
@@ -230,8 +231,8 @@ uv run python -m paperbench.nano.entrypoint \
     paperbench.paper_split=debug \
     paperbench.solver=paperbench.agents.direct_submission.solver:PBDirectSubmissionSolver \
     paperbench.solver.submissions_dir=path/to/foo/ \
-    paperbench.solver.cluster_config=alcatraz.clusters.local:LocalConfig \
-    paperbench.solver.cluster_config.image=pb-env:latest \
+    paperbench.solver.computer_runtime.env=alcatraz.clusters.local:LocalConfig \
+    paperbench.solver.computer_runtime.env.pull_from_registry=false \
     paperbench.judge.scaffold=dummy \
     runner.recorder=nanoeval.json_recorder:json_recorder
 ```
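The GPU note above presumes the Docker daemon actually exposes an `nvidia` runtime. A hedged pre-flight sketch (the `docker info --format '{{json .Runtimes}}'` invocation is standard Docker CLI; the helper name is ours) that checks the daemon's runtime listing:

```python
import json

def has_nvidia_runtime(docker_runtimes_json: str) -> bool:
    """Return True if the Docker daemon reports an `nvidia` runtime.

    Expects the output of: docker info --format '{{json .Runtimes}}'
    """
    runtimes = json.loads(docker_runtimes_json)
    return "nvidia" in runtimes

# Example daemon output on a host with the NVIDIA Container Toolkit installed:
sample = (
    '{"io.containerd.runc.v2": {"path": "runc"}, '
    '"nvidia": {"path": "nvidia-container-runtime"}, '
    '"runc": {"path": "runc"}}'
)
print(has_nvidia_runtime(sample))  # True
```

If this returns `False` for your daemon, `is_nvidia_gpu_env=true` will not give containers GPU access; install the NVIDIA driver and container toolkit first.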

project/paperbench/paperbench/agents/base.py

Lines changed: 50 additions & 43 deletions
@@ -1,30 +1,34 @@
 import json
 from abc import ABC, abstractmethod
 from collections.abc import AsyncGenerator
-from contextlib import asynccontextmanager
+from contextlib import AsyncExitStack, asynccontextmanager
 from dataclasses import dataclass
 from typing import Awaitable, Callable, ParamSpec, TypeVar
 
 import blobfile as bf
 import structlog
-from nanoeval_alcatraz.task_to_alcatraz_config import task_to_alcatraz_config
+from nanoeval_alcatraz.alcatraz_computer_interface import AlcatrazComputerRuntime
 from typing_extensions import override
 
 import chz
-from alcatraz.clusters.local import LocalConfig
+from alcatraz.clusters.local import VolumesConfig
 from nanoeval.eval import RolloutSystemError
 from nanoeval.solvers.computer_tasks.code_execution_interface import (
     ComputerInterface,
-    RuntimeConfig,
+    ComputerRuntime,
 )
 from nanoeval.solvers.computer_tasks.solver import PythonCodingSolver
 from nanoeval.solvers.computer_tasks.steps import FinalResult, Step
 from nanoeval.solvers.computer_tasks.task import ComputerTask
-from paperbench.agents.utils import MAX_CLUSTER_START_ATTEMPTS
-from paperbench.nano.structs import AgentOutput, PaperBenchGrade, PaperBenchResult
+from paperbench.computer_utils import ReleasableComputer, start_computer_with_retry
+from paperbench.nano.structs import (
+    AgentOutput,
+    AlcatrazPBRuntimeConfig,
+    PaperBenchGrade,
+    PaperBenchResult,
+    PBRuntimeConfig,
+)
 from paperbench.nano.task import PBTask
-from paperbench.nano.utils import prepare_computer
-from paperbench.scripts.alcatraz_services import start_alcatraz_computer
 
 logger = structlog.stdlib.get_logger(component=__name__)
 
@@ -44,17 +48,19 @@ class BasePBSolver(PythonCodingSolver, ABC):
     Helper base class for PaperBench solvers.
     Not necessary to inherit from this class to evaluate your solver on PaperBench
     But provides useful functionality, such as a standard PythonCodingSolver.run method
-    that handles the task setup and grading.
+    that handles the setup and grading.
     """
 
-    cluster_config: LocalConfig = chz.field(
-        default_factory=lambda: LocalConfig(
-            image="pb-env:latest",
-            pull_from_registry=False,
-        )
+    computer_runtime: ComputerRuntime = chz.field(default_factory=AlcatrazComputerRuntime)
+    runtime_config: PBRuntimeConfig = chz.field(default_factory=AlcatrazPBRuntimeConfig)
+    mount_docker_socket: bool = chz.field(
+        default=True,
+        doc=(
+            "Whether to mount the docker socket to enable docker-in-computer. "
+            "Deprecated: relies on the deprecated ComputerConfiguration.volumes_config "
+            "field and may stop working when runtimes remove volumes_config support."
+        ),
     )
-    mount_docker_socket: bool = chz.field(default=True)
-    runtime_config: RuntimeConfig = chz.field(default_factory=RuntimeConfig)
 
     @abstractmethod
     async def _run_agent(self, computer: ComputerInterface, task: PBTask) -> AgentOutput:
@@ -67,27 +73,32 @@ async def _setup_computer(self, computer: ComputerInterface, task: PBTask) -> None:
         """Solver-specific setup, run right after task has been setup. By default does nothing."""
         return
 
+    def _handle_docker_socket_mounting(self, task: PBTask) -> PBTask:
+        if self.mount_docker_socket:
+            volumes_config = VolumesConfig()
+            volumes_config["dockersocket"] = {
+                "bind_source": "/var/run/docker.sock",
+                "bind_dest": "/var/run/docker.sock",
+                "mode": "rw",
+            }
+            task.volumes_config = {**(task.volumes_config or {}), **volumes_config}
+
+        return task
+
     @asynccontextmanager
     async def _start_computer(self, task: PBTask) -> AsyncGenerator[ComputerInterface, None]:
         ctx_logger = logger.bind(
             run_group_id=task.run_group_id, run_id=task.run_id, runs_dir=task.runs_dir
         )
 
         ctx_logger.info(
-            f"cluster_config: {json.dumps(self.cluster_config, indent=4, sort_keys=True, default=str)}",
+            "Attempting to start a computer instance. This may take a while...",
             destinations=["run"],
         )
-        ctx_logger.info(
-            "Attempting to start a cluster instance. This may take a while...",
-            destinations=["run"],
-        )
-        alcatraz_config = task_to_alcatraz_config(task, self.cluster_config)
-        alcatraz_config = prepare_computer(
-            alcatraz_config=alcatraz_config, mount_docker_socket=self.mount_docker_socket
-        )
-
-        async with start_alcatraz_computer(
-            cluster_config=alcatraz_config, max_attempts=MAX_CLUSTER_START_ATTEMPTS
+        async with start_computer_with_retry(
+            computer_runtime=self.computer_runtime,
+            computer_config=task,
+            exception_types=self.runtime_config.exception_types,
         ) as computer:
             yield computer
 
@@ -139,17 +150,7 @@ async def _setup(self, computer: ComputerInterface, task: PBTask) -> SetupOutcome:
             run_group_id=task.run_group_id, run_id=task.run_id, runs_dir=task.runs_dir
         )
         try:
-            await self._try_system_operation(
-                operation=task.setup,
-                operation_desc="Task setup",
-                task=task,
-                computer=computer,
-                runtime_config=self.runtime_config,
-            )
-            # solver-specific setup
             await self._setup_computer(computer, task)
-        except RolloutSystemError:
-            raise
         except Exception as e:
             exception_msg = f"Solver-specific computer setup failed: {str(e)}"
             ctx_logger.exception(exception_msg, destinations=["run"])
@@ -213,17 +214,23 @@ async def _grade(
     async def run(self, task: ComputerTask) -> AsyncGenerator[Step | FinalResult, None]:
         assert isinstance(task, PBTask), f"Task must be a PBTask, got {type(task)}"
 
-        async with self._start_computer(task) as computer:
-            # 1. Run the task and solver setup;
-            setup_outcome = await self._setup(computer, task)
+        task = self._handle_docker_socket_mounting(task)
+        async with AsyncExitStack() as computer_stack:
+            # Use a releasable wrapper so grading can free this computer before spinning up others.
+            computer_context_manager = self._start_computer(task)
+            computer = await computer_stack.enter_async_context(computer_context_manager)
+            releasable_computer = ReleasableComputer(computer, computer_stack)
+
+            # 1. Run the solver setup; we might need to exit early
+            setup_outcome = await self._setup(releasable_computer, task)
             # if setup failed, we exit early, yielding a score of 0
             if not setup_outcome.success:
                 yield FinalResult(grade=task.early_exit_grade(grader_log=setup_outcome.err_msg))
                 return
 
             # 2. Run the agent, serialize the output to metadata.json
-            agent_output = await self._run_save_and_check(computer, task)
+            agent_output = await self._run_save_and_check(releasable_computer, task)
 
             # 3. Grade the submission
-            grade = await self._grade(releasable_computer, task, agent_output)
+            grade = await self._grade(releasable_computer, task, agent_output)
             yield FinalResult(grade=grade)
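The `AsyncExitStack` pattern above lets a wrapper close the computer on demand instead of waiting for the `async with` block to exit. This diff does not show `ReleasableComputer`'s implementation, so the following is only a toy analog of the pattern under the assumption that releasing means unwinding the stack early (all names here are illustrative):

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

events: list[str] = []

@asynccontextmanager
async def computer():
    # Stand-in for starting a computer instance.
    events.append("started")
    try:
        yield "computer"
    finally:
        events.append("released")

class Releasable:
    """Toy analog of a releasable wrapper: frees the resource on demand."""

    def __init__(self, resource, stack: AsyncExitStack):
        self.resource = resource
        self._stack = stack

    async def release(self) -> None:
        # Unwinds everything entered on the stack; exiting the stack's
        # own `async with` afterwards is then a no-op.
        await self._stack.aclose()

async def main() -> None:
    async with AsyncExitStack() as stack:
        comp = await stack.enter_async_context(computer())
        releasable = Releasable(comp, stack)
        await releasable.release()  # free the computer before heavy follow-up work
        events.append("grading")    # runs after the computer is already gone

asyncio.run(main())
print(events)  # ['started', 'released', 'grading']
```

The payoff is resource scheduling: grading can spin up its own reproduction computers without holding the solver's computer alive for the whole run.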

project/paperbench/paperbench/agents/dummy/main.py

Lines changed: 0 additions & 10 deletions
@@ -1,21 +1,11 @@
 import getpass
 import os
-import subprocess
 import sys
 
 if os.environ.get("DEBUG", None) == "1":
     os.system("tail -f /dev/null")  # block indefinitely
 
 
-# check if gpu is available on machine
-try:
-    subprocess.run(["nvidia-smi"], check=True)
-    print("nvidia-smi command executed successfully.")
-except Exception as e:
-    print(f"nvidia-smi failed with error: {e}")
-    print("This is expected if you are not running on a GPU instance.")
-
-
 # Get the current user's username
 username = getpass.getuser()

project/paperbench/paperbench/agents/dummy/solver.py

Lines changed: 10 additions & 0 deletions
@@ -51,6 +51,16 @@ async def _run_agent(self, computer: ComputerInterface, task: PBTask) -> AgentOutput:
             res += f"{result.output.decode('utf-8')}\n"
         await computer.upload(res.encode("utf-8"), f"{LOGS_DIR}/run.log")
 
+        result = await computer.send_shell_command("nvidia-smi")
+        gpu_log = "\n".join(
+            [
+                "$ nvidia-smi",
+                result.output.decode("utf-8", errors="replace").strip() or "<no output>",
+                f"exit_code={result.exit_code}",
+            ]
+        )
+        await computer.upload(f"{gpu_log}\n".encode("utf-8"), f"{LOGS_DIR}/gpu.log")
+
         agent_output = AgentOutput(
             run_id=task.run_id,
             time_start=start_time,
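The `gpu.log` contents built above degrade gracefully when `nvidia-smi` prints nothing: decoding with `errors="replace"` never raises, and stripping an empty string leaves a falsy value, so the `or "<no output>"` placeholder kicks in. A standalone sketch of that formatting (the helper name is ours; the log layout mirrors the diff):

```python
def format_gpu_log(output: bytes, exit_code: int) -> str:
    """Command, its output (or a placeholder), and the exit code, one per line."""
    body = output.decode("utf-8", errors="replace").strip() or "<no output>"
    return "\n".join(["$ nvidia-smi", body, f"exit_code={exit_code}"])

# No GPU: command not found produces empty output and exit code 127.
print(format_gpu_log(b"", 127))
# $ nvidia-smi
# <no output>
# exit_code=127
```

Recording the exit code alongside the output makes the log unambiguous: exit code 0 with real output means a GPU was visible, while a nonzero code documents exactly how the probe failed.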

project/paperbench/paperbench/agents/utils.py

Lines changed: 0 additions & 2 deletions
@@ -24,8 +24,6 @@
 
 logger = structlog.stdlib.get_logger(component=__name__)
 
-MAX_CLUSTER_START_ATTEMPTS = 3
-
 
 async def check_for_existing_run(task: PBTask) -> AgentOutput | None:
     ctx_logger = logger.bind(
