Commit ecc07bd

feat: add sandbox logic (#624)
## Summary by CodeRabbit

* **New Features**
  * Added sandbox backend supporting AWS ECS Fargate infrastructure with session management.
  * Enabled interactive command execution and file upload capabilities within sandboxes.
  * Integrated AWS credential validation and comprehensive error handling.
* **Documentation**
  * Added API documentation for the sandbox module and ECS Fargate usage.
* **Tests**
  * Added unit tests for sandbox module behavior and edge cases.

Signed-off-by: Michal Bien <mbien@nvidia.com>
1 parent 8484c1d commit ecc07bd

6 files changed: +1846 -0 lines changed

docs/references/api/nemo-evaluator/api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -45,5 +45,6 @@ The central point of evaluation is ``evaluate()`` function that takes standarize
 
    api-dataclasses
    nemo-evaluator.adapters <../adapters/adapters>
+   nemo-evaluator.sandbox <../sandbox/index>
 
 

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+``nemo_evaluator.sandbox``
+======================================
+
+Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session.
+
+This module is designed to keep dependencies **optional**:
+
+- The ECS Fargate implementation only imports AWS SDKs (``boto3``/``botocore``) when actually used.
+- Using the ECS sandbox also requires the AWS CLI (``aws``) and ``session-manager-plugin`` on the host.
+
+Usage (ECS Fargate)
+-------------------
+
+Typical usage is:
+
+- configure :class:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateConfig`
+- :meth:`~nemo_evaluator.sandbox.ecs_fargate.EcsFargateSandbox.spin_up` a sandbox context
+- create an interactive :class:`~nemo_evaluator.sandbox.base.NemoSandboxSession`
+
+Example::
+
+    from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox
+
+    cfg = EcsFargateConfig(
+        region="us-west-2",
+        cluster="my-ecs-cluster",
+        task_definition="my-task-def:1",
+        container_name="eval",
+        subnets=["subnet-abc"],
+        security_groups=["sg-xyz"],
+        s3_bucket="my-staging-bucket",
+    )
+
+    with EcsFargateSandbox.spin_up(
+        cfg=cfg,
+        task_id="task-001",
+        trial_name="trial-0001",
+        run_id="run-2026-01-12",
+    ) as sandbox:
+        session = sandbox.create_session("main")
+        session.send_keys(["echo hello", "Enter"], block=True)
+        print(session.capture_pane())
+
+Prerequisites / Notes
+---------------------
+
+- The harness host must have **AWS CLI** and **session-manager-plugin** installed.
+- If you use S3-based fallbacks (large uploads / long commands), configure ``s3_bucket``.
+
+.. automodule:: nemo_evaluator.sandbox
+   :members:
+   :undoc-members:
+   :member-order: bysource
+
+
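
The prerequisites in the added docs (the `aws` CLI and `session-manager-plugin` must be on the harness host) lend themselves to a preflight check. The following is only a sketch of how such a check could look using the standard library; `HostToolMissingError`, `missing_host_tools`, and `require_host_tools` are hypothetical names made up for illustration and are not taken from the commit.

```python
import shutil


class HostToolMissingError(RuntimeError):
    """Hypothetical error type for this sketch (not the package's own exception)."""


def missing_host_tools(tools=("aws", "session-manager-plugin")):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def require_host_tools(tools=("aws", "session-manager-plugin")):
    """Raise one readable error that lists every missing executable."""
    missing = missing_host_tools(tools)
    if missing:
        raise HostToolMissingError(
            "missing required executables on PATH: " + ", ".join(missing)
        )
```

Calling `require_host_tools()` before provisioning anything would surface a missing tool early, rather than partway through a sandbox start-up.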
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from .base import NemoEvaluatorSandbox, NemoSandboxCommand, NemoSandboxSession
+from .ecs_fargate import (
+    AwsCliMissingError,
+    EcsExecError,
+    EcsFargateConfig,
+    EcsFargateSandbox,
+)
+
+__all__ = [
+    "NemoEvaluatorSandbox",
+    "NemoSandboxCommand",
+    "NemoSandboxSession",
+    "AwsCliMissingError",
+    "EcsExecError",
+    "EcsFargateConfig",
+    "EcsFargateSandbox",
+]
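
This package file imports `.ecs_fargate` eagerly, while the docs in the same commit state that `boto3`/`botocore` are only imported when the ECS implementation is actually used. One common way to reconcile the two is to defer the SDK import into the function that needs it. The sketch below illustrates that pattern under that assumption; `load_optional` and `make_ecs_client` are hypothetical helpers, and the `ecs_fargate` module itself is not shown in this excerpt.

```python
import importlib


def load_optional(module_name, feature):
    """Import `module_name` on first use, with an actionable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'"
        ) from exc


def make_ecs_client(region):
    """Hypothetical helper: boto3 is imported only when a client is requested."""
    boto3 = load_optional("boto3", "the ECS Fargate sandbox")
    return boto3.client("ecs", region_name=region)
```

With this shape, importing the module that defines `make_ecs_client` succeeds even when `boto3` is not installed; the error appears only when the function is called.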
Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from pathlib import Path
+from typing import ContextManager, Iterable, Protocol, runtime_checkable
+
+
+@dataclass(frozen=True)
+class NemoSandboxCommand:
+    """
+    TB-independent command model for driving an interactive terminal.
+
+    Mirrors the fields terminal-bench agents commonly use (but does not depend on TB).
+    """
+
+    command: str
+    min_timeout_sec: float = 0.0
+    max_timeout_sec: float = 180.0
+    block: bool = False
+    append_enter: bool = True
+
+
+@runtime_checkable
+class NemoSandboxSession(Protocol):
+    """
+    Minimal session API used by agents/harnesses (tmux-like).
+    """
+
+    def send_keys(
+        self,
+        keys: str | list[str],
+        block: bool = False,
+        min_timeout_sec: float = 0.0,
+        max_timeout_sec: float = 180.0,
+    ) -> None: ...
+
+    def send_command(self, command: NemoSandboxCommand) -> None: ...
+
+    def capture_pane(self, capture_entire: bool = False) -> str: ...
+
+    def is_session_alive(self) -> bool: ...
+
+    def get_incremental_output(self) -> str: ...
+
+    def get_asciinema_timestamp(self) -> float: ...
+
+    def copy_to_sandbox(
+        self,
+        paths: list[Path] | Path,
+        container_dir: str | None = None,
+        container_filename: str | None = None,
+    ) -> None: ...
+
+
+class NemoEvaluatorSandbox(ABC):
+    """
+    Abstract factory for evaluator sandboxes.
+
+    Implementations are responsible for provisioning an isolated environment and exposing
+    a tmux-like session API for agents to interact with it.
+    """
+
+    @classmethod
+    @abstractmethod
+    def spin_up(
+        cls,
+        *,
+        task_id: str,
+        trial_name: str,
+        run_id: str,
+        pre_upload_paths: Iterable[Path] | None = None,
+        upload_dest_dir: str | None = None,
+        **kwargs,
+    ) -> ContextManager["NemoEvaluatorSandbox"]:
+        raise NotImplementedError
+
+    @abstractmethod
+    def create_session(
+        self,
+        session_name: str,
+        is_active_stream: bool = False,
+        as_configured_user: bool = True,
+    ) -> NemoSandboxSession:
+        raise NotImplementedError
+
+    @abstractmethod
+    def copy_to_sandbox(
+        self,
+        *,
+        paths: list[Path] | Path,
+        container_dir: str | None = None,
+        container_filename: str | None = None,
+    ) -> None:
+        raise NotImplementedError
+
+    @abstractmethod
+    def stop(self) -> None:
+        raise NotImplementedError
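
To illustrate how the pieces in this file fit together (a `runtime_checkable` protocol for sessions, a frozen command dataclass, and a `spin_up` factory that returns a context manager), here is a small in-memory sketch. It does not import `nemo_evaluator`; `Command` and `Session` are trimmed local stand-ins for `NemoSandboxCommand` and `NemoSandboxSession`, and `FakeSession` / `FakeSandbox` are hypothetical test doubles. How the real implementation interprets `append_enter` or what `capture_pane` returns by default is an assumption here.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Protocol, Union, runtime_checkable


@dataclass(frozen=True)
class Command:
    """Trimmed local stand-in for NemoSandboxCommand."""
    command: str
    block: bool = False
    append_enter: bool = True


@runtime_checkable
class Session(Protocol):
    """Trimmed local stand-in for NemoSandboxSession (three methods only)."""
    def send_keys(self, keys: Union[str, List[str]], block: bool = False) -> None: ...
    def send_command(self, command: Command) -> None: ...
    def capture_pane(self, capture_entire: bool = False) -> str: ...


class FakeSession:
    """Records keystrokes in memory; it does not inherit from Session."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self._pending = ""

    def send_keys(self, keys, block=False):
        for key in [keys] if isinstance(keys, str) else keys:
            if key == "Enter":
                self.lines.append(self._pending)
                self._pending = ""
            else:
                self._pending += key

    def send_command(self, command):
        # Assumption: append_enter means "press Enter after the command text".
        keys = [command.command] + (["Enter"] if command.append_enter else [])
        self.send_keys(keys, block=command.block)

    def capture_pane(self, capture_entire=False):
        # Arbitrary choice for this fake: the default view is the last line only.
        visible = self.lines if capture_entire else self.lines[-1:]
        return "\n".join(visible)


class FakeSandbox:
    """Mimics the spin_up / create_session / stop lifecycle."""

    def __init__(self) -> None:
        self.stopped = False

    @classmethod
    @contextmanager
    def spin_up(cls, *, task_id: str, trial_name: str, run_id: str) -> Iterator["FakeSandbox"]:
        sandbox = cls()
        try:
            yield sandbox
        finally:
            sandbox.stop()  # teardown runs even if the body raises

    def create_session(self, session_name: str) -> FakeSession:
        return FakeSession()

    def stop(self) -> None:
        self.stopped = True
```

Because `Session` is `runtime_checkable`, `isinstance(FakeSession(), Session)` is true through structural matching alone. Note that this check only looks for the presence of the methods, not their signatures.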
