Commit 40c3971

scotwells and claude committed
docs: add Agent Sandboxes enhancement proposal
Introduces the product-level proposal for Agent Sandboxes: a new compute experience for AI agents that need isolated, ready-to-use environments on demand.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0a6eb1b commit 40c3971

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
1+
# Agent Sandboxes
2+
3+
Status: Proposal
4+
Audience: Product, platform operators, prospective customers
5+
Date: 2026-04-08
6+
7+
## What we're building
8+
9+
A new Datum compute experience purpose-built for **AI agents that need
10+
their own isolated, ready-to-use environment**. Instead of asking users to
11+
assemble a workload, pick regions, and configure scaling, we're shipping a
12+
small set of resources that let anyone — or any agent — say:
13+
14+
> "Give me a copy of the Python data-science sandbox, just for this session."
15+
16+
…and get a fully initialized, isolated environment back in the time it takes
17+
to load a web page.
18+
19+
We call the new capability **Agent Sandboxes**.
20+
21+
## Why we're building it
22+
23+
The teams building AI agents today are stuck between two bad options:
24+
25+
- **Run everything in one shared container.** Cheap and fast, but anything
26+
the agent does — installing packages, executing model-generated code,
27+
touching files — leaks into the next session. One bad command can break
28+
the environment for everyone.
29+
- **Spin up a full cloud workload per session.** Properly isolated, but
30+
slow to start, expensive to keep idle, and wildly over-engineered for
31+
"I just need a Python process for the next 20 minutes." Users have to
32+
learn deployment concepts that have nothing to do with their problem.
33+
34+
Neither option fits how agents actually work. An agent platform typically
35+
needs **many small, short-lived, strongly isolated environments**, created
36+
on demand, often hundreds or thousands per day, with state that survives a
37+
pause and disappears when the session ends.
38+
39+
This is also where the broader ecosystem is heading. The Kubernetes
40+
community has started a project — [agent-sandbox][upstream] — specifically
41+
to standardize this shape of compute. Datum is well-positioned to offer a
42+
best-in-class version of it: our underlying infrastructure (Unikraft-based
43+
microVMs with snapshot/restore) makes "instant, isolated, stateful" the
44+
default rather than the exception.
45+
46+
The product opportunity is to turn agent sandboxes into a **catalog Datum
47+
ships and curates**, the way cloud providers ship machine images — except
48+
allocation is measured in milliseconds and idle copies cost almost nothing.
49+
50+
## What becomes available
51+
52+
Three new resources, layered so that the simple case stays simple and the
53+
advanced case stays possible.
54+
55+
### 1. `Sandbox` — one isolated environment
56+
57+
The core building block. A `Sandbox` represents a single, isolated, stateful
58+
environment running one image. It has a stable name, stable network address,
59+
and persistent storage that survives pause and resume.
60+
61+
This is the lowest-level resource. Most users never touch it directly.
62+
63+
### 2. `SandboxTemplate` — a reusable, curated environment definition
64+
65+
A **named, versioned blueprint** for a kind of sandbox: which image to run,
66+
how much CPU/memory, which ports to expose, how long to keep it warm, what
67+
isolation level to use. Datum ships and maintains a catalog of these out
68+
of the box — `python-agent-runtime`, `node-agent-runtime`, `code-interpreter`,
69+
`headless-browser`, `jupyter-datascience`, and so on. Customers and partners
70+
can publish their own templates into their own namespaces using the same
71+
mechanism.
72+
73+
Templates are the product surface. They are what users browse, pick, and
74+
build against.
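
As a rough illustration of what such a blueprint might carry, here is a minimal sketch in Python. The proposal does not define a schema, so every field name (`cpu_millicores`, `memory_mib`, `idle_timeout_seconds`, `warm_pool_size`, `isolation`), the default values, and the registry URL are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxTemplateSpec:
    """Hypothetical template shape; not the actual Datum schema."""

    name: str
    version: str
    image: str
    cpu_millicores: int
    memory_mib: int
    ports: tuple[int, ...] = ()
    idle_timeout_seconds: int = 600   # how long an idle sandbox is kept
    warm_pool_size: int = 0           # pre-initialized copies to keep ready
    isolation: str = "microvm"        # assumed label for the isolation level

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if self.cpu_millicores <= 0:
            problems.append("cpu_millicores must be positive")
        if self.memory_mib <= 0:
            problems.append("memory_mib must be positive")
        if any(not 1 <= port <= 65535 for port in self.ports):
            problems.append("ports must be in the range 1-65535")
        if self.idle_timeout_seconds <= 0:
            problems.append("idle_timeout_seconds must be positive")
        if self.warm_pool_size < 0:
            problems.append("warm_pool_size must not be negative")
        return problems


# Example values loosely based on the headless-browser journey in this proposal.
headless_browser = SandboxTemplateSpec(
    name="headless-browser",
    version="1.0",
    image="registry.example.com/headless-browser:1.0",  # hypothetical image
    cpu_millicores=1000,
    memory_mib=2048,
    ports=(8080,),
    idle_timeout_seconds=600,
    warm_pool_size=10,
)
```

Keeping the warm pool size and idle timeout on the template itself mirrors the idea that operators tune one object rather than managing a separate pool resource.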
75+
76+
### 3. `SandboxClaim` — "give me a copy of that template"
77+
78+
The user-facing request. A `SandboxClaim` says "I want a fresh sandbox based
79+
on template X, with these small overrides." The platform produces a per-claim
80+
`Sandbox` that is a fully independent copy — its own identity, its own
81+
storage, its own lifecycle.
82+
83+
A claim is typically 5–10 lines, short enough for an agent to generate without
84+
consulting documentation. It is the resource an agent platform creates per session,
85+
per user, or per task.
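
To give a sense of how small a claim could be, the sketch below builds one as a plain Python dictionary. The API group `compute.example.com/v1alpha1` and the `templateRef` and `overrides` field names are placeholders invented for this example; the proposal does not specify the real schema.

```python
import json
from typing import Optional


def build_claim(
    name: str,
    template: str,
    namespace: str = "default",
    overrides: Optional[dict] = None,
) -> dict:
    """Build a hypothetical SandboxClaim manifest as a dictionary."""
    claim = {
        "apiVersion": "compute.example.com/v1alpha1",  # assumed, not the real group
        "kind": "SandboxClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"templateRef": {"name": template}},   # assumed field name
    }
    if overrides:
        # Small per-claim adjustments on top of the template defaults.
        claim["spec"]["overrides"] = dict(overrides)
    return claim


if __name__ == "__main__":
    claim = build_claim(
        "session-1234",
        "python-data-science",
        overrides={"idleTimeoutSeconds": 900},
    )
    print(json.dumps(claim, indent=2))
```

An agent platform would create one such object per session, per user, or per task, and then wait for the platform to report the resulting sandbox as ready.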
86+
87+
### Behind the scenes: warm pools
88+
89+
Each `SandboxTemplate` can keep a pool of pre-initialized copies ready to
90+
go. When a claim arrives, the platform hands out a warm copy and refills
91+
the pool in the background. The user sees a sub-second allocation; the
92+
operator sees a tunable knob on the template.
93+
94+
Warm pools are not a separate resource the user or operator has to manage —
95+
they're a property of the template.
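
The hand-out-and-refill behaviour described above can be sketched with a small in-memory model. This is a toy under stated assumptions, not the platform's implementation: the `WarmPool` class, its method names, and the string sandbox IDs are all hypothetical, and the refill that would run in the background is an explicit call here.

```python
from collections import deque
from itertools import count


class WarmPool:
    """Toy model of a per-template pool of pre-initialized sandboxes."""

    def __init__(self, template: str, target_size: int) -> None:
        self.template = template
        self.target_size = target_size
        self._ids = count(1)
        self._warm = deque()
        self.refill()

    def _boot(self) -> str:
        # Stand-in for the expensive step: booting and initializing a copy.
        return f"{self.template}-{next(self._ids)}"

    def refill(self) -> None:
        """Top the pool back up to its target size."""
        while len(self._warm) < self.target_size:
            self._warm.append(self._boot())

    def warm_count(self) -> int:
        return len(self._warm)

    def claim(self) -> tuple[str, bool]:
        """Return (sandbox_id, warm_hit); fall back to a cold boot if empty."""
        if self._warm:
            return self._warm.popleft(), True
        return self._boot(), False


if __name__ == "__main__":
    pool = WarmPool("headless-browser", target_size=2)
    print(pool.claim())   # served from the warm pool
    pool.refill()         # a real system would do this asynchronously
    print(pool.warm_count())
```

The `warm_hit` flag is the kind of signal that could feed a "how often the pool runs dry" metric when sizing the pool.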
96+
97+
## How this fits the existing platform
98+
99+
Datum already has a `Workload` resource for declarative, multi-region,
100+
horizontally scaled applications. `Workload` is and remains the right tool
101+
for production services. Agent sandboxes are a *different* shape of compute:
102+
103+
| | **Workload** | **Agent Sandbox** |
104+
|---|---|---|
105+
| Cardinality | Many replicas across regions | One environment per session |
106+
| Lifetime | Long-running | Minutes to hours, then gone |
107+
| Scaling | Horizontal, automatic | None — each sandbox is its own unit |
108+
| State | Usually external (DB, cache) | Local, persistent across pause |
109+
| Allocation time | Seconds to minutes | Milliseconds (from warm pool) |
110+
| Who creates it | A human, once | An agent, thousands of times |
111+
112+
The two live side-by-side. We are not replacing `Workload`; we are adding
113+
the right primitive for the use case it was never designed for.
114+
115+
As part of this work, the underlying repository is being renamed from
116+
`workload-operator` to **`compute`** to reflect that it now owns
117+
more than one top-level concept on the Datum compute platform.
118+
119+
---
120+
121+
## User journeys
122+
123+
### Journey A — The agent platform (consumer)
124+
125+
**Persona.** Maya is building an AI coding assistant. When a user asks her
126+
agent to "analyze this CSV and plot the results," the agent needs to write
127+
and execute Python in a fresh, isolated environment, then throw it away.
128+
129+
**Today, without agent sandboxes.** Maya stands up a Kubernetes cluster,
130+
writes a custom controller that creates pods per session, figures out how
131+
to give each pod its own storage, builds a queue of pre-warmed pods to
132+
hide cold starts, writes a janitor to clean up dead sessions, and worries
133+
constantly about whether one user's `pip install` can affect another's.
134+
Months of work before her agent runs its first line of code in production.
135+
136+
**With agent sandboxes.**
137+
138+
1. Maya browses the Datum sandbox catalog and picks `python-data-science`.
139+
She reads the one-page description: Python 3.12, pandas, numpy,
140+
matplotlib pre-installed, 2 GB RAM, 10 GB scratch disk, isolated per copy.
141+
2. In her agent code, when a session starts, she creates a `SandboxClaim`
142+
referencing that template. Five lines of YAML, or one API call.
143+
3. Within tens of milliseconds, the claim reports `Ready` with an endpoint
144+
her agent can connect to. The environment is fully initialized — the
145+
Python interpreter is warm, libraries are loaded, ready for the first
146+
command.
147+
4. Her agent uses the sandbox: writing files, executing code, generating
148+
plots. Everything stays inside that one copy.
149+
5. When the user's session ends — or after 15 minutes of inactivity —
150+
the sandbox is deleted. Storage goes with it. No cleanup code on
151+
Maya's side.
152+
6. If Maya wants something not in the catalog, she pushes her own image
153+
and Datum builds a custom template for her in her own namespace. The
154+
per-claim experience is identical.
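
Step 5 above implies a simple deletion rule: a sandbox goes away when its session ends or when it has been idle for 15 minutes. A minimal sketch of that rule follows; the `Session` type and `should_delete` function are hypothetical names for illustration, and the actual platform may evaluate this quite differently.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_SECONDS = 15 * 60  # the 15-minute inactivity window from step 5


@dataclass
class Session:
    """Hypothetical record linking a user session to its sandbox."""

    sandbox: str
    last_activity: float  # seconds since some epoch
    ended: bool = False


def should_delete(
    session: Session,
    now: float,
    idle_timeout: float = IDLE_TIMEOUT_SECONDS,
) -> bool:
    """True if the session has ended or has been idle for the full timeout."""
    idle_for = now - session.last_activity
    return session.ended or idle_for >= idle_timeout


if __name__ == "__main__":
    active = Session(sandbox="python-data-science-7", last_activity=1000.0)
    print(should_delete(active, now=1000.0 + 60))       # one minute idle
    print(should_delete(active, now=1000.0 + 15 * 60))  # fifteen minutes idle
```

Because the platform applies this rule, none of it would need to live in Maya's agent code.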
155+
156+
**What Maya never has to think about:** regions, scaling, image building,
157+
warm pools, cluster sizing, isolation backends, snapshot management,
158+
garbage collection, or the difference between a "container" and a "VM."
159+
160+
### Journey B — The internal team (operator)
161+
162+
**Persona.** Devon is on the Datum platform team. He owns the catalog of
163+
sandbox templates Datum ships to customers. A new team has asked for a
164+
`headless-browser` sandbox for agents that need to scrape and screenshot
165+
web pages.
166+
167+
**With agent sandboxes.**
168+
169+
1. Devon writes a Dockerfile for the headless-browser environment:
170+
Chromium, Playwright, a small HTTP wrapper. Standard stuff.
171+
2. He creates a `SandboxTemplate` in the Datum catalog namespace pointing
172+
at that image. He sets resource sizing, the ports to expose, a default
173+
idle timeout of 10 minutes, and a warm pool size of 10.
174+
3. Datum's build pipeline picks up the new template, builds the image
175+
for the appropriate isolation backend, validates it, and starts the
176+
warm pool. Devon watches the template's status go from `Building` to
177+
`Ready`.
178+
4. Devon runs a few test claims against it, confirms the browser works,
179+
sets the template to `Published`. It now appears in the customer-facing
180+
catalog.
181+
5. A week later, traffic has grown. Devon raises the warm pool from 10
182+
to 50 by editing one field on the template. No customer change needed.
183+
6. A security advisory drops for Chromium. Devon publishes
184+
`headless-browser:1.1` as a new template version. New claims get the
185+
patched version automatically; existing live sandboxes keep running on
186+
the old version until their sessions end. No fleet-wide restart.
187+
7. Datum's billing and observability surfaces show per-template usage:
188+
how many claims, how long they live, how often the warm pool runs dry,
189+
how much storage they consume. Devon uses this to right-size the pool
190+
and report ROI.
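
The version rollout in step 6 can be pictured as follows: new claims resolve to the newest published version, while a live sandbox keeps the version it was created from. The sketch below assumes a simple `(major, minor)` version tuple and a `published` flag; both, along with the type and function names, are illustrative choices rather than the proposal's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateVersion:
    """Hypothetical catalog entry for one version of a template."""

    name: str
    version: tuple[int, int]
    published: bool


def resolve_for_new_claim(catalog: list, name: str) -> TemplateVersion:
    """Pick the newest published version of a template for a new claim."""
    candidates = [v for v in catalog if v.name == name and v.published]
    if not candidates:
        raise LookupError(f"no published version of template {name!r}")
    return max(candidates, key=lambda v: v.version)


if __name__ == "__main__":
    catalog = [TemplateVersion("headless-browser", (1, 0), published=True)]

    # A sandbox created now is pinned to the version resolved at claim time.
    live_sandbox_version = resolve_for_new_claim(catalog, "headless-browser")

    # The patched version is published later.
    catalog.append(TemplateVersion("headless-browser", (1, 1), published=True))

    print(resolve_for_new_claim(catalog, "headless-browser").version)  # new claims
    print(live_sandbox_version.version)  # existing sandbox is unchanged
```

Pinning at claim time is what lets a patch roll out without a fleet-wide restart: old sandboxes simply age out as their sessions end.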
191+
192+
**What Devon never has to think about:** writing a controller, managing
193+
pods, hand-rolling a warm-pool scheduler, building a per-copy storage
194+
system, or coordinating rollouts across regions.
195+
196+
### Journey C — The end customer of the agent (incidental)
197+
198+
**Persona.** Priya is using Maya's coding assistant. She doesn't know what
199+
Datum is and never will.
200+
201+
What she experiences: she asks the agent to do something. The agent
202+
responds in roughly the same time it would take any chatbot. Behind the
203+
scenes, a sandbox was claimed, used, paused, and cleaned up — but to
204+
Priya, it just felt like the assistant worked. Her data didn't leak into
205+
anyone else's session, and the assistant didn't get slower as more people
206+
used it.
207+
208+
That invisible reliability is the actual product.
209+
210+
---
211+
212+
## What success looks like
213+
214+
- **Time-to-first-sandbox** for a new agent platform: under one hour from
215+
signup, with no infrastructure code written.
216+
- **Claim-to-ready latency** against a catalog template: under 50 ms at
217+
the 95th percentile.
218+
- **Idle cost** of a paused sandbox: an order of magnitude lower than
219+
a comparable always-on container.
220+
- **Catalog breadth**: Datum ships at least the top 5 agent runtimes
221+
(Python, Node, code interpreter, headless browser, notebook) in the
222+
initial release, with a clear path for customer-published templates.
223+
- **Operator ergonomics**: a new sandbox template can be added to the
224+
Datum catalog by one engineer in under a day.
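
The claim-to-ready target above is stated as a 95th-percentile bound, so it helps to be concrete about how such a check could be computed. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; the proposal does not say which method would be used, and the function names are illustrative.

```python
import math


def percentile_nearest_rank(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: list, target_ms: float = 50.0, pct: int = 95) -> bool:
    """True if the chosen percentile of claim-to-ready latency is under the target."""
    return percentile_nearest_rank(latencies_ms, pct) < target_ms


if __name__ == "__main__":
    # Made-up measurements: 19 fast allocations and one cold start.
    latencies = [10.0] * 19 + [400.0]
    print(percentile_nearest_rank(latencies, 95))
    print(meets_latency_target(latencies))
```

With this definition a single cold start in twenty claims does not break the target, but two would, which is why warm pool misses matter for this metric.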
225+
226+
## Open product questions
227+
228+
- Which templates ship in the launch catalog, and in what order?
229+
- What is the pricing shape — per claim, per active sandbox-minute,
230+
per warm-pool slot, or some combination?
231+
- Do we expose customer-published templates in v1, or hold them for v2?
232+
- How do we surface template versioning and deprecation to consumers
233+
who may have thousands of live claims at any moment?
234+
235+
## What's *not* in scope
236+
237+
- Replacing `Workload` for long-running, multi-region production services.
238+
- A general-purpose VM or container product. Agent sandboxes are
239+
opinionated on purpose: one image, one copy, one session.
240+
- A development IDE or notebook UI. Datum provides the runtime; the
241+
agent platform or developer tool provides the experience on top.
242+
243+
[upstream]: https://github.com/kubernetes-sigs/agent-sandbox
