Support first-class GPU allocation for ROCK sandboxes #657
Description
Problem
ROCK can start sandbox containers, but GPU access is not currently a first-class concept across the full stack. In practice, users who need GPU-bound execution inside sandboxes have to patch the server-side Docker launch path manually.
This is limiting for:
- agentic evaluation where tools inside the sandbox need CUDA
- GPU-accelerated code execution or tests inside sandboxed repos
- mixed CPU/GPU sandbox fleets
- deterministic per-sandbox GPU assignment in multi-sandbox runs
Requested Capability
Add end-to-end GPU support for sandboxes across:
- SDK request model
- admin API
- runtime deployment layer
- scheduler / placement layer
- operator-specific backends
Proposed API Shape
Examples of the sort of fields that would be useful:
```python
enable_gpu_passthrough: bool
gpu_count: int | None
gpu_device_request: str | None
gpu_allocation_mode: Literal["fixed", "round_robin"]
```
These should ideally be available:
- in the SDK `SandboxConfig`
- in the admin `SandboxStartRequest`
- in deployment config objects
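To make the shape concrete, here is a minimal sketch of how these fields could be grouped on a config object. The class name, defaults, and validation rules are assumptions for illustration, not existing ROCK API:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical container for the proposed fields; names mirror the
# proposal above but nothing here is existing SDK surface.
@dataclass
class SandboxGPUOptions:
    enable_gpu_passthrough: bool = False
    gpu_count: Optional[int] = None            # None = "all" when passthrough is on
    gpu_device_request: Optional[str] = None   # explicit device ids, e.g. "0,2"
    gpu_allocation_mode: Literal["fixed", "round_robin"] = "fixed"

    def validate(self) -> None:
        if not self.enable_gpu_passthrough and (
            self.gpu_count is not None or self.gpu_device_request is not None
        ):
            raise ValueError("GPU fields set but passthrough is disabled")
        if self.gpu_count is not None and self.gpu_device_request is not None:
            raise ValueError("gpu_count and gpu_device_request are mutually exclusive")
```

Grouping the fields in one object keeps the SDK `SandboxConfig` and admin `SandboxStartRequest` in sync as the schema evolves.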
Expected Behavior
- request all GPUs or a specific count
- optionally request explicit device ids
- respect pre-existing `docker_args` / operator overrides
- support deterministic multi-sandbox allocation
- fail clearly when host GPU runtime is unavailable
- expose the effective GPU assignment in sandbox status / logs
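The deterministic multi-sandbox case could be handled by something like the following sketch, which assigns device ids round-robin across a host's GPUs and fails clearly when none are available (function name and shape are illustrative):

```python
# Hypothetical helper: deterministic round-robin device assignment for a
# fleet of sandboxes on a host with `total_gpus` devices.
def assign_gpus_round_robin(num_sandboxes: int, gpus_per_sandbox: int,
                            total_gpus: int) -> list[list[int]]:
    if total_gpus <= 0:
        # "fail clearly when host GPU runtime is unavailable"
        raise RuntimeError("no GPUs available on host")
    assignments = []
    cursor = 0
    for _ in range(num_sandboxes):
        devices = [(cursor + i) % total_gpus for i in range(gpus_per_sandbox)]
        assignments.append(devices)
        cursor = (cursor + gpus_per_sandbox) % total_gpus
    return assignments

# Three 2-GPU sandboxes on a 4-GPU host:
# assign_gpus_round_robin(3, 2, 4) -> [[0, 1], [2, 3], [0, 1]]
```

The returned assignment is what would be surfaced in sandbox status/logs as the "effective GPU assignment".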
Backend Considerations
Docker
- map the request to `docker run --gpus ...`
- set appropriate visibility env vars when the assignment is specific
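A rough sketch of that mapping, assuming the NVIDIA container runtime (the function name and field names follow the proposal; nothing here is existing ROCK code):

```python
# Hypothetical translation of a GPU request into `docker run` arguments.
def build_docker_gpu_args(gpu_count=None, gpu_device_request=None):
    args, env = [], {}
    if gpu_device_request is not None:
        # explicit device ids, e.g. "0,2"
        args += ["--gpus", f'"device={gpu_device_request}"']
        # make the assignment visible inside the container as well
        env["NVIDIA_VISIBLE_DEVICES"] = gpu_device_request
    elif gpu_count is not None:
        args += ["--gpus", str(gpu_count)]
    else:
        args += ["--gpus", "all"]
    return args, env
```

These args would be appended after any pre-existing `docker_args`, so operator overrides still win.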
Ray
- reserve GPU-capable placement resources, not just CPU/memory
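For Ray, the reservation could be expressed as placement-group bundles that include a `GPU` key next to `CPU`. The sketch below builds the plain-dict bundle spec only (helper names are assumptions); in real code the list would be passed to `ray.util.placement_group`:

```python
# Hypothetical helpers: build Ray placement-group bundles that reserve
# GPU capacity alongside CPU, not just CPU/memory.
def gpu_bundle(num_gpus: int, num_cpus: int = 1) -> dict:
    bundle = {"CPU": float(num_cpus)}
    if num_gpus > 0:
        bundle["GPU"] = float(num_gpus)
    return bundle

def sandbox_bundles(n_sandboxes: int, gpus_each: int) -> list[dict]:
    # one bundle per sandbox keeps placement deterministic per sandbox
    return [gpu_bundle(gpus_each) for _ in range(n_sandboxes)]
```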
Kubernetes
- map requests into pod resource requests/limits or template selection
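On Kubernetes with the NVIDIA device plugin, the mapping could look like the sketch below (the helper is hypothetical; `nvidia.com/gpu` is the standard extended resource name, and GPU resources must be declared in `limits`):

```python
# Hypothetical mapping from a gpu_count request to a pod container's
# resources block (NVIDIA device plugin resource name assumed).
def k8s_gpu_resources(gpu_count: int) -> dict:
    if gpu_count <= 0:
        return {}
    gpus = str(gpu_count)
    # GPUs are specified in limits; requests are set to match explicitly.
    return {"limits": {"nvidia.com/gpu": gpus},
            "requests": {"nvidia.com/gpu": gpus}}
```

Alternatively, the deployment layer could select a GPU-enabled pod template when any GPU field is set.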
Why This Matters
Without first-class support, local patches can make GPU sandboxes work in one deployment, but not in a portable or upstreamable way. A supported API would let ROLL and other ROCK users request GPU-capable sandboxes predictably and safely.
Current Workaround
A server-side workaround can be implemented by extending ROCK runtime config and Docker launch logic, but that still leaves the SDK/API and scheduler layers unaware of GPU requirements. That is useful as an interim step, but not a complete solution.