Support first-class GPU allocation for ROCK sandboxes #657

@shamanez

Description

Problem

ROCK can start sandbox containers, but GPU access is not currently a first-class concept across the full stack. In practice, users who need GPU-bound execution inside sandboxes have to patch the server-side Docker launch path manually.

This is limiting for:

  • agentic evaluation where tools inside the sandbox need CUDA
  • GPU-accelerated code execution or tests inside sandboxed repos
  • mixed CPU/GPU sandbox fleets
  • deterministic per-sandbox GPU assignment in multi-sandbox runs

Requested Capability

Add end-to-end GPU support for sandboxes across:

  • SDK request model
  • admin API
  • runtime deployment layer
  • scheduler / placement layer
  • operator-specific backends

Proposed API Shape

Examples of the sort of fields that would be useful:

  • enable_gpu_passthrough: bool
  • gpu_count: int | None
  • gpu_device_request: str | None
  • gpu_allocation_mode: Literal["fixed", "round_robin"]

These should ideally be available:

  • in SDK SandboxConfig
  • in admin SandboxStartRequest
  • in deployment config objects
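
The proposed fields could be sketched as a small config object on the SDK side. This is only a sketch: the field names follow the issue text, but the class name, defaults, and validation rule are assumptions, not existing ROCK API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class SandboxGpuConfig:
    """Hypothetical GPU section of the SDK-side SandboxConfig.

    Field names follow the issue's proposal; defaults and the
    validate() rule are assumptions for illustration.
    """
    enable_gpu_passthrough: bool = False
    gpu_count: Optional[int] = None            # None = "all GPUs" when passthrough is on
    gpu_device_request: Optional[str] = None   # explicit device ids, e.g. "0,2"
    gpu_allocation_mode: Literal["fixed", "round_robin"] = "fixed"

    def validate(self) -> None:
        # GPU detail fields only make sense when passthrough is enabled.
        if not self.enable_gpu_passthrough and (
            self.gpu_count is not None or self.gpu_device_request is not None
        ):
            raise ValueError("GPU fields set but enable_gpu_passthrough is False")
```

The same shape could be mirrored in the admin SandboxStartRequest and deployment config objects so every layer speaks the same vocabulary.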

Expected Behavior

  • request all GPUs or a specific count
  • optionally request explicit device ids
  • respect pre-existing docker_args / operator overrides
  • support deterministic multi-sandbox allocation
  • fail clearly when host GPU runtime is unavailable
  • expose the effective GPU assignment in sandbox status / logs
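
The deterministic multi-sandbox case could work roughly like the following sketch: sandbox ids are sorted so the assignment is stable across runs, devices are handed out cyclically, and an empty GPU list fails loudly. The function name and signature are hypothetical, not part of ROCK.

```python
from typing import Dict, List

def assign_gpus_round_robin(sandbox_ids: List[str],
                            host_gpus: List[int],
                            gpus_per_sandbox: int = 1) -> Dict[str, List[int]]:
    """Deterministically map host GPU ids to sandboxes (hypothetical helper).

    Sorting the sandbox ids makes the mapping reproducible regardless of
    the order sandboxes were started in.
    """
    if not host_gpus:
        # Fail clearly when the host has no usable GPU runtime/devices.
        raise RuntimeError("no GPUs available on host")
    assignment: Dict[str, List[int]] = {}
    cursor = 0
    for sid in sorted(sandbox_ids):
        devices = [host_gpus[(cursor + i) % len(host_gpus)]
                   for i in range(gpus_per_sandbox)]
        assignment[sid] = devices
        cursor = (cursor + gpus_per_sandbox) % len(host_gpus)
    return assignment
```

The resulting mapping is what "expose the effective GPU assignment in sandbox status / logs" would surface.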

Backend Considerations

Docker

  • map request to docker run --gpus ...
  • set appropriate visibility env vars when assignment is specific
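
A minimal sketch of the Docker mapping, assuming the real `docker run --gpus` flag and the conventional `NVIDIA_VISIBLE_DEVICES` variable for the NVIDIA runtime; the function itself is hypothetical:

```python
from typing import List, Optional

def docker_gpu_args(enable: bool,
                    gpu_count: Optional[int] = None,
                    device_request: Optional[str] = None) -> List[str]:
    """Translate a GPU request into `docker run` arguments (sketch)."""
    if not enable:
        return []
    if device_request:
        # Specific devices: pin both the --gpus selector and the
        # visibility env var so tools inside the sandbox agree.
        return ["--gpus", f'"device={device_request}"',
                "-e", f"NVIDIA_VISIBLE_DEVICES={device_request}"]
    if gpu_count is not None:
        return ["--gpus", str(gpu_count)]
    return ["--gpus", "all"]
```

In the launch path these would be merged after any pre-existing `docker_args`, so operator overrides keep precedence.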

Ray

  • reserve GPU-capable placement resources, not just CPU/memory
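
For Ray, the key point is that the GPU must be part of the resource bundle that gets reserved, not an afterthought. A sketch of building such a bundle, using Ray's standard `"CPU"`, `"GPU"`, and `"memory"` resource names (the helper function is hypothetical):

```python
def ray_bundle_for_sandbox(cpus: float, memory_bytes: int, gpus: float) -> dict:
    """Build a resource bundle in the shape Ray placement groups expect (sketch).

    "CPU", "GPU", and "memory" are Ray's standard resource keys; including
    "GPU" here is what actually reserves GPU capacity at placement time.
    """
    bundle = {"CPU": cpus, "memory": memory_bytes}
    if gpus > 0:
        bundle["GPU"] = gpus
    return bundle
```

Such a bundle would then be passed to the placement machinery that ROCK already uses for CPU/memory.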

Kubernetes

  • map requests into pod resource requests/limits or template selection
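
On Kubernetes the mapping could look like the sketch below, assuming the standard NVIDIA device-plugin resource name `nvidia.com/gpu`; GPUs go under `limits` since the device plugin requires requests (if set) to equal limits. The helper is hypothetical:

```python
def k8s_gpu_resources(gpu_count: int) -> dict:
    """Map a GPU request onto Kubernetes container resources (sketch).

    "nvidia.com/gpu" is the standard NVIDIA device-plugin resource name;
    extended resources must be specified under limits.
    """
    if gpu_count <= 0:
        return {}
    return {"limits": {"nvidia.com/gpu": str(gpu_count)}}
```

Alternatively, template selection (a GPU pod template vs. a CPU one) could wrap the same decision at a coarser granularity.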

Why This Matters

Without first-class support, local patches can make GPU sandboxes work in a single deployment, but not in a portable or upstreamable way. A supported API would let ROLL and other ROCK users request GPU-capable sandboxes predictably and safely.

Current Workaround

A server-side workaround can be implemented by extending ROCK runtime config and Docker launch logic, but that still leaves the SDK/API and scheduler layers unaware of GPU requirements. That is useful as an interim step, but not a complete solution.
