Revisit KEP - Warmpool creates Sandboxes vs Pods #390

@tomergee


Following recent work on performance optimization, controller benchmarking, and community PRs:
#162
#375

We would like to re-visit the design choice of transitioning warmpools to create full Sandbox objects rather than raw pods.

This transition is an architectural evolution for agent-sandbox that addresses both the critical performance bottlenecks and the long-term maintainability issues identified in previous iterations.

  1. Architectural Simplification
    Transitioning to "warm sandboxes" effectively refactors the SandboxClaim controller from a complex resource "stitcher" into a streamlined lease negotiator. In the current model, the claim process is bottlenecked by the need to create a Sandbox object on the fly and then coordinate its association with a pre-existing pod, which often triggers secondary reconciliations and increases the surface area for race conditions. Moving this logic to the WarmPool means the SandboxClaim only needs to perform a simple pointer update to a fully ready environment.

  2. Latency and Pre-warming
    The most significant impact is on allocation latency. By pre-warming the entire sandbox environment, including headless network services and storage, the system moves these time-consuming Kubernetes operations out of the user's critical path. Recent benchmarks have demonstrated the potential for sub-second allocation latencies, which is crucial for "Agent-in-the-Loop" use cases where that level of responsiveness is mandatory.

  3. Maintainability and Observability
    Restoring the 1-to-1 naming convention addresses a major pain point in system observability. Currently, mismatched names between adopted pods and their parent sandboxes make it difficult for platform engineers to track resource ownership and have led to bugs where the controller fails to properly watch or reconcile adopted pods.
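To make the "lease negotiator" idea from item 1 concrete, here is a minimal sketch of what a claim could reduce to once the pool holds fully ready sandboxes. It is not the agent-sandbox implementation: `WarmSandbox`, `Claim`, and `ErrNoWarmSandbox` are hypothetical names, and the pool is a plain in-memory slice rather than objects read from the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// WarmSandbox is a hypothetical stand-in for a pre-warmed Sandbox object.
// It is NOT the real agent-sandbox API type.
type WarmSandbox struct {
	Name      string
	Ready     bool
	ClaimedBy string // empty means the sandbox is still unclaimed
}

// ErrNoWarmSandbox is returned when the pool has nothing to hand out.
var ErrNoWarmSandbox = errors.New("no ready, unclaimed sandbox in pool")

// Claim binds the first ready, unclaimed sandbox to claimName.
// The only state change is the ClaimedBy "pointer update"; nothing
// is created or stitched together at claim time.
func Claim(pool []*WarmSandbox, claimName string) (*WarmSandbox, error) {
	for _, sb := range pool {
		if sb.Ready && sb.ClaimedBy == "" {
			sb.ClaimedBy = claimName
			return sb, nil
		}
	}
	return nil, ErrNoWarmSandbox
}

func main() {
	pool := []*WarmSandbox{
		{Name: "warm-a", Ready: false}, // still warming up, skipped
		{Name: "warm-b", Ready: true},
	}
	sb, err := Claim(pool, "claim-1")
	if err != nil {
		fmt.Println("claim failed:", err)
		return
	}
	fmt.Println(sb.Name, "claimed by", sb.ClaimedBy)
}
```

In a real controller the update would have to go through the API server with optimistic concurrency so that two claims cannot take the same sandbox; the sketch leaves that out.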

Consideration of Trade-offs

Storage Costs: Pre-warming PVCs incurs storage costs before they are claimed.
Zonal Locking: Pre-warming a PVC locks the eventual Sandbox to the specific zone where the PV was created.
FQDN Predictability: Because warmpool resources often use random suffixes to avoid collisions, service names (and thus their FQDNs) may be less predictable until claimed.
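
As a small illustration of the FQDN trade-off: a Service's cluster-internal DNS name is derived from its name and namespace, so a randomly suffixed name makes the FQDN unknowable until the sandbox is handed out. The sketch below assumes the standard `<service>.<namespace>.svc.<cluster-domain>` form; `ServiceFQDN`, the `agents` namespace, and the `sandbox-x7k2p` name are hypothetical and used only for illustration.

```go
package main

import "fmt"

// ServiceFQDN builds the cluster-internal DNS name of a Service,
// assuming the standard <service>.<namespace>.svc.<cluster-domain> form.
func ServiceFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

func main() {
	// "sandbox-x7k2p" stands in for a generated name with a random suffix;
	// a client cannot compute this FQDN before the claim reveals the name.
	fmt.Println(ServiceFQDN("sandbox-x7k2p", "agents", "cluster.local"))
}
```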

Overall, the community momentum behind PR #375 suggests that the performance gains outweigh these trade-offs. I would also note that a user who chooses to use warmpools does so deliberately, knowing that the intention is to sacrifice cost and flexibility for higher performance.
