Description
Following recent work on performance optimization, controller benchmarking, and community member PRs:
#162
#375
We would like to re-visit the design choice of transitioning warmpools to create full Sandbox objects rather than raw pods.
This design transition is an effective architectural evolution for agent-sandbox, addressing both critical performance bottlenecks and long-term maintainability issues identified in previous iterations.
Architectural Simplification
Transitioning to "warm sandboxes" effectively refactors the SandboxClaim controller from a complex resource "stitcher" into a streamlined lease negotiator. In the current model, the claim process is bottlenecked by the need to create a Sandbox object on the fly and then coordinate its association with a pre-existing pod, which often triggers secondary reconciliations and increases the surface area for race conditions. Moving this logic to the WarmPool ensures that the SandboxClaim only needs to perform a simple pointer update to a fully ready environment.
Latency and Pre-warming
The most significant impact is on allocation latency. By pre-warming the entire sandbox environment, including headless network services and storage, the system moves these time-consuming Kubernetes operations out of the user’s critical path. This approach has demonstrated the potential for sub-second allocation latencies in recent benchmarks, a crucial improvement for "Agent-in-the-Loop" use cases where sub-second responsiveness is mandatory.
Maintainability and Observability
Restoring the 1-to-1 naming convention addresses a major pain point in system observability. Currently, mismatched names between adopted pods and their parent sandboxes make it difficult for platform engineers to track resource ownership and have led to bugs where the controller fails to properly watch or reconcile adopted pods.
Consideration of Trade-offs
Storage Costs: Pre-warming PVCs incurs storage costs before they are claimed.
Zonal Locking: Pre-warming a PVC locks the eventual Sandbox to the specific zone where the PV was created.
FQDN Predictability: Because warmpool resources often use random suffixes to avoid collisions, service names (and thus their FQDNs) may be less predictable until claimed.
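The FQDN point can be made concrete with a short Go sketch. In-cluster Service DNS names follow the standard Kubernetes form `<service>.<namespace>.svc.<cluster-domain>`, so any random suffix in the service name ends up in the FQDN. The naming scheme below (`warmName`, pool name plus suffix) and the example values are assumptions for illustration, not the project's actual convention.

```go
package main

import "fmt"

// serviceFQDN builds the in-cluster DNS name of a Service using the
// standard Kubernetes form <service>.<namespace>.svc.<cluster-domain>.
func serviceFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

// warmName is a hypothetical naming scheme: the pool name followed by a
// generated suffix that avoids collisions between warm sandboxes.
func warmName(pool, suffix string) string {
	return pool + "-" + suffix
}

func main() {
	// The suffix is not known until the warm sandbox exists, so a client
	// can only learn the FQDN after its claim has been bound.
	name := warmName("python-pool", "x7k2p")
	fmt.Println(serviceFQDN(name, "default", "cluster.local"))
}
```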
Overall, the community momentum behind PR #375 suggests that the performance gains outweigh these trade-offs. I would also note that users who opt into warmpools do so deliberately, knowing that they are trading cost and flexibility for higher performance.