This proposal introduces three enhancements to the existing warm pool:

1. **Dynamic Pool Sizing**: Automatically adjust warm pool size based on historical traffic patterns using a simple prediction algorithm.
2. **Multi-Level Warm Pool**: Implement a Hot/Warm/Cold pool hierarchy:
   - Hot Pool: Fully started, ready to serve immediately (3-5 sandboxes)
   - Warm Pool: Image pulled, container paused, fast startup (10-20 sandboxes)
   - Cold Pool: Only configuration stored, on-demand creation
3. **Sandbox Reuse Strategy**: Reuse sandboxes across sessions by cleaning state instead of destroying them, similar to database connection pooling.

Key Components:

- PoolManager: Unified pool management with Hot/Warm/Cold transitions
- TrafficPredictor: Traffic prediction based on historical patterns
- PoolAutoscaler: Automatic pool size adjustment based on predictions
- Picod cleanup endpoint: State cleanup for sandbox reuse
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the affected files.
**Summary of Changes** (Gemini Code Assist): This pull request proposes a comprehensive enhancement to the existing warm pool system. The primary goal is to significantly improve resource utilization and reduce cold-start latency by introducing dynamic pool sizing based on traffic prediction, a multi-tiered Hot/Warm/Cold pool structure, and a mechanism for reusing sandboxes after session termination through state cleanup. These changes aim to make the system more adaptive, efficient, and performant under varying loads.
Code Review
This pull request introduces a comprehensive and well-structured design proposal for an enhanced warm pool system. The document clearly outlines the motivation, goals, and key features, including a multi-level pool hierarchy, dynamic sizing, and a sandbox reuse strategy. The design is sound, and the phased implementation plan is practical. My feedback focuses on strengthening the security section, clarifying the necessary API changes for the CodeInterpreter CRD, and correcting a few minor typos to improve the document's clarity.
  # Scale-down cooldown
  scaleDownCooldown: 15m
  # Safety margin (prediction * 1.2)
  safetyMargin: 1.2
  # Or static size (mutually exclusive with dynamicSizing)
  # staticSize: 10
```

### Implementation

1. **Traffic Collector**: Collect and store traffic metrics
   - Integrated into the Router for request counting
   - Persists to Redis with a TTL

2. **Traffic Predictor**: Analyze and predict
   - Runs periodically (every 5 minutes)
   - Outputs recommended pool sizes

3. **Pool Autoscaler**: Apply predictions
   - Updates WarmPool and HotPool sizes
   - Respects cooldown periods
   - Scales up/down gradually
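The predictor's exact algorithm is left open above; one minimal sketch of how the configured safety margin, the min/max bounds, and a minimum-sample fallback could combine. All names here (`predictPoolSize`, the parameters) are hypothetical, not from the codebase:

```go
package main

import "fmt"

// predictPoolSize sketches the TrafficPredictor's core calculation:
// average the recent per-window request counts, apply the configured
// safety margin, and clamp to [minSize, maxSize]. With fewer than
// minSamples data points, it falls back to the current pool size.
func predictPoolSize(samples []float64, minSamples int, safetyMargin float64, current, minSize, maxSize int) int {
	if len(samples) < minSamples {
		return current // not enough history: keep the current size
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	avg := sum / float64(len(samples))
	predicted := int(avg*safetyMargin + 0.5) // round to nearest
	if predicted < minSize {
		predicted = minSize
	}
	if predicted > maxSize {
		predicted = maxSize
	}
	return predicted
}

func main() {
	// average of 30 req/window with a 1.2 safety margin -> 36
	fmt.Println(predictPoolSize([]float64{28, 30, 32}, 3, 1.2, 10, 5, 50))
	// too few samples -> keep the current size (10)
	fmt.Println(predictPoolSize([]float64{28}, 3, 1.2, 10, 5, 50))
}
```

The clamp to `maxSize` is what keeps a traffic spike from scaling the pool unboundedly; the `minSamples` fallback avoids reacting to sparse startup data.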
---

## 3. Sandbox Reuse Strategy

### Concept

Instead of destroying sandboxes after a session ends, clean their state and return them to the pool:

```
Session End → State Cleanup → Return to Hot Pool
                   │
                   ├── Clear /workspace directory
                   ├── Reset environment variables
                   ├── Kill user processes
                   └── Clear network state
```

### Benefits

1. **Zero Cold-Start**: Reused sandboxes are immediately available
2. **Resource Efficiency**: Avoid repeated container creation/destruction
3. **Better Performance**: Cached images, warmed-up connections

### State Cleanup Protocol

Picod (inside the sandbox) exposes a cleanup endpoint:

```go
// POST /internal/cleanup
type CleanupRequest struct {
	// Clear all user files
	ClearWorkspace bool
	// Kill all user processes
	KillProcesses bool
	// Reset environment
	ResetEnv bool
	// Clear network connections
	ClearNetwork bool
}

type CleanupResponse struct {
	Success  bool
	Message  string
	Duration time.Duration
}
```
### Implementation

1. **Session End Handler**: In the workload manager
```go
func (s *Server) handleSessionEnd(sessionID string) error {
	// 1. Get sandbox info
	sandbox := s.store.GetSandbox(sessionID)

	// 2. Call cleanup endpoint on picod
	err := s.picodClient.Cleanup(sandbox.Endpoint)
	if err != nil {
		// Fallback: destroy sandbox
		return s.destroySandbox(sandbox)
	}

	// 3. Mark sandbox as available in hot pool
	s.poolManager.ReturnToHotPool(sandbox)

	// 4. Update store
	s.store.MarkSandboxAvailable(sessionID)

	return nil
}
```

2. **Pool Allocation**: Modified sandbox creation
```go
func (s *Server) allocateSandbox(kind, namespace, name string) (*Sandbox, error) {
	// 1. Try hot pool first
	if sandbox := s.hotPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 2. Try warm pool
	if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 3. Create new (cold pool)
	return s.createSandbox(kind, namespace, name)
}
```

3. **Picod Cleanup Implementation**: In `pkg/picod/cleanup.go`
   - Implements the cleanup endpoint
   - Secure process termination
   - Workspace sanitization

### Security Considerations

1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations
---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                              Router                                │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐  │
│  │Traffic Collec│  │Session Manager│  │Request Forwarder        │  │
│  └─────────────┘  └──────────────┘  └─────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                        Workload Manager                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │ Pool Manager │  │Traffic Predic│  │  Sandbox Reuse Manager   │ │
│  │              │◄─┤     tor      │  │                          │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
│         │                                        │                │
│         ▼                                        ▼                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │  Hot Pool    │  │  Warm Pool   │  │   Garbage Collector      │ │
│  │  Controller  │  │  Controller  │  │   (cleanup fallback)     │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                         Kubernetes API                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │SandboxHotPool│  │SandboxWarmPol│  │    SandboxTemplate       │ │
│  │     CRD      │  │     CRD      │  │        CRD               │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
```

## Implementation Plan

### Phase 1: Sandbox Reuse (Week 1-2)
1. Add cleanup endpoint to Picod
2. Implement SessionEndHandler in the workload manager
3. Modify sandbox allocation to check the hot pool first
4. Add configuration flags

### Phase 2: Multi-Level Pool (Week 3-4)
1. Define the SandboxHotPool CRD
2. Implement HotPoolController
3. Create PoolManager for unified pool access
4. Update the CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement the traffic collector in the Router
2. Create TrafficPredictor with a simple algorithm
3. Implement PoolAutoscaler
4. Add monitoring and alerting

## Metrics and Monitoring

New metrics exposed:

```
# Hot Pool metrics
agentcube_hot_pool_size{namespace, name}
agentcube_hot_pool_available{namespace, name}
agentcube_hot_pool_hit_rate{namespace, name}

# Warm Pool metrics
```
This is a great design document. I found a few minor typos that could be fixed to improve clarity:

- Line 194: Trailing spaces in the comment `# Scale-down cooldown`.
- Architecture diagram (lines 324-353): `Traffic Collec` should be `Traffic Collector`; `Traffic Predic`/`tor` should be `Traffic Predictor`; `SandboxWarmPol` should be `SandboxWarmPool`.
- Line 386: Trailing spaces in the comment `# Warm Pool metrics`.
The security considerations for sandbox reuse are well-thought-out. To further strengthen the security posture, it would be beneficial to explicitly mention the strategy of limiting the number of times a sandbox can be reused. This mitigates risks from subtle, long-term state leakage or potential exploits that might persist across cleanups. The maxReuseCount parameter, which is already included in the configuration example, is an excellent implementation of this and should be highlighted in this section.
## Configuration Examples

### Minimal (reuse only)
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    staticSize: 10
  reuse:
    enabled: true
```

### Full featured
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    dynamicSizing:
      enabled: true
      minSize: 5
      maxSize: 50
      predictionWindow: 30m
  hotPool:
    enabled: true
    minSize: 3
    maxSize: 10
    idleTimeout: 5m
  reuse:
    enabled: true
    cleanupTimeout: 30s
    maxReuseCount: 100  # Destroy after N reuses
```
The configuration examples introduce new top-level fields in the CodeInterpreter spec, namely dynamicSizing (within warmPool), hotPool, and reuse. These are significant API changes. The proposal should explicitly state that the CodeInterpreter CRD and its corresponding CodeInterpreterSpec Go type will be updated to include these new configuration sections. This will provide essential clarity for both implementers and users regarding the upcoming API evolution.
**Codecov Report** ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
+ Coverage   35.60%   42.04%   +6.43%
==========================================
  Files          29       30       +1
  Lines        2533     2590      +57
==========================================
+ Hits          902     1089     +187
+ Misses       1505     1369     -136
- Partials      126      132       +6
```
Pull request overview
This PR introduces a comprehensive design proposal for enhancing AgentCube's warm pool functionality. The proposal aims to address current limitations in static pool sizing, single-level pool architecture, and lack of sandbox reuse by introducing three major enhancements: dynamic pool sizing based on traffic prediction, a multi-level hot/warm/cold pool hierarchy, and sandbox reuse through state cleanup mechanisms.
Changes:
- Adds a detailed design document proposing dynamic warm pool sizing with traffic prediction algorithms
- Introduces a multi-level pool architecture (Hot/Warm/Cold) with new SandboxHotPool CRD
- Proposes sandbox reuse strategy with cleanup protocols to reduce cold-start latency
The proposal shows code examples with undefined types and functions like "s.picodClient.Cleanup()", "s.poolManager.ReturnToHotPool()", "s.hotPool.Get()", and "s.warmPool.Get()" that don't exist in the current codebase. While this is expected for a design proposal, the implementation details should clarify these are new components to be created and provide more specifics about their interfaces, particularly how they integrate with the existing controller-runtime patterns used in pkg/workloadmanager.
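One way to pin those interfaces down is a shared `Pool` lookup contract that both controllers implement and the allocator consumes. Everything below is a hypothetical sketch (none of these types exist in the codebase yet), showing only the hot → warm → cold fallback order:

```go
package main

import "fmt"

// Sandbox is a placeholder for the existing sandbox record.
type Sandbox struct{ Name string }

// Pool is the shared lookup interface the hot and warm pool
// controllers would both implement.
type Pool interface {
	Get(namespace, name string) *Sandbox
}

// mapPool is a trivial in-memory Pool used to demonstrate the
// allocation order; Get removes the sandbox it hands out.
type mapPool map[string]*Sandbox

func (p mapPool) Get(namespace, name string) *Sandbox {
	key := namespace + "/" + name
	sb := p[key]
	delete(p, key)
	return sb
}

// allocate mirrors the proposal's allocateSandbox: hot pool first,
// then warm pool, then the cold-path creation.
func allocate(hot, warm Pool, namespace, name string) *Sandbox {
	if sb := hot.Get(namespace, name); sb != nil {
		return sb
	}
	if sb := warm.Get(namespace, name); sb != nil {
		return sb
	}
	return &Sandbox{Name: "fresh-" + name} // cold: create a new sandbox
}

func main() {
	hot := mapPool{"ns/a": {Name: "hot-a"}}
	warm := mapPool{"ns/b": {Name: "warm-b"}}
	fmt.Println(allocate(hot, warm, "ns", "a").Name) // hot hit
	fmt.Println(allocate(hot, warm, "ns", "b").Name) // warm hit
	fmt.Println(allocate(hot, warm, "ns", "c").Name) // cold create
}
```

In the real workload manager, the concrete `Pool` implementations would be backed by the respective CRD controllers rather than an in-memory map.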
The architecture diagram shows "Traffic Collec" (likely "Traffic Collector") as a component in Router, but the implementation plan in Phase 3 mentions "Implement traffic collector in Router" without specifying how this integrates with the existing Gin-based router in pkg/router/server.go. Consider documenting whether this will be a middleware, a separate goroutine, or part of existing request handlers, and how it avoids adding latency to request processing.
```suggestion
1. Implement traffic collector in Router
   - Add a lightweight Gin middleware in `pkg/router/server.go` that attaches to the existing invocation routes and records per-request metadata (namespace, runtime name, route, status code, request/response size, duration).
   - The middleware must perform only cheap, in-memory operations on the hot path: timestamping, basic field extraction, and enqueueing into a buffered channel; it must not perform any blocking I/O or heavy computation.
   - Start a background worker (goroutine) when initializing the Router that reads from this channel and aggregates traffic data into in-memory counters/time-series used by the `TrafficPredictor` and exported as metrics.
   - Ensure the enqueue operation in the middleware is non-blocking (for example, using a `select` with a `default` case when the channel is full) so that traffic collection cannot add latency or backpressure to request processing.
```
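The non-blocking enqueue is the key detail of that suggestion; stripped of the Gin dependency it reduces to the following sketch (type and function names hypothetical):

```go
package main

import "fmt"

// trafficSample is the per-request metadata the middleware would record.
type trafficSample struct {
	Namespace string
	Name      string
}

// record enqueues a sample without ever blocking the request path:
// if the buffer is full, the sample is dropped and false is returned.
func record(ch chan<- trafficSample, s trafficSample) bool {
	select {
	case ch <- s:
		return true
	default:
		return false // buffer full: drop rather than add latency
	}
}

func main() {
	ch := make(chan trafficSample, 2)
	fmt.Println(record(ch, trafficSample{"ns", "a"})) // true
	fmt.Println(record(ch, trafficSample{"ns", "b"})) // true
	fmt.Println(record(ch, trafficSample{"ns", "c"})) // false: buffer full
}
```

Dropping samples under pressure is acceptable here because the predictor works on aggregates; losing a few data points is preferable to slowing down user requests.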
> ```go
>     historyStore HistoryStore
>     // Prediction window
>     predictionWindow time.Duration
>     // Minimum samples for prediction
> ```
The TrafficPredictor struct includes a "minSamples" field, but there's no guidance on what happens when insufficient samples are available (e.g., during initial deployment or after low-traffic periods). Consider documenting the fallback behavior: should it use a conservative default pool size, maintain the current size, or apply a specific bootstrap strategy?
```suggestion
// Minimum samples for prediction. When the available samples in the historyStore
// are below this threshold, the predictor does not attempt to compute a new
// demand estimate. Instead, it falls back to a conservative bootstrap strategy:
// use a safe default or maintain the current pool size and mark any derived
// PredictionResult as low-confidence. This avoids overreacting to sparse or
// startup traffic while the history is still being populated.
```
The security section mentions "State Isolation" and "Audit Logging" as critical requirements, but doesn't provide concrete implementation details or specify how these should be verified. Given the security implications of sandbox reuse (potential cross-session data leakage), this section should include more specific requirements such as: what exactly needs to be cleaned (file handles, shared memory, IPC objects, network sockets), how to verify cleanup completeness, and what audit logs should include (session IDs, cleanup results, any failures).
```suggestion
1. **State Isolation**: Ensure complete cleanup between sessions before a sandbox is returned to any warm/hot pool
   - **Process state**: terminate all user/session processes; only the sandbox supervisor/runtime (e.g. picod) may remain
   - **File descriptors**: close all non-essential file handles (no leftover open files, pipes, or device handles from the previous session)
   - **Filesystem/workspace**:
     - remove all session-specific files and directories from the workspace, including temporary files and scratch data
     - clear any per-session configuration, credentials, and secrets from disk
     - ensure only the base image and expected runtime artifacts remain
   - **IPC and shared memory**: remove/clean up shared memory segments, UNIX domain sockets, named pipes, message queues, and any other IPC objects created by the session
   - **Network state**: close all session-created network sockets and listeners; ensure no ports remain bound on the sandbox interface except those required by the control plane
   - **In-memory state**: clear per-session in-memory caches, environment variables, and any other transient state holding user data or secrets
   - **Verification before reuse**:
     - the cleanup endpoint must run a deterministic set of checks (e.g. no extra processes, no unexpected open FDs, workspace matches the expected baseline)
     - if any verification check fails, the sandbox MUST NOT be returned to a warm/hot pool and SHOULD be destroyed instead
     - the cleanup response MUST indicate success/failure and the reason for any failure, so the workload manager can enforce policy
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
   - enforce CPU and memory limits for the cleanup routine (e.g. via cgroup or microVM limits)
   - limit the maximum workspace size to be scanned/removed per cleanup to avoid pathological cases
   - ensure cleanup does not starve active sandboxes (e.g. cap concurrent cleanups per node)
3. **Timeout**: Maximum cleanup duration (30 seconds)
   - if cleanup does not complete within the timeout, treat it as a failure and destroy the sandbox instead of reusing it
   - timeouts and failures must be surfaced to operators via metrics and logs
4. **Audit Logging**: Log all cleanup operations in both the workload manager and picod where applicable
   - include at minimum: session ID, sandbox ID, namespace/name, runtime kind, and node identifier
   - record timestamps for when cleanup started and completed, and the total duration
   - capture the cleanup result (success/failure), verification status, and whether the sandbox was:
     - returned to the hot/warm pool, or
     - destroyed/terminated due to cleanup failure or timeout
   - log any errors encountered during cleanup (e.g. failure to delete files, kill processes, or close sockets), including enough detail for incident response without leaking user content
   - ensure logs are structured (e.g. JSON) to support automated analysis and correlation across components
```
> ```
> - PeriodicPattern: Time-of-day adjustment based on historical patterns
> - Trend: Linear trend from recent data
> ```
The prediction algorithm mentions "Time-of-day adjustment based on historical patterns" but doesn't explain how this handles timezone considerations or different usage patterns across global deployments. Consider documenting whether traffic patterns are stored/calculated in UTC, local cluster time, or user timezone, and how the system handles multi-region scenarios where different regions may have different traffic patterns.
```suggestion
**Timezone and multi-region semantics**
- All `TrafficSample.Timestamp` values and historical aggregates are stored and processed in UTC.
- Time-of-day buckets used for `PeriodicPattern` are defined in UTC; when needed, a cluster or region may interpret predictions in its configured local timezone by applying the appropriate offset at evaluation time, without mutating stored data.
- Multi-region deployments maintain independent traffic patterns per region/cluster (e.g., keyed by cluster or region identifier), so that differing local peak hours do not get merged into a single global pattern.
- If a deployment needs user-local time semantics, this is modeled via configuration (for example, a per-runtime or per-namespace "local region/timezone"), but the canonical storage and computation layer remains UTC-based.
```
The proposal shows mutually exclusive configuration between "dynamicSizing" and "staticSize" with a comment, but doesn't specify how this mutual exclusivity is enforced. Consider adding a kubebuilder validation annotation or webhook validation to prevent users from accidentally configuring both, or document the precedence rules if both are specified.
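Webhook-side enforcement of that exclusivity could be as small as the function below (the types are hypothetical stand-ins for the eventual CRD Go structs, not the actual API; a kubebuilder CEL `XValidation` marker on the CRD would be the declarative alternative):

```go
package main

import (
	"errors"
	"fmt"
)

// warmPoolSpec is a hypothetical stand-in for the proposed CRD field.
type warmPoolSpec struct {
	StaticSize    *int
	DynamicSizing *struct{ Enabled bool }
}

// validateWarmPool rejects specs that set both sizing modes, mirroring
// the "mutually exclusive" comment in the configuration example.
func validateWarmPool(s warmPoolSpec) error {
	if s.StaticSize != nil && s.DynamicSizing != nil {
		return errors.New("warmPool: staticSize and dynamicSizing are mutually exclusive")
	}
	return nil
}

func main() {
	size := 10
	both := warmPoolSpec{StaticSize: &size, DynamicSizing: &struct{ Enabled bool }{true}}
	fmt.Println(validateWarmPool(both) != nil) // true: rejected
	fmt.Println(validateWarmPool(warmPoolSpec{StaticSize: &size})) // <nil>
}
```

Rejecting the conflict at admission time is preferable to silent precedence rules, because a user who sets both almost certainly misunderstands one of the fields.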
> ```
> HotPool --> WarmPool: Idle timeout
> WarmPool --> ColdPool: Scale down / low traffic
>
> InUse --> WarmPool: Explicit release
> ```
The Pool Transition Flow diagram shows "InUse --> WarmPool: Explicit release" as a possible transition, but the sandbox reuse section states sandboxes return to the Hot Pool after cleanup. This inconsistency needs clarification: under what conditions does a sandbox go to Warm Pool versus Hot Pool after being used? Consider aligning the state diagram with the detailed flow described in the Sandbox Reuse Strategy section.
```suggestion
%% After session cleanup, sandboxes always return to HotPool;
%% demotion to WarmPool happens only via HotPool --> WarmPool.
```
The implementation timeline shows "Week 1-2", "Week 3-4", "Week 5-6" but these phases have dependencies (e.g., Phase 2 needs PoolManager which is used by Phase 1's SessionEndHandler). Consider clarifying whether these are sequential (12 weeks total) or if there's expected parallelization, and document the critical path dependencies between phases.
```suggestion
### Phase 1: Sandbox Reuse (Weeks 1-2)
Phases are primarily sequential (total ~6 elapsed weeks), with limited parallelization where dependencies allow. Work in later phases must not start on items that depend on components introduced in earlier phases (e.g., `PoolManager` and reuse wiring).
1. Create PoolManager for unified pool access (initially targeting the existing warm pool)
2. Add cleanup endpoint to Picod
3. Implement SessionEndHandler in the workload manager using PoolManager
4. Modify sandbox allocation to check the hot/warm pool via PoolManager before provisioning new sandboxes
5. Add configuration flags

### Phase 2: Multi-Level Pool (Weeks 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Integrate HotPoolController with PoolManager for unified multi-level pool access
4. Update CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Weeks 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler leveraging PoolManager and multi-level pool metrics
```
> #### SandboxHotPool CRD
>
> ```yaml
> apiVersion: extensions.agent.x-k8s.io/v1alpha1
> ```
The apiVersion for SandboxHotPool CRD uses "extensions.agent.x-k8s.io/v1alpha1", but based on the codebase, the existing agent-sandbox CRDs (like SandboxWarmPool, SandboxTemplate, and SandboxClaim) use the API group from "sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1". This inconsistency could cause confusion. Consider aligning with the existing agent-sandbox API group pattern, or clarify if this is a new separate API group and explain the reasoning.
```suggestion
apiVersion: sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1
```
The configuration shows "maxReuseCount: 100" to destroy sandboxes after N reuses, but there's no explanation of why this limit is needed or how to choose an appropriate value. From a security and operational perspective, document the reasoning (e.g., prevent memory leaks, ensure fresh state periodically) and provide guidance on tuning this value based on workload characteristics.
**hzxuzhonghu** left a comment:

Thanks for the good input. From my side, I think it is too coarse to get the complete flow. Can you please supplement it first?
> The current warm pool implementation has several limitations:
>
> 1. **Static Pool Size**: `WarmPoolSize` is statically configured, cannot adapt to traffic patterns
> 2. **Single-Level Pool**: Only one warm pool level, no differentiation between ready-to-use and quick-start states

I do not quite understand this.
> ```
> ┌─────────────────────────────────────────────────────────────┐
> │                    Cold Pool (Infinite)                     │
> ```

Suggest not using the "pool" concept here; it causes confusion.
> ```
> ┌─────────────────────────────────────────────────────────────┐
> │                 Warm Pool (10-20 sandboxes)                 │
> │  - Image pulled, container created but paused               │
> ```
>
> 1. **Hot Pool Controller**: New controller in `pkg/workloadmanager/hotpool_controller.go`
>    - Maintains a pool of ready sandboxes
>    - Monitors pool size and creates/destroys as needed
>    - Integrates with traffic predictor

Maybe you can first elaborate all the new concepts before the API design section.
> ```yaml
>   availableReplicas: 3
> ```
>
> ### Pool Transition Flow

For this part, I want to know how the sandbox transitions between WarmPool and HotPool; I want to see this design.
> - Integrates with traffic predictor
>
> 2. **Pool Manager**: New component in `pkg/workloadmanager/pool_manager.go`
>    - Manages transitions between Hot/Warm/Cold pools
>    - Provides unified interface for sandbox allocation
>    - Handles pool overflow/underflow
>
> ---
>
> ## 2. Dynamic Warm Pool Sizing
>
> ### Traffic Prediction
>
> Based on historical data, predict future demand:
>
> ```go
> type TrafficPredictor struct {
>     // Historical data store
>     historyStore HistoryStore
> ```
Please add some more details.

Data-plane traffic can be known only by the Router, but this proposal targets the workload manager.
> ```yaml
>   predictionWindow: 30m
>   # Scale-up cooldown
>   scaleUpCooldown: 5m
>   # Scale-down cooldown
> ```
> ```go
>     // 2. Try warm pool
>     if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
> ```

As the warm pool sandbox is paused, don't you need to resume it first?