This proposal introduces three enhancements to the existing warm pool:

1. **Dynamic Pool Sizing**: Automatically adjust warm pool size based on historical traffic patterns using a simple prediction algorithm.
2. **Multi-Level Warm Pool**: Implement a Hot/Warm/Cold pool hierarchy:
   - Hot Pool: Fully started, ready to serve immediately (3-5 sandboxes)
   - Warm Pool: Image pulled, container paused, fast startup (10-20 sandboxes)
   - Cold Pool: Only configuration stored, on-demand creation
3. **Sandbox Reuse Strategy**: Reuse sandboxes across sessions by cleaning state instead of destroying them, similar to database connection pooling.

Key Components:

- PoolManager: Unified pool management with Hot/Warm/Cold transitions
- TrafficPredictor: Traffic prediction based on historical patterns
- PoolAutoscaler: Automatic pool size adjustment based on predictions
- Picod cleanup endpoint: State cleanup for sandbox reuse
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver for each of the affected files.
**Summary of Changes** (Gemini Code Assist): This pull request proposes a comprehensive enhancement to the existing warm pool system. The primary goal is to significantly improve resource utilization and reduce cold-start latency by introducing dynamic pool sizing based on traffic prediction, a multi-tiered Hot/Warm/Cold pool structure, and a mechanism for reusing sandboxes after session termination through state cleanup. These changes aim to make the system more adaptive, efficient, and performant under varying loads.
Code Review
This pull request introduces a comprehensive and well-structured design proposal for an enhanced warm pool system. The document clearly outlines the motivation, goals, and key features, including a multi-level pool hierarchy, dynamic sizing, and a sandbox reuse strategy. The design is sound, and the phased implementation plan is practical. My feedback focuses on strengthening the security section, clarifying the necessary API changes for the CodeInterpreter CRD, and correcting a few minor typos to improve the document's clarity.
  # Scale-down cooldown
  scaleDownCooldown: 15m
  # Safety margin (prediction * 1.2)
  safetyMargin: 1.2
  # Or static size (mutually exclusive with dynamicSizing)
  # staticSize: 10
```

### Implementation

1. **Traffic Collector**: Collect and store traffic metrics
   - Integrated into the Router for request counting
   - Persists to Redis with a TTL

2. **Traffic Predictor**: Analyze and predict
   - Runs periodically (every 5 minutes)
   - Outputs recommended pool sizes

3. **Pool Autoscaler**: Apply predictions
   - Updates WarmPool and HotPool sizes
   - Respects cooldown periods
   - Scales up/down gradually
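The predictor's exact algorithm is left open above; one minimal sketch of how the configured safety margin, the min/max bounds, and a minimum-sample fallback could combine. All names here (`predictPoolSize`, the parameters) are hypothetical, not from the codebase:

```go
package main

import "fmt"

// predictPoolSize sketches the TrafficPredictor's core calculation:
// average the recent per-window request counts, apply the configured
// safety margin, and clamp to [minSize, maxSize]. With fewer than
// minSamples data points, it falls back to the current pool size.
func predictPoolSize(samples []float64, minSamples int, safetyMargin float64, current, minSize, maxSize int) int {
	if len(samples) < minSamples {
		return current // not enough history: keep the current size
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	avg := sum / float64(len(samples))
	predicted := int(avg*safetyMargin + 0.5) // round to nearest
	if predicted < minSize {
		predicted = minSize
	}
	if predicted > maxSize {
		predicted = maxSize
	}
	return predicted
}

func main() {
	// average of 30 req/window with a 1.2 safety margin -> 36
	fmt.Println(predictPoolSize([]float64{28, 30, 32}, 3, 1.2, 10, 5, 50))
	// too few samples -> keep the current size (10)
	fmt.Println(predictPoolSize([]float64{28}, 3, 1.2, 10, 5, 50))
}
```

The clamp to `maxSize` is what keeps a traffic spike from scaling the pool unboundedly; the `minSamples` fallback avoids reacting to sparse startup data.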
---

## 3. Sandbox Reuse Strategy

### Concept

Instead of destroying sandboxes after a session ends, clean their state and return them to the pool:

```
Session End → State Cleanup → Return to Hot Pool
                   │
                   ├── Clear /workspace directory
                   ├── Reset environment variables
                   ├── Kill user processes
                   └── Clear network state
```

### Benefits

1. **Zero Cold-Start**: Reused sandboxes are immediately available
2. **Resource Efficiency**: Avoid repeated container creation/destruction
3. **Better Performance**: Cached images, warmed-up connections

### State Cleanup Protocol

Picod (inside the sandbox) exposes a cleanup endpoint:

```go
// POST /internal/cleanup
type CleanupRequest struct {
	// Clear all user files
	ClearWorkspace bool
	// Kill all user processes
	KillProcesses bool
	// Reset environment
	ResetEnv bool
	// Clear network connections
	ClearNetwork bool
}

type CleanupResponse struct {
	Success  bool
	Message  string
	Duration time.Duration
}
```
### Implementation

1. **Session End Handler**: In the workload manager
```go
func (s *Server) handleSessionEnd(sessionID string) error {
	// 1. Get sandbox info
	sandbox := s.store.GetSandbox(sessionID)

	// 2. Call cleanup endpoint on picod
	err := s.picodClient.Cleanup(sandbox.Endpoint)
	if err != nil {
		// Fallback: destroy sandbox
		return s.destroySandbox(sandbox)
	}

	// 3. Mark sandbox as available in hot pool
	s.poolManager.ReturnToHotPool(sandbox)

	// 4. Update store
	s.store.MarkSandboxAvailable(sessionID)

	return nil
}
```

2. **Pool Allocation**: Modified sandbox creation
```go
func (s *Server) allocateSandbox(kind, namespace, name string) (*Sandbox, error) {
	// 1. Try hot pool first
	if sandbox := s.hotPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 2. Try warm pool
	if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 3. Create new (cold pool)
	return s.createSandbox(kind, namespace, name)
}
```

3. **Picod Cleanup Implementation**: In `pkg/picod/cleanup.go`
   - Implements the cleanup endpoint
   - Secure process termination
   - Workspace sanitization

### Security Considerations

1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations
---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                              Router                                │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐  │
│  │Traffic Collec│  │Session Manager│  │Request Forwarder        │  │
│  └─────────────┘  └──────────────┘  └─────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                        Workload Manager                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │ Pool Manager │  │Traffic Predic│  │  Sandbox Reuse Manager   │ │
│  │              │◄─┤     tor      │  │                          │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
│         │                                        │                │
│         ▼                                        ▼                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │  Hot Pool    │  │  Warm Pool   │  │   Garbage Collector      │ │
│  │  Controller  │  │  Controller  │  │   (cleanup fallback)     │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌────────────────────────────────────────────────────────────────────┐
│                         Kubernetes API                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │SandboxHotPool│  │SandboxWarmPol│  │    SandboxTemplate       │ │
│  │     CRD      │  │     CRD      │  │        CRD               │ │
│  └──────────────┘  └──────────────┘  └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
```

## Implementation Plan

### Phase 1: Sandbox Reuse (Week 1-2)
1. Add cleanup endpoint to Picod
2. Implement SessionEndHandler in the workload manager
3. Modify sandbox allocation to check the hot pool first
4. Add configuration flags

### Phase 2: Multi-Level Pool (Week 3-4)
1. Define the SandboxHotPool CRD
2. Implement HotPoolController
3. Create PoolManager for unified pool access
4. Update the CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement the traffic collector in the Router
2. Create TrafficPredictor with a simple algorithm
3. Implement PoolAutoscaler
4. Add monitoring and alerting

## Metrics and Monitoring

New metrics exposed:

```
# Hot Pool metrics
agentcube_hot_pool_size{namespace, name}
agentcube_hot_pool_available{namespace, name}
agentcube_hot_pool_hit_rate{namespace, name}

# Warm Pool metrics
```
This is a great design document. I found a few minor typos that could be fixed to improve clarity:

- Line 194: Trailing spaces in the comment `# Scale-down cooldown`.
- Architecture diagram (lines 324-353): `Traffic Collec` should be `Traffic Collector`; `Traffic Predic`/`tor` should be `Traffic Predictor`; `SandboxWarmPol` should be `SandboxWarmPool`.
- Line 386: Trailing spaces in the comment `# Warm Pool metrics`.
The security considerations for sandbox reuse are well-thought-out. To further strengthen the security posture, it would be beneficial to explicitly mention the strategy of limiting the number of times a sandbox can be reused. This mitigates risks from subtle, long-term state leakage or potential exploits that might persist across cleanups. The maxReuseCount parameter, which is already included in the configuration example, is an excellent implementation of this and should be highlighted in this section.
## Configuration Examples

### Minimal (reuse only)
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    staticSize: 10
  reuse:
    enabled: true
```

### Full featured
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    dynamicSizing:
      enabled: true
      minSize: 5
      maxSize: 50
      predictionWindow: 30m
  hotPool:
    enabled: true
    minSize: 3
    maxSize: 10
    idleTimeout: 5m
  reuse:
    enabled: true
    cleanupTimeout: 30s
    maxReuseCount: 100  # Destroy after N reuses
```
The configuration examples introduce new top-level fields in the CodeInterpreter spec, namely dynamicSizing (within warmPool), hotPool, and reuse. These are significant API changes. The proposal should explicitly state that the CodeInterpreter CRD and its corresponding CodeInterpreterSpec Go type will be updated to include these new configuration sections. This will provide essential clarity for both implementers and users regarding the upcoming API evolution.
**Codecov Report** ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
+ Coverage   35.60%   42.04%   +6.43%
==========================================
  Files          29       30       +1
  Lines        2533     2590      +57
==========================================
+ Hits          902     1089     +187
+ Misses       1505     1369     -136
- Partials      126      132       +6
```
Pull request overview
This PR introduces a comprehensive design proposal for enhancing AgentCube's warm pool functionality. The proposal aims to address current limitations in static pool sizing, single-level pool architecture, and lack of sandbox reuse by introducing three major enhancements: dynamic pool sizing based on traffic prediction, a multi-level hot/warm/cold pool hierarchy, and sandbox reuse through state cleanup mechanisms.
Changes:
- Adds a detailed design document proposing dynamic warm pool sizing with traffic prediction algorithms
- Introduces a multi-level pool architecture (Hot/Warm/Cold) with new SandboxHotPool CRD
- Proposes sandbox reuse strategy with cleanup protocols to reduce cold-start latency
The proposal shows code examples with undefined types and functions like "s.picodClient.Cleanup()", "s.poolManager.ReturnToHotPool()", "s.hotPool.Get()", and "s.warmPool.Get()" that don't exist in the current codebase. While this is expected for a design proposal, the implementation details should clarify these are new components to be created and provide more specifics about their interfaces, particularly how they integrate with the existing controller-runtime patterns used in pkg/workloadmanager.
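One way to pin those interfaces down is a shared `Pool` lookup contract that both controllers implement and the allocator consumes. Everything below is a hypothetical sketch (none of these types exist in the codebase yet), showing only the hot → warm → cold fallback order:

```go
package main

import "fmt"

// Sandbox is a placeholder for the existing sandbox record.
type Sandbox struct{ Name string }

// Pool is the shared lookup interface the hot and warm pool
// controllers would both implement.
type Pool interface {
	Get(namespace, name string) *Sandbox
}

// mapPool is a trivial in-memory Pool used to demonstrate the
// allocation order; Get removes the sandbox it hands out.
type mapPool map[string]*Sandbox

func (p mapPool) Get(namespace, name string) *Sandbox {
	key := namespace + "/" + name
	sb := p[key]
	delete(p, key)
	return sb
}

// allocate mirrors the proposal's allocateSandbox: hot pool first,
// then warm pool, then the cold-path creation.
func allocate(hot, warm Pool, namespace, name string) *Sandbox {
	if sb := hot.Get(namespace, name); sb != nil {
		return sb
	}
	if sb := warm.Get(namespace, name); sb != nil {
		return sb
	}
	return &Sandbox{Name: "fresh-" + name} // cold: create a new sandbox
}

func main() {
	hot := mapPool{"ns/a": {Name: "hot-a"}}
	warm := mapPool{"ns/b": {Name: "warm-b"}}
	fmt.Println(allocate(hot, warm, "ns", "a").Name) // hot hit
	fmt.Println(allocate(hot, warm, "ns", "b").Name) // warm hit
	fmt.Println(allocate(hot, warm, "ns", "c").Name) // cold create
}
```

In the real workload manager, the concrete `Pool` implementations would be backed by the respective CRD controllers rather than an in-memory map.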
The architecture diagram shows "Traffic Collec" (likely "Traffic Collector") as a component in Router, but the implementation plan in Phase 3 mentions "Implement traffic collector in Router" without specifying how this integrates with the existing Gin-based router in pkg/router/server.go. Consider documenting whether this will be a middleware, a separate goroutine, or part of existing request handlers, and how it avoids adding latency to request processing.
```suggestion
1. Implement traffic collector in Router
   - Add a lightweight Gin middleware in `pkg/router/server.go` that attaches to the existing invocation routes and records per-request metadata (namespace, runtime name, route, status code, request/response size, duration).
   - The middleware must perform only cheap, in-memory operations on the hot path: timestamping, basic field extraction, and enqueueing into a buffered channel; it must not perform any blocking I/O or heavy computation.
   - Start a background worker (goroutine) when initializing the Router that reads from this channel and aggregates traffic data into in-memory counters/time-series used by the `TrafficPredictor` and exported as metrics.
   - Ensure the enqueue operation in the middleware is non-blocking (for example, using a `select` with a `default` case when the channel is full) so that traffic collection cannot add latency or backpressure to request processing.
```
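The non-blocking enqueue is the key detail of that suggestion; stripped of the Gin dependency it reduces to the following sketch (type and function names hypothetical):

```go
package main

import "fmt"

// trafficSample is the per-request metadata the middleware would record.
type trafficSample struct {
	Namespace string
	Name      string
}

// record enqueues a sample without ever blocking the request path:
// if the buffer is full, the sample is dropped and false is returned.
func record(ch chan<- trafficSample, s trafficSample) bool {
	select {
	case ch <- s:
		return true
	default:
		return false // buffer full: drop rather than add latency
	}
}

func main() {
	ch := make(chan trafficSample, 2)
	fmt.Println(record(ch, trafficSample{"ns", "a"})) // true
	fmt.Println(record(ch, trafficSample{"ns", "b"})) // true
	fmt.Println(record(ch, trafficSample{"ns", "c"})) // false: buffer full
}
```

Dropping samples under pressure is acceptable here because the predictor works on aggregates; losing a few data points is preferable to slowing down user requests.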
> ```go
>     historyStore HistoryStore
>     // Prediction window
>     predictionWindow time.Duration
>     // Minimum samples for prediction
> ```
The TrafficPredictor struct includes a "minSamples" field, but there's no guidance on what happens when insufficient samples are available (e.g., during initial deployment or after low-traffic periods). Consider documenting the fallback behavior: should it use a conservative default pool size, maintain the current size, or apply a specific bootstrap strategy?
```suggestion
// Minimum samples for prediction. When the available samples in the historyStore
// are below this threshold, the predictor does not attempt to compute a new
// demand estimate. Instead, it falls back to a conservative bootstrap strategy:
// use a safe default or maintain the current pool size and mark any derived
// PredictionResult as low-confidence. This avoids overreacting to sparse or
// startup traffic while the history is still being populated.
```
The security section mentions "State Isolation" and "Audit Logging" as critical requirements, but doesn't provide concrete implementation details or specify how these should be verified. Given the security implications of sandbox reuse (potential cross-session data leakage), this section should include more specific requirements such as: what exactly needs to be cleaned (file handles, shared memory, IPC objects, network sockets), how to verify cleanup completeness, and what audit logs should include (session IDs, cleanup results, any failures).
```suggestion
1. **State Isolation**: Ensure complete cleanup between sessions before a sandbox is returned to any warm/hot pool
   - **Process state**: terminate all user/session processes; only the sandbox supervisor/runtime (e.g. picod) may remain
   - **File descriptors**: close all non-essential file handles (no leftover open files, pipes, or device handles from the previous session)
   - **Filesystem/workspace**:
     - remove all session-specific files and directories from the workspace, including temporary files and scratch data
     - clear any per-session configuration, credentials, and secrets from disk
     - ensure only the base image and expected runtime artifacts remain
   - **IPC and shared memory**: remove/clean up shared memory segments, UNIX domain sockets, named pipes, message queues, and any other IPC objects created by the session
   - **Network state**: close all session-created network sockets and listeners; ensure no ports remain bound on the sandbox interface except those required by the control plane
   - **In-memory state**: clear per-session in-memory caches, environment variables, and any other transient state holding user data or secrets
   - **Verification before reuse**:
     - the cleanup endpoint must run a deterministic set of checks (e.g. no extra processes, no unexpected open FDs, workspace matches the expected baseline)
     - if any verification check fails, the sandbox MUST NOT be returned to a warm/hot pool and SHOULD be destroyed instead
     - the cleanup response MUST indicate success/failure and the reason for any failure, so the workload manager can enforce policy
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
   - enforce CPU and memory limits for the cleanup routine (e.g. via cgroup or microVM limits)
   - limit the maximum workspace size to be scanned/removed per cleanup to avoid pathological cases
   - ensure cleanup does not starve active sandboxes (e.g. cap concurrent cleanups per node)
3. **Timeout**: Maximum cleanup duration (30 seconds)
   - if cleanup does not complete within the timeout, treat it as a failure and destroy the sandbox instead of reusing it
   - timeouts and failures must be surfaced to operators via metrics and logs
4. **Audit Logging**: Log all cleanup operations in both the workload manager and picod where applicable
   - include at minimum: session ID, sandbox ID, namespace/name, runtime kind, and node identifier
   - record timestamps for when cleanup started and completed, and the total duration
   - capture the cleanup result (success/failure), verification status, and whether the sandbox was:
     - returned to the hot/warm pool, or
     - destroyed/terminated due to cleanup failure or timeout
   - log any errors encountered during cleanup (e.g. failure to delete files, kill processes, or close sockets), including enough detail for incident response without leaking user content
   - ensure logs are structured (e.g. JSON) to support automated analysis and correlation across components
```
> ```
> - PeriodicPattern: Time-of-day adjustment based on historical patterns
> - Trend: Linear trend from recent data
> ```
The prediction algorithm mentions "Time-of-day adjustment based on historical patterns" but doesn't explain how this handles timezone considerations or different usage patterns across global deployments. Consider documenting whether traffic patterns are stored/calculated in UTC, local cluster time, or user timezone, and how the system handles multi-region scenarios where different regions may have different traffic patterns.
```suggestion
**Timezone and multi-region semantics**
- All `TrafficSample.Timestamp` values and historical aggregates are stored and processed in UTC.
- Time-of-day buckets used for `PeriodicPattern` are defined in UTC; when needed, a cluster or region may interpret predictions in its configured local timezone by applying the appropriate offset at evaluation time, without mutating stored data.
- Multi-region deployments maintain independent traffic patterns per region/cluster (e.g., keyed by cluster or region identifier), so that differing local peak hours do not get merged into a single global pattern.
- If a deployment needs user-local time semantics, this is modeled via configuration (for example, a per-runtime or per-namespace "local region/timezone"), but the canonical storage and computation layer remains UTC-based.
```
The proposal shows mutually exclusive configuration between "dynamicSizing" and "staticSize" with a comment, but doesn't specify how this mutual exclusivity is enforced. Consider adding a kubebuilder validation annotation or webhook validation to prevent users from accidentally configuring both, or document the precedence rules if both are specified.
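Webhook-side enforcement of that exclusivity could be as small as the function below (the types are hypothetical stand-ins for the eventual CRD Go structs, not the actual API; a kubebuilder CEL `XValidation` marker on the CRD would be the declarative alternative):

```go
package main

import (
	"errors"
	"fmt"
)

// warmPoolSpec is a hypothetical stand-in for the proposed CRD field.
type warmPoolSpec struct {
	StaticSize    *int
	DynamicSizing *struct{ Enabled bool }
}

// validateWarmPool rejects specs that set both sizing modes, mirroring
// the "mutually exclusive" comment in the configuration example.
func validateWarmPool(s warmPoolSpec) error {
	if s.StaticSize != nil && s.DynamicSizing != nil {
		return errors.New("warmPool: staticSize and dynamicSizing are mutually exclusive")
	}
	return nil
}

func main() {
	size := 10
	both := warmPoolSpec{StaticSize: &size, DynamicSizing: &struct{ Enabled bool }{true}}
	fmt.Println(validateWarmPool(both) != nil) // true: rejected
	fmt.Println(validateWarmPool(warmPoolSpec{StaticSize: &size})) // <nil>
}
```

Rejecting the conflict at admission time is preferable to silent precedence rules, because a user who sets both almost certainly misunderstands one of the fields.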
> ```
> HotPool --> WarmPool: Idle timeout
> WarmPool --> ColdPool: Scale down / low traffic
>
> InUse --> WarmPool: Explicit release
> ```
The Pool Transition Flow diagram shows "InUse --> WarmPool: Explicit release" as a possible transition, but the sandbox reuse section states sandboxes return to the Hot Pool after cleanup. This inconsistency needs clarification: under what conditions does a sandbox go to Warm Pool versus Hot Pool after being used? Consider aligning the state diagram with the detailed flow described in the Sandbox Reuse Strategy section.
```suggestion
%% After session cleanup, sandboxes always return to HotPool;
%% demotion to WarmPool happens only via HotPool --> WarmPool.
```
The implementation timeline shows "Week 1-2", "Week 3-4", "Week 5-6" but these phases have dependencies (e.g., Phase 2 needs PoolManager which is used by Phase 1's SessionEndHandler). Consider clarifying whether these are sequential (12 weeks total) or if there's expected parallelization, and document the critical path dependencies between phases.
```suggestion
### Phase 1: Sandbox Reuse (Weeks 1-2)
Phases are primarily sequential (total ~6 elapsed weeks), with limited parallelization where dependencies allow. Work in later phases must not start on items that depend on components introduced in earlier phases (e.g., `PoolManager` and reuse wiring).
1. Create PoolManager for unified pool access (initially targeting the existing warm pool)
2. Add cleanup endpoint to Picod
3. Implement SessionEndHandler in the workload manager using PoolManager
4. Modify sandbox allocation to check the hot/warm pool via PoolManager before provisioning new sandboxes
5. Add configuration flags

### Phase 2: Multi-Level Pool (Weeks 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Integrate HotPoolController with PoolManager for unified multi-level pool access
4. Update CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Weeks 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler leveraging PoolManager and multi-level pool metrics
```
> #### SandboxHotPool CRD
>
> ```yaml
> apiVersion: extensions.agent.x-k8s.io/v1alpha1
> ```
The apiVersion for SandboxHotPool CRD uses "extensions.agent.x-k8s.io/v1alpha1", but based on the codebase, the existing agent-sandbox CRDs (like SandboxWarmPool, SandboxTemplate, and SandboxClaim) use the API group from "sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1". This inconsistency could cause confusion. Consider aligning with the existing agent-sandbox API group pattern, or clarify if this is a new separate API group and explain the reasoning.
```suggestion
apiVersion: sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1
```
The configuration shows "maxReuseCount: 100" to destroy sandboxes after N reuses, but there's no explanation of why this limit is needed or how to choose an appropriate value. From a security and operational perspective, document the reasoning (e.g., prevent memory leaks, ensure fresh state periodically) and provide guidance on tuning this value based on workload characteristics.
**hzxuzhonghu** left a comment:

Thanks for the good input. From my side, I think it is too coarse to get the complete flow. Can you please supplement it first?
> The current warm pool implementation has several limitations:
>
> 1. **Static Pool Size**: `WarmPoolSize` is statically configured, cannot adapt to traffic patterns
> 2. **Single-Level Pool**: Only one warm pool level, no differentiation between ready-to-use and quick-start states

I do not quite understand this.
> ```
> ┌─────────────────────────────────────────────────────────────┐
> │                    Cold Pool (Infinite)                     │
> ```

Suggest not using the "pool" concept here; it causes confusion.
> ```
> ┌─────────────────────────────────────────────────────────────┐
> │                 Warm Pool (10-20 sandboxes)                 │
> │  - Image pulled, container created but paused               │
> ```
>
> 1. **Hot Pool Controller**: New controller in `pkg/workloadmanager/hotpool_controller.go`
>    - Maintains a pool of ready sandboxes
>    - Monitors pool size and creates/destroys as needed
>    - Integrates with traffic predictor

Maybe you can first elaborate all the new concepts before the API design section.
> ```yaml
>   availableReplicas: 3
> ```
>
> ### Pool Transition Flow

For this part, I want to know how the sandbox transitions between WarmPool and HotPool; I want to see this design.
> - Integrates with traffic predictor
>
> 2. **Pool Manager**: New component in `pkg/workloadmanager/pool_manager.go`
>    - Manages transitions between Hot/Warm/Cold pools
>    - Provides unified interface for sandbox allocation
>    - Handles pool overflow/underflow
>
> ---
>
> ## 2. Dynamic Warm Pool Sizing
>
> ### Traffic Prediction
>
> Based on historical data, predict future demand:
>
> ```go
> type TrafficPredictor struct {
>     // Historical data store
>     historyStore HistoryStore
> ```
Please add some more details.

Data-plane traffic can be known only by the Router, but this proposal targets the workload manager.
> ```yaml
>   predictionWindow: 30m
>   # Scale-up cooldown
>   scaleUpCooldown: 5m
>   # Scale-down cooldown
> ```
> ```go
>     // 2. Try warm pool
>     if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
> ```

As the warm pool sandbox is paused, don't you need to resume it first?