
proposal: Enhanced Warm Pool Design #224

Open
Tweakzx wants to merge 1 commit into volcano-sh:main from Tweakzx:proposal/enhanced-warm-pool

Conversation


@Tweakzx Tweakzx commented Feb 24, 2026

This proposal introduces three enhancements to the existing warm pool:

  1. Dynamic Pool Sizing: Automatically adjust warm pool size based on historical traffic patterns using a simple prediction algorithm.

  2. Multi-Level Warm Pool: Implement Hot/Warm/Cold pool hierarchy:

    • Hot Pool: Fully started, ready to serve immediately (3-5 sandboxes)
    • Warm Pool: Image pulled, container paused, fast startup (10-20 sandboxes)
    • Cold Pool: Only configuration stored, on-demand creation
  3. Sandbox Reuse Strategy: Reuse sandboxes across sessions by cleaning state instead of destroying, similar to database connection pooling.

Key Components:

  • PoolManager: Unified pool management with Hot/Warm/Cold transitions
  • TrafficPredictor: Traffic prediction based on historical patterns
  • PoolAutoscaler: Automatic pool size adjustment based on predictions
  • Picod cleanup endpoint: State cleanup for sandbox reuse


Copilot AI review requested due to automatic review settings February 24, 2026 06:43
@volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist

Summary of Changes

Hello @Tweakzx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request proposes a comprehensive enhancement to the existing warm pool system. The primary goal is to significantly improve resource utilization and reduce cold-start latencies by introducing dynamic pool sizing based on traffic predictions, a multi-tiered Hot/Warm/Cold pool structure, and a mechanism for reusing sandboxes after session termination through state cleanup. These changes aim to make the system more adaptive, efficient, and performant under varying loads.

Highlights

  • Dynamic Pool Sizing: Introduced automatic adjustment of warm pool size based on historical traffic patterns using a simple prediction algorithm, replacing static configurations.
  • Multi-Level Warm Pool: Implemented a Hot/Warm/Cold pool hierarchy to optimize resource usage and reduce latency, featuring fully started 'Hot' sandboxes, paused 'Warm' sandboxes, and configuration-only 'Cold' sandboxes.
  • Sandbox Reuse Strategy: Enabled the reuse of sandboxes across sessions by cleaning their state instead of destroying them, similar to database connection pooling, to achieve zero cold-start latency and improve resource efficiency.


Changelog
  • docs/design/enhanced-warm-pool-design.md
    • Added a detailed design document outlining the motivation, goals, non-goals, API design, implementation details, architecture overview, implementation plan, metrics, and configuration examples for the enhanced warm pool.
Activity
  • No human activity has been recorded on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive and well-structured design proposal for an enhanced warm pool system. The document clearly outlines the motivation, goals, and key features, including a multi-level pool hierarchy, dynamic sizing, and a sandbox reuse strategy. The design is sound, and the phased implementation plan is practical. My feedback focuses on strengthening the security section, clarifying the necessary API changes for the CodeInterpreter CRD, and correcting a few minor typos to improve the document's clarity.

Comment on lines +194 to +386
```yaml
# Scale-down cooldown
scaleDownCooldown: 15m
# Safety margin (predict * 1.2)
safetyMargin: 1.2
# Or static size (mutually exclusive)
# staticSize: 10
```

### Implementation

1. **Traffic Collector**: Collect and store traffic metrics
- Integrated into Router for request counting
- Persists to Redis with TTL

2. **Traffic Predictor**: Analyze and predict
- Runs periodically (every 5 minutes)
- Outputs recommended pool sizes

3. **Pool Autoscaler**: Apply predictions
- Updates WarmPool and HotPool sizes
- Respects cooldown periods
- Gradual scale-up/down

---

## 3. Sandbox Reuse Strategy

### Concept

Instead of destroying sandboxes after session ends, clean their state and return to pool:

```
Session End → State Cleanup → Return to Hot Pool
├── Clear /workspace directory
├── Reset environment variables
├── Kill user processes
└── Clear network state
```

### Benefits

1. **Zero Cold-Start**: Reused sandboxes are immediately available
2. **Resource Efficiency**: Avoid repeated container creation/destruction
3. **Better Performance**: Cached images, warmed up connections

### State Cleanup Protocol

Picod (inside sandbox) exposes cleanup endpoint:

```go
// POST /internal/cleanup
type CleanupRequest struct {
	// Clear all user files
	ClearWorkspace bool
	// Kill all user processes
	KillProcesses bool
	// Reset environment
	ResetEnv bool
	// Clear network connections
	ClearNetwork bool
}

type CleanupResponse struct {
	Success  bool
	Message  string
	Duration time.Duration
}
```

### Implementation

1. **Session End Handler**: In workload manager
```go
func (s *Server) handleSessionEnd(sessionID string) error {
	// 1. Get sandbox info
	sandbox := s.store.GetSandbox(sessionID)

	// 2. Call cleanup endpoint on picod
	err := s.picodClient.Cleanup(sandbox.Endpoint)
	if err != nil {
		// Fallback: destroy sandbox
		return s.destroySandbox(sandbox)
	}

	// 3. Mark sandbox as available in hot pool
	s.poolManager.ReturnToHotPool(sandbox)

	// 4. Update store
	s.store.MarkSandboxAvailable(sessionID)

	return nil
}
```

2. **Pool Allocation**: Modified sandbox creation
```go
func (s *Server) allocateSandbox(kind, namespace, name string) (*Sandbox, error) {
	// 1. Try hot pool first
	if sandbox := s.hotPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 2. Try warm pool
	if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 3. Create new (cold pool)
	return s.createSandbox(kind, namespace, name)
}
```

3. **Picod Cleanup Implementation**: In `pkg/picod/cleanup.go`
- Implements cleanup endpoint
- Secure process termination
- Workspace sanitization

### Security Considerations

1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations

---

## Architecture Overview

```
┌────────────────────────────────────────────────────────────────────┐
│ Router │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │
│ │Traffic Collec│ │Session Manager│ │Request Forwarder │ │
│ └─────────────┘ └──────────────┘ └─────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Workload Manager │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Pool Manager │ │Traffic Predic│ │ Sandbox Reuse Manager │ │
│ │ │◄─┤ tor │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │ Hot Pool │ │ Warm Pool │ │ Garbage Collector │ │
│ │ Controller │ │ Controller │ │ (cleanup fallback) │ │
│ └──────────────┘ └──────────────┘ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Kubernetes API │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────────┐ │
│ │SandboxHotPool│ │SandboxWarmPol│ │ SandboxTemplate │ │
│ │ CRD │ │ CRD │ │ CRD │ │
│ └──────────────┘ └──────────────┘ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
```

## Implementation Plan

### Phase 1: Sandbox Reuse (Week 1-2)
1. Add cleanup endpoint to Picod
2. Implement SessionEndHandler in workload manager
3. Modify sandbox allocation to check hot pool first
4. Add configuration flags

### Phase 2: Multi-Level Pool (Week 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Create PoolManager for unified pool access
4. Update CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler
4. Add monitoring and alerting

## Metrics and Monitoring

New metrics exposed:

```
# Hot Pool metrics
agentcube_hot_pool_size{namespace, name}
agentcube_hot_pool_available{namespace, name}
agentcube_hot_pool_hit_rate{namespace, name}

# Warm Pool metrics
```

medium

This is a great design document. I found a few minor typos that could be fixed to improve clarity:

  • Line 194: Trailing spaces in the comment: # Scale-down cooldown
  • Architecture Diagram (Lines 324-353):
    • Traffic Collec should be Traffic Collector.
    • Traffic Predic and tor should be Traffic Predictor.
    • SandboxWarmPol should be SandboxWarmPool.
  • Line 386: Trailing spaces in the comment: # Warm Pool metrics

Comment on lines +314 to +317
1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations

medium

The security considerations for sandbox reuse are well-thought-out. To further strengthen the security posture, it would be beneficial to explicitly mention the strategy of limiting the number of times a sandbox can be reused. This mitigates risks from subtle, long-term state leakage or potential exploits that might persist across cleanups. The maxReuseCount parameter, which is already included in the configuration example, is an excellent implementation of this and should be highlighted in this section.

Comment on lines +401 to +438
## Configuration Examples

### Minimal (reuse only)
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    staticSize: 10
  reuse:
    enabled: true
```

### Full featured
```yaml
apiVersion: runtime.agentcube.volcano.sh/v1alpha1
kind: CodeInterpreter
metadata:
  name: my-interpreter
spec:
  warmPool:
    dynamicSizing:
      enabled: true
      minSize: 5
      maxSize: 50
      predictionWindow: 30m
  hotPool:
    enabled: true
    minSize: 3
    maxSize: 10
    idleTimeout: 5m
  reuse:
    enabled: true
    cleanupTimeout: 30s
    maxReuseCount: 100 # Destroy after N reuses
```

medium

The configuration examples introduce new top-level fields in the CodeInterpreter spec, namely dynamicSizing (within warmPool), hotPool, and reuse. These are significant API changes. The proposal should explicitly state that the CodeInterpreter CRD and its corresponding CodeInterpreterSpec Go type will be updated to include these new configuration sections. This will provide essential clarity for both implementers and users regarding the upcoming API evolution.
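One possible shape for those new spec sections is sketched below. This is illustrative only: the type and field names are assumptions that mirror the YAML examples, not the actual CRD Go types.

```go
package main

import "fmt"

// Hypothetical Go types mirroring the YAML configuration examples.
// Durations are shown as strings (e.g. "30m") for brevity; real CRD types
// would likely use metav1.Duration.
type DynamicSizingSpec struct {
	Enabled          bool   `json:"enabled"`
	MinSize          int32  `json:"minSize,omitempty"`
	MaxSize          int32  `json:"maxSize,omitempty"`
	PredictionWindow string `json:"predictionWindow,omitempty"` // e.g. "30m"
}

type WarmPoolSpec struct {
	// dynamicSizing and staticSize are intended to be mutually exclusive.
	DynamicSizing *DynamicSizingSpec `json:"dynamicSizing,omitempty"`
	StaticSize    *int32             `json:"staticSize,omitempty"`
}

type HotPoolSpec struct {
	Enabled     bool   `json:"enabled"`
	MinSize     int32  `json:"minSize,omitempty"`
	MaxSize     int32  `json:"maxSize,omitempty"`
	IdleTimeout string `json:"idleTimeout,omitempty"` // e.g. "5m"
}

type ReuseSpec struct {
	Enabled        bool   `json:"enabled"`
	CleanupTimeout string `json:"cleanupTimeout,omitempty"` // e.g. "30s"
	MaxReuseCount  int32  `json:"maxReuseCount,omitempty"`
}

// CodeInterpreterSpec sketches only the proposed additions.
type CodeInterpreterSpec struct {
	WarmPool *WarmPoolSpec `json:"warmPool,omitempty"`
	HotPool  *HotPoolSpec  `json:"hotPool,omitempty"`
	Reuse    *ReuseSpec    `json:"reuse,omitempty"`
}

func main() {
	size := int32(10)
	spec := CodeInterpreterSpec{
		WarmPool: &WarmPoolSpec{StaticSize: &size},
		Reuse:    &ReuseSpec{Enabled: true},
	}
	fmt.Println(*spec.WarmPool.StaticSize, spec.Reuse.Enabled)
}
```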

@codecov-commenter

codecov-commenter commented Feb 24, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.04%. Comparing base (845b798) to head (2bf7c02).
⚠️ Report is 129 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
+ Coverage   35.60%   42.04%   +6.43%     
==========================================
  Files          29       30       +1     
  Lines        2533     2590      +57     
==========================================
+ Hits          902     1089     +187     
+ Misses       1505     1369     -136     
- Partials      126      132       +6     
| Flag | Coverage Δ |
|------|------------|
| unittests | 42.04% <ø> (+6.43%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive design proposal for enhancing AgentCube's warm pool functionality. The proposal aims to address current limitations in static pool sizing, single-level pool architecture, and lack of sandbox reuse by introducing three major enhancements: dynamic pool sizing based on traffic prediction, a multi-level hot/warm/cold pool hierarchy, and sandbox reuse through state cleanup mechanisms.

Changes:

  • Adds a detailed design document proposing dynamic warm pool sizing with traffic prediction algorithms
  • Introduces a multi-level pool architecture (Hot/Warm/Cold) with new SandboxHotPool CRD
  • Proposes sandbox reuse strategy with cleanup protocols to reduce cold-start latency

Comment on lines +266 to +305
1. **Session End Handler**: In workload manager
```go
func (s *Server) handleSessionEnd(sessionID string) error {
	// 1. Get sandbox info
	sandbox := s.store.GetSandbox(sessionID)

	// 2. Call cleanup endpoint on picod
	err := s.picodClient.Cleanup(sandbox.Endpoint)
	if err != nil {
		// Fallback: destroy sandbox
		return s.destroySandbox(sandbox)
	}

	// 3. Mark sandbox as available in hot pool
	s.poolManager.ReturnToHotPool(sandbox)

	// 4. Update store
	s.store.MarkSandboxAvailable(sessionID)

	return nil
}
```

2. **Pool Allocation**: Modified sandbox creation
```go
func (s *Server) allocateSandbox(kind, namespace, name string) (*Sandbox, error) {
	// 1. Try hot pool first
	if sandbox := s.hotPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 2. Try warm pool
	if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
		return sandbox, nil
	}

	// 3. Create new (cold pool)
	return s.createSandbox(kind, namespace, name)
}
```
Copilot AI Feb 24, 2026

The proposal shows code examples with undefined types and functions like "s.picodClient.Cleanup()", "s.poolManager.ReturnToHotPool()", "s.hotPool.Get()", and "s.warmPool.Get()" that don't exist in the current codebase. While this is expected for a design proposal, the implementation details should clarify these are new components to be created and provide more specifics about their interfaces, particularly how they integrate with the existing controller-runtime patterns used in pkg/workloadmanager.

4. Update CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement traffic collector in Router
Copilot AI Feb 24, 2026

The architecture diagram shows "Traffic Collec" (likely "Traffic Collector") as a component in Router, but the implementation plan in Phase 3 mentions "Implement traffic collector in Router" without specifying how this integrates with the existing Gin-based router in pkg/router/server.go. Consider documenting whether this will be a middleware, a separate goroutine, or part of existing request handlers, and how it avoids adding latency to request processing.

Suggested change
1. Implement traffic collector in Router
1. Implement traffic collector in Router
- Add a lightweight Gin middleware in `pkg/router/server.go` that attaches to the existing invocation routes and records per-request metadata (namespace, runtime name, route, status code, request/response size, duration).
- The middleware must perform only cheap, in-memory operations on the hot path: timestamping, basic field extraction, and enqueueing into a buffered channel; it must not perform any blocking I/O or heavy computation.
- Start a background worker (goroutine) when initializing the Router that reads from this channel and aggregates traffic data into in-memory counters/time-series used by the `TrafficPredictor` and exported as metrics.
- Ensure the enqueue operation in the middleware is non-blocking (for example, using a `select` with a `default` case when the channel is full) so that traffic collection cannot add latency or backpressure to request processing.

historyStore HistoryStore
// Prediction window
predictionWindow time.Duration
// Minimum samples for prediction
Copilot AI Feb 24, 2026

The TrafficPredictor struct includes a "minSamples" field, but there's no guidance on what happens when insufficient samples are available (e.g., during initial deployment or after low-traffic periods). Consider documenting the fallback behavior: should it use a conservative default pool size, maintain the current size, or apply a specific bootstrap strategy?

Suggested change
// Minimum samples for prediction
// Minimum samples for prediction. When the available samples in the historyStore
// are below this threshold, the predictor does not attempt to compute a new
// demand estimate. Instead, it falls back to a conservative bootstrap strategy:
// use a safe default or maintain the current pool size and mark any derived
// PredictionResult as low-confidence. This avoids overreacting to sparse or
// startup traffic while the history is still being populated.
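A minimal sketch of that fallback behavior; the function name and the mean-based stand-in predictor are illustrative, not the proposed algorithm.

```go
package main

import "fmt"

// predictOrFallback sketches the bootstrap behavior described above: with
// fewer than minSamples data points, keep the current pool size and flag the
// result as low-confidence instead of computing a new estimate.
func predictOrFallback(samples []float64, minSamples, currentSize int) (size int, confident bool) {
	if len(samples) < minSamples {
		return currentSize, false // conservative fallback while history fills
	}
	// Trivial stand-in for the real predictor: mean demand.
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return int(sum / float64(len(samples))), true
}

func main() {
	size, ok := predictOrFallback([]float64{8, 12}, 10, 5)
	fmt.Println(size, ok) // 5 false — too few samples, keep current size
	size, ok = predictOrFallback([]float64{8, 12, 10, 10, 10, 10, 10, 10, 10, 10}, 10, 5)
	fmt.Println(size, ok) // 10 true
}
```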

Comment on lines +314 to +317
1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations
Copilot AI Feb 24, 2026

The security section mentions "State Isolation" and "Audit Logging" as critical requirements, but doesn't provide concrete implementation details or specify how these should be verified. Given the security implications of sandbox reuse (potential cross-session data leakage), this section should include more specific requirements such as: what exactly needs to be cleaned (file handles, shared memory, IPC objects, network sockets), how to verify cleanup completeness, and what audit logs should include (session IDs, cleanup results, any failures).

Suggested change
1. **State Isolation**: Ensure complete cleanup between sessions
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
3. **Timeout**: Maximum cleanup duration (30 seconds)
4. **Audit Logging**: Log all cleanup operations
1. **State Isolation**: Ensure complete cleanup between sessions before a sandbox is returned to any warm/hot pool
- **Process state**: terminate all user/session processes; only the sandbox supervisor/runtime (e.g. picod) may remain
- **File descriptors**: close all non-essential file handles (no leftover open files, pipes, or device handles from the previous session)
- **Filesystem/workspace**:
- remove all session-specific files and directories from the workspace, including temporary files and scratch data
- clear any per-session configuration, credentials, and secrets from disk
- ensure only the base image and expected runtime artifacts remain
- **IPC and shared memory**: remove/cleanup shared memory segments, UNIX domain sockets, named pipes, message queues, and any other IPC objects created by the session
- **Network state**: close all session-created network sockets and listeners; ensure no ports remain bound on the sandbox interface except those required by the control plane
- **In-memory state**: clear per-session in-memory caches, environment variables, and any other transient state holding user data or secrets
- **Verification before reuse**:
- the cleanup endpoint must run a deterministic set of checks (e.g. no extra processes, no unexpected open FDs, workspace matches expected baseline)
- if any verification check fails, the sandbox MUST NOT be returned to a warm/hot pool and SHOULD be destroyed instead
- the cleanup response MUST indicate success/failure and the reason for any failure, so the workload manager can enforce policy
2. **Resource Limits**: Prevent cleanup from consuming excessive resources
- enforce CPU and memory limits for the cleanup routine (e.g. via cgroup or microVM limits)
- limit maximum workspace size to be scanned/removed per cleanup to avoid pathological cases
- ensure cleanup does not starve active sandboxes (e.g. cap concurrent cleanups per node)
3. **Timeout**: Maximum cleanup duration (30 seconds)
- if cleanup does not complete within the timeout, treat it as a failure and destroy the sandbox instead of reusing it
- timeouts and failures must be surfaced to operators via metrics and logs
4. **Audit Logging**: Log all cleanup operations in both workload manager and picod where applicable
- include at minimum: session ID, sandbox ID, namespace/name, runtime kind, and node identifier
- record timestamps for when cleanup started and completed, and the total duration
- capture the cleanup result (success/failure), verification status, and whether the sandbox was:
- returned to hot/warm pool, or
- destroyed/terminated due to cleanup failure or timeout
- log any errors encountered during cleanup (e.g. failure to delete files, kill processes, or close sockets), including enough detail for incident response without leaking user content
- ensure logs are structured (e.g. JSON) to support automated analysis and correlation across components

- PeriodicPattern: Time-of-day adjustment based on historical patterns
- Trend: Linear trend from recent data

Copilot AI Feb 24, 2026

The prediction algorithm mentions "Time-of-day adjustment based on historical patterns" but doesn't explain how this handles timezone considerations or different usage patterns across global deployments. Consider documenting whether traffic patterns are stored/calculated in UTC, local cluster time, or user timezone, and how the system handles multi-region scenarios where different regions may have different traffic patterns.

Suggested change
**Timezone and multi-region semantics**
- All `TrafficSample.Timestamp` values and historical aggregates are stored and processed in UTC.
- Time-of-day buckets used for `PeriodicPattern` are defined in UTC; when needed, a cluster or region may interpret predictions in its configured local timezone by applying the appropriate offset at evaluation time, without mutating stored data.
- Multi-region deployments maintain independent traffic patterns per region/cluster (e.g., keyed by cluster or region identifier), so that differing local peak hours do not get merged into a single global pattern.
- If a deployment needs user-local time semantics, this is modeled via configuration (for example, per-runtime or per-namespace “local region/timezone”), but the canonical storage and computation layer remains UTC-based.

Comment on lines +198 to +199
# Or static size (mutually exclusive)
# staticSize: 10
Copilot AI Feb 24, 2026

The proposal shows mutually exclusive configuration between "dynamicSizing" and "staticSize" with a comment, but doesn't specify how this mutual exclusivity is enforced. Consider adding a kubebuilder validation annotation or webhook validation to prevent users from accidentally configuring both, or document the precedence rules if both are specified.

HotPool --> WarmPool: Idle timeout
WarmPool --> ColdPool: Scale down / low traffic

InUse --> WarmPool: Explicit release
Copilot AI Feb 24, 2026

The Pool Transition Flow diagram shows "InUse --> WarmPool: Explicit release" as a possible transition, but the sandbox reuse section states sandboxes return to the Hot Pool after cleanup. This inconsistency needs clarification: under what conditions does a sandbox go to Warm Pool versus Hot Pool after being used? Consider aligning the state diagram with the detailed flow described in the Sandbox Reuse Strategy section.

Suggested change
InUse --> WarmPool: Explicit release
%% After session cleanup, sandboxes always return to HotPool;
%% demotion to WarmPool happens only via HotPool --> WarmPool.

Comment on lines +358 to +373
### Phase 1: Sandbox Reuse (Week 1-2)
1. Add cleanup endpoint to Picod
2. Implement SessionEndHandler in workload manager
3. Modify sandbox allocation to check hot pool first
4. Add configuration flags

### Phase 2: Multi-Level Pool (Week 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Create PoolManager for unified pool access
4. Update CodeInterpreter API to support multi-level configuration

### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler
Copilot AI Feb 24, 2026

The implementation timeline shows "Week 1-2", "Week 3-4", "Week 5-6" but these phases have dependencies (e.g., Phase 2 needs PoolManager which is used by Phase 1's SessionEndHandler). Consider clarifying whether these are sequential (12 weeks total) or if there's expected parallelization, and document the critical path dependencies between phases.

Suggested change
### Phase 1: Sandbox Reuse (Week 1-2)
1. Add cleanup endpoint to Picod
2. Implement SessionEndHandler in workload manager
3. Modify sandbox allocation to check hot pool first
4. Add configuration flags
### Phase 2: Multi-Level Pool (Week 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Create PoolManager for unified pool access
4. Update CodeInterpreter API to support multi-level configuration
### Phase 3: Dynamic Sizing (Week 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler
### Phase 1: Sandbox Reuse (Weeks 1-2)
Phases are primarily sequential (total ~6 elapsed weeks), with limited parallelization where dependencies allow. Work in later phases must not start on items that depend on components introduced in earlier phases (e.g., `PoolManager` and reuse wiring).
1. Create PoolManager for unified pool access (initially targeting the existing warm pool)
2. Add cleanup endpoint to Picod
3. Implement SessionEndHandler in workload manager using PoolManager
4. Modify sandbox allocation to check the hot/warm pool via PoolManager before provisioning new sandboxes
5. Add configuration flags
### Phase 2: Multi-Level Pool (Weeks 3-4)
1. Define SandboxHotPool CRD
2. Implement HotPoolController
3. Integrate HotPoolController with PoolManager for unified multi-level pool access
4. Update CodeInterpreter API to support multi-level configuration
### Phase 3: Dynamic Sizing (Weeks 5-6)
1. Implement traffic collector in Router
2. Create TrafficPredictor with simple algorithm
3. Implement PoolAutoscaler leveraging PoolManager and multi-level pool metrics

#### SandboxHotPool CRD

```yaml
apiVersion: extensions.agent.x-k8s.io/v1alpha1
```

Copilot AI Feb 24, 2026


The apiVersion for SandboxHotPool CRD uses "extensions.agent.x-k8s.io/v1alpha1", but based on the codebase, the existing agent-sandbox CRDs (like SandboxWarmPool, SandboxTemplate, and SandboxClaim) use the API group from "sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1". This inconsistency could cause confusion. Consider aligning with the existing agent-sandbox API group pattern, or clarify if this is a new separate API group and explain the reasoning.

Suggested change
apiVersion: extensions.agent.x-k8s.io/v1alpha1
apiVersion: sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1

```yaml
reuse:
  enabled: true
  cleanupTimeout: 30s
  maxReuseCount: 100 # Destroy after N reuses
```

Copilot AI Feb 24, 2026


The configuration shows "maxReuseCount: 100" to destroy sandboxes after N reuses, but there's no explanation of why this limit is needed or how to choose an appropriate value. From a security and operational perspective, document the reasoning (e.g., prevent memory leaks, ensure fresh state periodically) and provide guidance on tuning this value based on workload characteristics.

Member

@hzxuzhonghu hzxuzhonghu left a comment


Thanks for the good input. From my side, I think it is too coarse to get the complete flow. Can you please supplement it first?

The current warm pool implementation has several limitations:

1. **Static Pool Size**: `WarmPoolSize` is statically configured, cannot adapt to traffic patterns
2. **Single-Level Pool**: Only one warm pool level, no differentiation between ready-to-use and quick-start states
Member


I do not quite understand this.


```
┌─────────────────────────────────────────────────────────────┐
│ Cold Pool (Infinite) │
```
Member


Suggest not using the pool concept here; it causes confusion.

```
┌─────────────────────────────────────────────────────────────┐
│ Warm Pool (10-20 sandboxes) │
│ - Image pulled, container created but paused │
```
Member


How do you achieve this?

1. **Hot Pool Controller**: New controller in `pkg/workloadmanager/hotpool_controller.go`
- Maintains a pool of ready sandboxes
- Monitors pool size and creates/destroys as needed
- Integrates with traffic predictor
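
As a rough sketch of the controller's core resize logic (type names, fields, and the hook-based shape are assumptions, not the proposal's final design):

```go
package main

// HotPoolController keeps the number of ready sandboxes at the desired size.
// All names and fields here are illustrative.
type HotPoolController struct {
	desired int
	ready   []string // IDs of fully started sandboxes
}

// delta reports how many sandboxes to create (positive) or destroy (negative).
func (c *HotPoolController) delta() int {
	return c.desired - len(c.ready)
}

// reconcile converges the pool using caller-supplied create/destroy hooks,
// which in the real controller would promote from the warm pool or pause
// back into it.
func (c *HotPoolController) reconcile(create func() string, destroy func(id string)) {
	for c.delta() > 0 {
		c.ready = append(c.ready, create())
	}
	for c.delta() < 0 {
		last := len(c.ready) - 1
		destroy(c.ready[last])
		c.ready = c.ready[:last]
	}
}
```

In practice `desired` would be fed by the traffic predictor rather than set statically.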
Member


Maybe you can first elaborate on all the new concepts before the API design section.

```yaml
availableReplicas: 3
```

### Pool Transition Flow
Member


For this part, I want to know how the sandbox transitions between the WarmPool and the HotPool; I want to see this design.

Comment on lines +120 to +138
- Integrates with traffic predictor

2. **Pool Manager**: New component in `pkg/workloadmanager/pool_manager.go`
- Manages transitions between Hot/Warm/Cold pools
- Provides unified interface for sandbox allocation
- Handles pool overflow/underflow

---

## 2. Dynamic Warm Pool Sizing

### Traffic Prediction

Based on historical data, predict future demand:

```go
type TrafficPredictor struct {
// Historical data store
historyStore HistoryStore
```
Member


What does it store?
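
One plausible answer, purely an assumption for illustration (the proposal should specify): a bounded window of timestamped demand samples, kept in a ring buffer.

```go
package main

import "time"

// TrafficSample is one observation of demand. The exact schema is an
// assumption; e.g. a HistoryStore might keep the last 7 days at 1-minute
// resolution.
type TrafficSample struct {
	Timestamp     time.Time
	RequestCount  int // sandbox allocation requests in this interval
	ActiveSandbox int // concurrently active sandboxes at sample time
}

// HistoryStore keeps the most recent samples, evicting the oldest.
type HistoryStore struct {
	samples []TrafficSample
	max     int
}

// Add appends a sample and drops the oldest once the window is full.
func (h *HistoryStore) Add(s TrafficSample) {
	h.samples = append(h.samples, s)
	if len(h.samples) > h.max {
		h.samples = h.samples[1:]
	}
}
```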


### Traffic Prediction

Based on historical data, predict future demand:
Member


Please add some more details.
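
For instance, the "simple algorithm" could be a windowed average with a safety margin (an assumption; the proposal leaves the method unspecified):

```go
package main

// predictDemand returns a naive forecast: the mean of recent observations
// inflated by a safety margin. This is one possible "simple algorithm",
// not the proposal's stated method.
func predictDemand(history []int, margin float64) int {
	if len(history) == 0 {
		return 0
	}
	sum := 0
	for _, v := range history {
		sum += v
	}
	avg := float64(sum) / float64(len(history))
	return int(avg*(1+margin) + 0.5) // round to nearest
}
```

A fancier variant could average the same time-of-day window across previous days to capture daily seasonality.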

Member


Data plane traffic can be known only by the router, but this proposal targets the workload manager.

Comment on lines +191 to +194
```yaml
predictionWindow: 30m
# Scale-up cooldown
scaleUpCooldown: 5m
# Scale-down cooldown
```
Member


Please make them understandable.

```go
}

// 2. Try warm pool
if sandbox := s.warmPool.Get(namespace, name); sandbox != nil {
```
Member


As the warm pool sandbox is paused, don't you need to resume it first?
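
To make the review point concrete: if warm sandboxes are paused, the allocation path presumably needs an explicit resume step before handing the sandbox out. A sketch, where `Sandbox`, `Resume`, and the error handling are all assumptions:

```go
package main

import "errors"

// Sandbox is a minimal stand-in for the real sandbox object.
type Sandbox struct {
	ID     string
	Paused bool
}

// Resume unpauses the sandbox; in the real system this would call the
// runtime (e.g. unfreeze the cgroup or resume the VM).
func (s *Sandbox) Resume() error {
	if !s.Paused {
		return errors.New("sandbox is not paused")
	}
	s.Paused = false
	return nil
}

// allocateFromWarmPool shows the step the review points at: a warm (paused)
// sandbox must be resumed before it is returned to the caller.
func allocateFromWarmPool(sb *Sandbox) (*Sandbox, error) {
	if sb == nil {
		return nil, errors.New("warm pool empty")
	}
	if sb.Paused {
		if err := sb.Resume(); err != nil {
			return nil, err
		}
	}
	return sb, nil
}
```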
