Planner Guide

The Dynamo SLA Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.

For a quick overview, see the Planner README. For architecture internals, see Planner Design.

Scaling Modes

The planner supports two scaling modes that can be used independently or together:

  • Throughput-based scaling (enable_throughput_scaling: true): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the Profiler.
  • Load-based scaling (enable_load_scaling: true): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data.

When to use which:

  • Enable throughput-based scaling whenever profiling data is available. It provides stable, prediction-based capacity planning.
  • Enable load-based scaling when traffic is bursty. It reacts quickly to real-time load changes.
  • Enable both to combine the two: throughput-based scaling provides a capacity floor, and load-based scaling handles bursts above it. When both are enabled, use a longer throughput_adjustment_interval, which must in any case exceed load_adjustment_interval.
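
The "floor plus bursts" behavior described for the combined mode can be pictured with a small sketch. This is an illustration only, not the planner's code: the function name combine_targets and the max_replicas cap are hypothetical, and it assumes the throughput-based plan acts purely as a lower bound.

```python
def combine_targets(throughput_floor: int, load_target: int, max_replicas: int) -> int:
    """Pick a replica count when both scaling modes are enabled.

    Assumption: the throughput-based plan is a lower bound, and the
    load-based plan may only raise the count above that bound.
    """
    desired = max(throughput_floor, load_target)
    # Never exceed the (hypothetical) upper limit on replicas.
    return min(desired, max_replicas)


# Quiet period: load-based scaling asks for less, so the floor wins.
print(combine_targets(throughput_floor=4, load_target=2, max_replicas=16))  # 4
# Burst: load-based scaling asks for more than the floor.
print(combine_targets(throughput_floor=4, load_target=7, max_replicas=16))  # 7
```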

PlannerConfig Reference

The planner is configured via a PlannerConfig JSON/YAML object. When using the profiler, this is placed under the features.planner section of the DGDR spec:

features:
  planner:
    enable_throughput_scaling: true
    enable_load_scaling: false
    pre_deployment_sweeping_mode: rapid
    mode: disagg
    backend: vllm

Scaling Mode Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enable_throughput_scaling | bool | true | Enable throughput-based scaling (requires pre-deployment profiling data). |
| enable_load_scaling | bool | true | Enable load-based scaling (no pre-deployment profiling data required). |

At least one scaling mode must be enabled.

Pre-Deployment Sweeping

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pre_deployment_sweeping_mode | string | rapid | How to generate engine interpolation data: rapid (AIC simulation, ~30s), thorough (real GPUs, 2-4h), or none (skip). |

When throughput-based scaling is enabled, the planner needs interpolation curves that map input sequence length (ISL) to time to first token (TTFT) for prefill, and KV-cache utilization to inter-token latency (ITL) for decode. The profiler generates this data according to the pre_deployment_sweeping_mode setting. See the Profiler Guide for details on how this data is produced.
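
To make the idea of an interpolation curve concrete, here is a minimal sketch of a piecewise-linear ISL-to-TTFT lookup. The data points are invented for illustration, and the names PREFILL_PROFILE, interpolate, and meets_ttft_sla are hypothetical; the format of the real profiling data and the planner's actual interpolation method are not shown in this guide.

```python
from bisect import bisect_right

# Invented profiling points: (input sequence length, measured TTFT in ms).
PREFILL_PROFILE = [(512, 40.0), (1024, 85.0), (2048, 180.0), (4096, 400.0)]


def interpolate(points, x):
    """Piecewise-linear interpolation, clamped to the profiled range."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # index of the first profiled x greater than x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def meets_ttft_sla(isl, ttft_sla_ms):
    """Check the interpolated TTFT for this ISL against the SLA target."""
    return interpolate(PREFILL_PROFILE, isl) <= ttft_sla_ms


print(interpolate(PREFILL_PROFILE, 1536))  # 132.5, halfway between 85 and 180
print(meets_ttft_sla(4096, 200.0))         # False, 400 ms exceeds the target
```

The same helper would apply to a decode curve by swapping in (KV-cache utilization, ITL) points.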

Throughput-Based Scaling Settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| throughput_adjustment_interval | int | 60 | Seconds between throughput-based scaling decisions. |
| min_endpoint | int | 1 | Minimum number of engine endpoints to maintain. |
| max_gpu_budget | int | 128 | Maximum total GPUs the planner may allocate. |
| ttft | float | 2000.0 | TTFT SLA target (ms) for scaling decisions. |
| itl | float | 30.0 | ITL SLA target (ms) for scaling decisions. |
| no_correction | bool | false | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
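
As a rough illustration of how min_endpoint and max_gpu_budget could bound a scaling decision, the sketch below clamps desired replica counts. It rests on assumptions this guide does not confirm: the function clamp_to_budget and the per-engine GPU counts are hypothetical, and shrinking both pools by the same ratio when over budget is only one possible policy.

```python
import math


def clamp_to_budget(prefill, decode, prefill_gpus, decode_gpus,
                    min_endpoint=1, max_gpu_budget=128):
    """Apply min_endpoint and max_gpu_budget to desired replica counts.

    Assumption: when the request exceeds the budget, both counts shrink by
    the same ratio. The real planner may divide the budget differently.
    """
    prefill = max(prefill, min_endpoint)
    decode = max(decode, min_endpoint)
    total = prefill * prefill_gpus + decode * decode_gpus
    if total <= max_gpu_budget:
        return prefill, decode
    scale = max_gpu_budget / total
    # Round down so the scaled request stays within the budget,
    # but never drop below the configured minimum.
    prefill = max(min_endpoint, math.floor(prefill * scale))
    decode = max(min_endpoint, math.floor(decode * scale))
    return prefill, decode


# 10 prefill + 20 decode engines at 4 GPUs each need 120 GPUs; budget is 64.
print(clamp_to_budget(10, 20, 4, 4, max_gpu_budget=64))  # (5, 10), 60 GPUs
```

Note that the min_endpoint floor takes priority in this sketch, so a very small budget could still be exceeded by the minimum deployment.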

Load-Based Scaling Settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| load_adjustment_interval | int | 10 | Seconds between load-based scaling decisions. Must be shorter than throughput_adjustment_interval. |
| load_learning_window | int | 120 | Seconds of history used for online regression. |
| load_scaling_down_sensitivity | int | 3 | Number of consecutive underutilized intervals before scaling down. |
| load_metric_samples | int | 10 | Number of metric samples to collect per decision. |
| load_min_observations | int | 5 | Minimum observations before making scaling decisions. |
| load_router_metrics_url | string | null | Router metrics endpoint. Required outside Kubernetes mode. |
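
The load_scaling_down_sensitivity setting describes a debounce: scale-down waits for several consecutive underutilized intervals. A minimal sketch of that idea follows. The class ScaleDownGate is hypothetical, and resetting the streak after a scale-down is an assumption rather than documented planner behavior.

```python
class ScaleDownGate:
    """Allow a scale-down only after N consecutive underutilized intervals."""

    def __init__(self, sensitivity: int = 3):
        self.sensitivity = sensitivity
        self.streak = 0

    def observe(self, underutilized: bool) -> bool:
        """Record one interval; return True when a scale-down may proceed."""
        if not underutilized:
            self.streak = 0  # any busy interval breaks the streak
            return False
        self.streak += 1
        if self.streak >= self.sensitivity:
            self.streak = 0  # assumption: start counting again after acting
            return True
        return False


gate = ScaleDownGate(sensitivity=3)
decisions = [gate.observe(u) for u in [True, True, False, True, True, True]]
print(decisions)  # [False, False, False, False, False, True]
```

Requiring a streak trades a slower scale-down for fewer oscillations when load hovers near the threshold.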

General Settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | disagg | Planner mode: disagg, prefill, decode, or agg. |
| backend | string | vllm | Backend: vllm, sglang, trtllm, or mocker. |
| environment | string | kubernetes | Runtime environment: kubernetes, virtual, or global-planner. |
| namespace | string | env DYN_NAMESPACE | Kubernetes namespace for the deployment. |
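
The cross-field rules stated in the tables above (at least one scaling mode, the interval ordering, the router URL requirement, and the allowed enum values) can be collected into a validation sketch. PlannerSettings and validate are hypothetical names, not the planner's own classes. Two points are assumptions: the interval rule is checked only when both modes are enabled, and the router URL is required only when load-based scaling is enabled.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_MODES = {"disagg", "prefill", "decode", "agg"}
VALID_BACKENDS = {"vllm", "sglang", "trtllm", "mocker"}
VALID_ENVIRONMENTS = {"kubernetes", "virtual", "global-planner"}


@dataclass
class PlannerSettings:
    """Hypothetical subset of PlannerConfig, using the documented defaults."""
    enable_throughput_scaling: bool = True
    enable_load_scaling: bool = True
    throughput_adjustment_interval: int = 60
    load_adjustment_interval: int = 10
    load_router_metrics_url: Optional[str] = None
    mode: str = "disagg"
    backend: str = "vllm"
    environment: str = "kubernetes"


def validate(cfg: PlannerSettings) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    if not (cfg.enable_throughput_scaling or cfg.enable_load_scaling):
        errors.append("at least one scaling mode must be enabled")
    if cfg.enable_throughput_scaling and cfg.enable_load_scaling:
        if cfg.load_adjustment_interval >= cfg.throughput_adjustment_interval:
            errors.append("load_adjustment_interval must be shorter than "
                          "throughput_adjustment_interval")
    if cfg.enable_load_scaling and cfg.environment != "kubernetes":
        if not cfg.load_router_metrics_url:
            errors.append("load_router_metrics_url is required outside Kubernetes")
    if cfg.mode not in VALID_MODES:
        errors.append(f"unknown mode: {cfg.mode}")
    if cfg.backend not in VALID_BACKENDS:
        errors.append(f"unknown backend: {cfg.backend}")
    if cfg.environment not in VALID_ENVIRONMENTS:
        errors.append(f"unknown environment: {cfg.environment}")
    return errors


print(validate(PlannerSettings()))  # []
print(validate(PlannerSettings(environment="virtual")))
```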

Traffic Prediction Settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| load_predictor | string | linear | Prediction method: linear, kalman, or prophet. |
| load_predictor_log1p | bool | true | Apply log1p transform to load data before prediction. |
| prophet_window_size | int | 300 | Window size (seconds) for Prophet predictor. |
| load_predictor_warmup_trace | string | null | Path to a warmup trace file for bootstrapping predictions. |
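
To show what a log1p transform combined with a linear predictor might look like, the sketch below fits a least-squares line to recent load samples in log1p space and maps the one-step forecast back with expm1. The function predict_next is hypothetical, and the guide does not specify how the planner's linear predictor is implemented, so treat this as one plausible reading.

```python
import math


def predict_next(loads, use_log1p=True):
    """Fit a least-squares line to recent loads and extrapolate one step."""
    if not loads:
        return 0.0
    # log1p compresses large spikes and is defined at zero load.
    ys = [math.log1p(v) for v in loads] if use_log1p else list(loads)
    n = len(ys)
    if n == 1:
        pred = ys[0]
    else:
        x_mean = (n - 1) / 2
        y_mean = sum(ys) / n
        sxx = sum((x - x_mean) ** 2 for x in range(n))
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
        slope = sxy / sxx
        pred = y_mean + slope * (n - x_mean)  # value at the next time index
    if use_log1p:
        pred = math.expm1(pred)  # invert the transform
    return max(pred, 0.0)  # a load forecast cannot be negative


print(predict_next([10, 20, 30, 40], use_log1p=False))
print(predict_next([10, 20, 30, 40]))
```

With the transform enabled, the fit is linear in log space, so the forecast for a steadily rising series differs from the plain linear extrapolation.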

Kalman Filter Settings

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| kalman_q_level | float | 0.1 | Process noise for level component. |
| kalman_q_trend | float | 0.01 | Process noise for trend component. |
| kalman_r | float | 1.0 | Measurement noise. |
| kalman_min_points | int | 10 | Minimum data points before Kalman predictions activate. |
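
The level and trend noise parameters suggest a local linear trend model. The sketch below is a textbook two-state Kalman filter of that kind, written to show where each setting would enter. It is not the planner's implementation: the class TrendKalman, the initial covariance, and returning None before kalman_min_points observations are all assumptions.

```python
class TrendKalman:
    """Two-state (level, trend) Kalman filter for a scalar load series."""

    def __init__(self, q_level=0.1, q_trend=0.01, r=1.0, min_points=10):
        self.q_level, self.q_trend, self.r = q_level, q_trend, r
        self.min_points = min_points
        self.n = 0
        self.level, self.trend = 0.0, 0.0
        # Symmetric 2x2 covariance stored as (p00, p01, p11); assumed initial value.
        self.p00, self.p01, self.p11 = 1.0, 0.0, 1.0

    def update(self, z):
        """Incorporate one observation z."""
        self.n += 1
        if self.n == 1:
            self.level = float(z)  # initialize the level from the first sample
            return
        # Predict: level advances by trend; covariance grows by process noise.
        level_p = self.level + self.trend
        p00 = self.p00 + 2 * self.p01 + self.p11 + self.q_level
        p01 = self.p01 + self.p11
        p11 = self.p11 + self.q_trend
        # Update: blend prediction and measurement using the Kalman gain.
        s = p00 + self.r
        k0, k1 = p00 / s, p01 / s
        resid = z - level_p
        self.level = level_p + k0 * resid
        self.trend = self.trend + k1 * resid
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p11 = p11 - k1 * p01

    def predict(self):
        """One-step-ahead forecast, or None until enough points are seen."""
        if self.n < self.min_points:
            return None
        return self.level + self.trend


kf = TrendKalman()
for t in range(20):
    kf.update(2.0 * t)  # noise-free ramp with slope 2
print(kf.predict())  # close to 40, the next point on the ramp
```

Larger q values make the filter follow recent changes more quickly, while a larger r makes it trust individual measurements less.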

Integration with Profiler

When the profiler runs with the planner enabled, it:

  1. Selects the best prefill and decode engine configurations
  2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
  3. Saves the PlannerConfig and profiling data into separate Kubernetes ConfigMaps
  4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps

The planner receives its config via --config /path/to/planner_config.json, a file mounted from the planner-config-XXXX ConfigMap. Profiling data is mounted from the planner-profile-data-XXXX ConfigMap.
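
For readers who want to inspect a mounted config file, here is a minimal sketch of reading a JSON file passed via a --config flag. Only the flag name and the JSON format come from this guide. The helper names config_path_from_args, parse_config, and load_planner_config are hypothetical and do not reflect the planner's actual entry point.

```python
import argparse
import json


def config_path_from_args(argv=None) -> str:
    """Extract the --config path from a command line."""
    parser = argparse.ArgumentParser(description="Illustrative config loader")
    parser.add_argument("--config", required=True,
                        help="Path to the mounted PlannerConfig JSON file")
    return parser.parse_args(argv).config


def parse_config(text: str) -> dict:
    """Parse the file contents and check that the top level is an object."""
    cfg = json.loads(text)
    if not isinstance(cfg, dict):
        raise ValueError("PlannerConfig must be a JSON object")
    return cfg


def load_planner_config(argv=None) -> dict:
    """Read and parse the file named by --config."""
    path = config_path_from_args(argv)
    with open(path, encoding="utf-8") as f:
        return parse_config(f.read())
```

Calling load_planner_config(["--config", "/path/to/planner_config.json"]) would return the parsed settings as a dictionary when that file exists.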

See the Profiler Guide for the full profiling workflow and how to configure pre-deployment sweeping.

See Also