Planner

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

The SLA Planner supports two scaling modes:

Throughput-based scaling: Uses pre-deployment profiling data and traffic prediction to compute the number of replicas needed to meet TTFT and ITL SLA targets. This is the primary scaling mode for production deployments.
Load-based scaling (Experimental): Uses real-time per-worker load metrics (active prefill tokens, active KV blocks) from the router to make SLA-aware scaling decisions via online linear regression. Does not require profiling data. Responds quickly to traffic bursts.

When both modes are enabled, throughput-based scaling provides a lower bound on replicas (long-term capacity planning) while load-based scaling handles real-time adjustments (burst response).
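
The interaction between the two modes can be sketched in a few lines. This is a minimal illustration of the "lower bound plus burst response" rule described above, not the planner's actual code; the function name `combine_replicas` and its parameters are hypothetical.

```python
def combine_replicas(throughput_floor: int, load_based_target: int) -> int:
    """Return the replica count when both scaling modes are enabled.

    Hypothetical helper: the throughput-based result acts as a lower bound,
    and the load-based result can only raise the count above that floor.
    """
    return max(throughput_floor, load_based_target)


# Steady traffic: load-based scaling asks for fewer replicas, so the floor wins.
print(combine_replicas(throughput_floor=4, load_based_target=2))  # 4
# Burst: load-based scaling asks for more replicas than the floor.
print(combine_replicas(throughput_floor=4, load_based_target=7))  # 7
```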

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

Feature Matrix

| Feature | Throughput-Based | Load-Based (Experimental) |
|---|---|---|
| **Deployment** | | |
| Disaggregated | Supported | Supported |
| Aggregated | Unsupported | Supported |
| **LLM Framework** | | |
| SGLang | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| vLLM | Supported | Supported |
| **Requires Profiling Data** | Yes | No |
| **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
| **Connectors** | | |
| KubernetesConnector | Supported | Supported |
| VirtualConnector | Supported | Supported |

When to Use Which Mode

Throughput-based scaling should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
Load-based scaling should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
Both modes together: Enable both to combine stable capacity planning with fast burst response. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
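
The guidance above can be written as a small decision helper. This is a sketch under stated assumptions: the function `planner_args` and its parameters are hypothetical, and only the flags themselves (`--mode`, `--enable-loadbased-scaling`, `--disable-throughput-scaling`) come from this README. It relies on throughput-based scaling being enabled by default and being unsupported for aggregated deployments.

```python
def planner_args(has_profiling_data: bool, bursty_traffic: bool,
                 mode: str = "disagg") -> list[str]:
    """Pick planner arguments following the mode guidance (illustrative only)."""
    # Throughput-based scaling needs profiling data and is unsupported in agg mode.
    use_throughput = has_profiling_data and mode != "agg"
    # Without throughput-based scaling, load-based scaling is the only mode left.
    use_load = bursty_traffic or not use_throughput

    args = [f"--mode={mode}"]
    if use_load:
        args.append("--enable-loadbased-scaling")
    if not use_throughput:
        # Throughput-based scaling is on by default, so it must be switched off.
        args.append("--disable-throughput-scaling")
    return args


# Profiled model with bursty traffic: both modes enabled.
print(planner_args(has_profiling_data=True, bursty_traffic=True))
# No profiling data: load-based scaling only.
print(planner_args(has_profiling_data=False, bursty_traffic=True))
```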

Quick Start

Prerequisites

Dynamo platform installed on Kubernetes (Installation Guide)
kube-prometheus-stack installed (Metrics Setup)

For throughput-based scaling, pre-deployment profiling is also required (Profiling Guide).

Throughput-Based Scaling (with DGDR)

The fastest path to a throughput-based planner deployment is through a DynamoGraphDeploymentRequest, which automatically profiles your model:

kubectl apply -f components/src/dynamo/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE

See the Planner Guide for the full workflow.

Load-Based Scaling (without profiling)

To deploy with load-based scaling only (no profiling required), add these arguments to the planner service in your DynamoGraphDeployment (DGD):

args:
  - --enable-loadbased-scaling
  - --disable-throughput-scaling
  - --loadbased-adjustment-interval=5

The planner will auto-discover the frontend metrics endpoint from the DGD. See disagg_planner_load.yaml for a complete example.

Manual DGD Deployment

For manual control with throughput-based scaling, use the disaggregated planner templates:

# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE

Documentation

| Document | Description |
|---|---|
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

Configuration Reference

Key Arguments

| Argument | Default | Description |
|---|---|---|
| **Common** | | |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`sglang`, `trtllm`, `vllm`) |
| `--mode` | `disagg` | Planner mode (`disagg`, `prefill`, `decode`, `agg`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| **Throughput-based scaling** | | |
| `--enable-throughput-scaling` | `true` | Enable throughput-based scaling |
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--no-correction` | `false` | Disable correction factors |
| **Load-based scaling (Experimental)** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
| `--loadbased-learning-window` | `50` | Sliding window size for regression model |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
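
To see how the GPU-related arguments relate to each other, the following sketch checks whether a candidate replica plan fits the budget. The helpers `total_gpus` and `fits_budget` are hypothetical and only mirror the argument descriptions in the table; how the planner itself reacts when a plan exceeds the budget is not covered here.

```python
def total_gpus(prefill_replicas: int, decode_replicas: int,
               prefill_engine_num_gpu: int = 1,
               decode_engine_num_gpu: int = 1) -> int:
    """GPUs consumed by a plan: replicas times GPUs per engine, summed."""
    return (prefill_replicas * prefill_engine_num_gpu
            + decode_replicas * decode_engine_num_gpu)


def fits_budget(prefill_replicas: int, decode_replicas: int, *,
                max_gpu_budget: int = 8, min_endpoint: int = 1,
                prefill_engine_num_gpu: int = 1,
                decode_engine_num_gpu: int = 1) -> bool:
    """Check a plan against --min-endpoint and --max-gpu-budget (illustrative)."""
    # Each worker type must keep at least `min_endpoint` replicas.
    if prefill_replicas < min_endpoint or decode_replicas < min_endpoint:
        return False
    used = total_gpus(prefill_replicas, decode_replicas,
                      prefill_engine_num_gpu, decode_engine_num_gpu)
    return used <= max_gpu_budget


# 2 prefill engines x 2 GPUs + 3 decode engines x 1 GPU = 7 GPUs, within 8.
print(fits_budget(2, 3, prefill_engine_num_gpu=2))  # True
# 3 prefill engines x 2 GPUs + 3 decode engines x 1 GPU = 9 GPUs, over 8.
print(fits_budget(3, 3, prefill_engine_num_gpu=2))  # False
```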

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
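
A script that runs alongside the planner could resolve these variables as sketched below. The function `load_planner_env` and the keys of the returned dictionary are hypothetical; only the variable names and defaults are taken from the table above.

```python
import os

DEFAULT_PROMETHEUS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def load_planner_env(env=None) -> dict:
    """Resolve planner environment variables with their documented defaults."""
    env = os.environ if env is None else env
    parent = env.get("DYN_PARENT_DGD_K8S_NAME")
    if not parent:
        # This variable has no default, so fail early with a clear message.
        raise ValueError("DYN_PARENT_DGD_K8S_NAME is required")
    return {
        "namespace": env.get("DYN_NAMESPACE", "dynamo"),
        "parent_dgd": parent,
        "prometheus_endpoint": env.get("PROMETHEUS_ENDPOINT",
                                       DEFAULT_PROMETHEUS_ENDPOINT),
        # Port 0 means the planner's own metrics endpoint is disabled.
        "metrics_port": int(env.get("PLANNER_PROMETHEUS_PORT", "0")),
    }


# "my-dgd" is a placeholder resource name used only for this example.
config = load_planner_env({"DYN_PARENT_DGD_K8S_NAME": "my-dgd"})
print(config["namespace"], config["metrics_port"])  # dynamo 0
```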

Monitoring

Grafana Dashboard

Deploy the planner dashboard:

kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml

The dashboard shows:

Worker counts and GPU usage over time
Observed TTFT, ITL, request rate, sequence lengths
Predicted load and recommended replica counts
Correction factors (actual vs. expected performance)

Prometheus Metrics

Throughput-based scaling pulls traffic metrics from the cluster-wide Prometheus server:

Request count and duration
TTFT and ITL distributions
Input/output sequence lengths

Load-based scaling pulls per-engine status directly from the frontend's /metrics endpoint:

Active prefill tokens per worker
Active decode blocks per worker
Last observed TTFT, ITL, and ISL per worker
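
Load-based scaling is described earlier as fitting an online linear regression over per-worker metrics like these. The sketch below shows one generic way to do that: a one-variable least-squares fit over a sliding window that maps load to latency and then inverts the fit to find the load that meets a latency target. The class `SlidingWindowRegression` is hypothetical, and the planner's actual features and model are not documented here; only the default window size (50) and minimum observations (5) are borrowed from the `--loadbased-learning-window` and `--loadbased-min-observations` defaults.

```python
from collections import deque


class SlidingWindowRegression:
    """Least-squares fit of latency against load over the most recent samples."""

    def __init__(self, window: int = 50, min_observations: int = 5):
        self.samples = deque(maxlen=window)  # old samples fall off automatically
        self.min_observations = min_observations

    def observe(self, load: float, latency_ms: float) -> None:
        self.samples.append((load, latency_ms))

    def fit(self):
        """Return (slope, intercept), or None if there is too little data."""
        n = len(self.samples)
        if n < self.min_observations:
            return None
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return None  # all loads identical, slope is undefined
        cov = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = cov / var_x
        return slope, mean_y - slope * mean_x

    def max_load_for(self, target_ms: float):
        """Largest load predicted to stay at or below the latency target."""
        model = self.fit()
        if model is None or model[0] <= 0:
            return None
        slope, intercept = model
        return (target_ms - intercept) / slope


# Synthetic samples following latency = 0.1 * active_prefill_tokens + 20.
reg = SlidingWindowRegression()
for tokens, ttft_ms in [(1000, 120.0), (2000, 220.0), (3000, 320.0),
                        (4000, 420.0), (5000, 520.0)]:
    reg.observe(tokens, ttft_ms)

# Load per worker that keeps predicted TTFT at the 500 ms default target.
print(round(reg.max_load_for(500.0)))  # 4800
```

A worker whose observed load exceeds this value would be predicted to miss the target, which is the kind of signal a load-based scaler can act on.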
