The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

| Category | Feature | Status |
|---|---|---|
| Backend | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |

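
Of the load predictors listed above, the constant predictor is the easiest to reason about: it forecasts that the next interval's load equals the most recent observation. The following is a minimal sketch of that idea only. The class name `ConstantPredictor` and its methods are hypothetical and are not taken from the Planner's code.

```python
from typing import Optional


class ConstantPredictor:
    """Hypothetical illustration: forecast next load = most recent observation."""

    def __init__(self) -> None:
        self._last: Optional[float] = None

    def add_data_point(self, value: float) -> None:
        # Keep only the latest observation; older history is ignored.
        self._last = value

    def predict_next(self) -> Optional[float]:
        # Before any observation arrives there is nothing to forecast.
        return self._last


predictor = ConstantPredictor()
for requests_per_second in (12.0, 15.5, 14.0):
    predictor.add_data_point(requests_per_second)
print(predictor.predict_next())
```

A predictor like this reacts instantly to load changes but cannot anticipate trends, which is where ARIMA, Prophet, or a Kalman filter may be a better fit.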
- Dynamo platform installed on Kubernetes (Installation Guide)
- kube-prometheus-stack installed (Metrics Setup)
- Pre-deployment profiling completed (Profiling Guide)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:

    kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE

This automatically profiles your model and deploys with the SLA planner. See SLA Planner Guide for the full workflow.
For manual control, use the disaggregated planner templates:

    # After profiling is complete
    kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE

| Document | Description |
|---|---|
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA Planner Guide | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| SLA-based Planner | Scaling algorithm, correction factors, load prediction details |
| Load-based Planner | Legacy load-based scaling (deprecated) |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

| Argument | Default | Description |
|---|---|---|
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |

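
The GPU-related arguments interact: each prefill or decode replica consumes `--prefill-engine-num-gpu` or `--decode-engine-num-gpu` GPUs, the total must fit within `--max-gpu-budget`, and each worker type keeps at least `--min-endpoint` replicas. The sketch below shows one way such a constraint could be applied to a pair of desired replica counts. The function `fit_to_gpu_budget` and its proportional scale-down strategy are assumptions made for illustration, not the Planner's actual algorithm.

```python
import math
from typing import Tuple


def fit_to_gpu_budget(
    prefill: int,
    decode: int,
    prefill_gpus: int = 1,    # mirrors --prefill-engine-num-gpu
    decode_gpus: int = 1,     # mirrors --decode-engine-num-gpu
    max_gpu_budget: int = 8,  # mirrors --max-gpu-budget
    min_endpoint: int = 1,    # mirrors --min-endpoint
) -> Tuple[int, int]:
    """Hypothetical helper: clamp desired replica counts to a GPU budget."""
    # Never go below the configured minimum for either worker type.
    prefill = max(prefill, min_endpoint)
    decode = max(decode, min_endpoint)

    total = prefill * prefill_gpus + decode * decode_gpus
    if total <= max_gpu_budget:
        return prefill, decode

    # Assumed strategy: shrink both worker types by the same ratio, rounding down.
    # If the minimum floor applies, the result can still exceed a very small budget.
    scale = max_gpu_budget / total
    prefill = max(min_endpoint, math.floor(prefill * scale))
    decode = max(min_endpoint, math.floor(decode * scale))
    return prefill, decode


print(fit_to_gpu_budget(prefill=6, decode=10))
```

With the default budget of 8 GPUs and one GPU per engine, a request for 6 prefill and 10 decode replicas is halved so that the total fits the budget.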

| Variable | Default | Description |
|---|---|---|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |

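
To make the defaults above concrete, here is a minimal sketch of how a process might read these variables. The `PlannerEnv` dataclass and `load_planner_env` function are hypothetical names used for illustration; only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_PROMETHEUS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


@dataclass(frozen=True)
class PlannerEnv:
    """Hypothetical container for the documented environment variables."""
    namespace: str
    parent_dgd_name: str
    prometheus_endpoint: str
    metrics_port: int  # 0 means the planner's own metrics endpoint is disabled


def load_planner_env(env: Mapping[str, str] = os.environ) -> PlannerEnv:
    # DYN_PARENT_DGD_K8S_NAME has no default, so fail early when it is missing.
    parent = env.get("DYN_PARENT_DGD_K8S_NAME")
    if not parent:
        raise ValueError("DYN_PARENT_DGD_K8S_NAME is required")
    return PlannerEnv(
        namespace=env.get("DYN_NAMESPACE", "dynamo"),
        parent_dgd_name=parent,
        prometheus_endpoint=env.get("PROMETHEUS_ENDPOINT", DEFAULT_PROMETHEUS_ENDPOINT),
        metrics_port=int(env.get("PLANNER_PROMETHEUS_PORT", "0")),
    )


# "my-dgd" is a placeholder resource name, not a real deployment.
config = load_planner_env({"DYN_PARENT_DGD_K8S_NAME": "my-dgd"})
print(config.namespace, config.metrics_port)
```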
Deploy the planner dashboard:

    kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml

The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
The planner queries the frontend's /metrics endpoint via Prometheus. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
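
To check that these metrics are reachable, you can issue an instant query against the Prometheus HTTP API (`/api/v1/query`). The sketch below builds such a query for a latency quantile from a histogram. The metric name `example_ttft_seconds`, the 180-second window, and the helper function names are placeholders chosen for illustration; substitute the metric names your frontend actually exports.

```python
import json
import urllib.parse
import urllib.request


def quantile_promql(histogram_metric: str, quantile: float, window: str = "180s") -> str:
    # Standard PromQL pattern for estimating a quantile from histogram buckets.
    return (
        f"histogram_quantile({quantile}, "
        f"sum(rate({histogram_metric}_bucket[{window}])) by (le))"
    )


def instant_query_url(prometheus_endpoint: str, promql: str) -> str:
    # Prometheus serves instant queries at /api/v1/query?query=<expr>.
    base = prometheus_endpoint.rstrip("/")
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_instant_query(prometheus_endpoint: str, promql: str, timeout: float = 10.0) -> dict:
    # Performs the HTTP request; requires network access to Prometheus.
    url = instant_query_url(prometheus_endpoint, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# "example_ttft_seconds" is a hypothetical metric name.
expr = quantile_promql("example_ttft_seconds", 0.95)
print(instant_query_url("http://localhost:9090", expr))
```

An empty `result` list in the response usually means the metric is not being scraped or the name does not match.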