
# Planner

The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.

New to the Planner? Start with the SLA Planner Quick Start Guide for a complete workflow including profiling and deployment.

## Feature Matrix

| Category | Feature | Status |
| --- | --- | --- |
| Backend | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| LLM Framework | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| Serving Type | Aggregated | Unsupported |
| | Disaggregated | Supported |
| Scaling Mode | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| Load Predictors | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| Connectors | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |

## Quick Start

### Prerequisites

The SLA planner requires profiling results for your model. Run SLA-driven profiling (see the SLA Planner Quick Start Guide) before deploying manually; the DGDR path below runs profiling for you.

### Deploy with DGDR (Recommended)

The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:

```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```

This automatically profiles your model and deploys it with the SLA planner. See the SLA Planner Guide for the full workflow.

### Deploy with DGD (Manual)

For manual control, use the disaggregated planner templates:

```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```

## Documentation

| Document | Description |
| --- | --- |
| Planner Guide | Deployment, configuration, integration, troubleshooting |
| Planner Examples | DGDR YAML examples, sample configurations, advanced patterns |
| SLA Planner Guide | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| SLA-based Planner | Scaling algorithm, correction factors, load prediction details |
| Load-based Planner | Legacy load-based scaling (deprecated) |
| SLA-Driven Profiling | Pre-deployment profiling process and configuration |
| Planner Design | Architecture deep-dive for contributors |

## Configuration Reference

### Key Arguments

| Argument | Default | Description |
| --- | --- | --- |
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length (tokens) |
| `--osl` | `150` | Expected average output sequence length (tokens) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
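These arguments are typically set on the planner component inside the DGD manifest. A minimal sketch, assuming a `Planner` service with a container-args override; the service name and field layout here are illustrative, not authoritative, and only the flags themselves come from the table above:

```yaml
# Hypothetical fragment of a DGD manifest: passing planner flags as
# container arguments. Only the flag names/defaults are from the docs;
# the surrounding structure is a sketch.
services:
  Planner:
    extraPodSpec:
      mainContainer:
        args:
          - --environment=kubernetes
          - --backend=vllm
          - --adjustment-interval=180   # make a scaling decision every 3 minutes
          - --ttft=500                  # SLA target: 500 ms time-to-first-token
          - --itl=50                    # SLA target: 50 ms inter-token latency
          - --load-predictor=arima
          - --max-gpu-budget=8
          - --profile-results-dir=profiling_results
```

The templates under `examples/backends/vllm/deploy/` show the exact structure your Dynamo version expects.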

### Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD Kubernetes resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for the planner's own Prometheus metrics |
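In a Kubernetes deployment these are set on the planner container. A minimal sketch; `my-deployment` and port `9085` are placeholder values you would replace with your own:

```yaml
# Hypothetical env block for the planner container.
env:
  - name: DYN_NAMESPACE
    value: dynamo
  - name: DYN_PARENT_DGD_K8S_NAME
    value: my-deployment            # placeholder: must match the parent DGD resource name
  - name: PROMETHEUS_ENDPOINT
    value: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090
  - name: PLANNER_PROMETHEUS_PORT
    value: "9085"                   # placeholder: any nonzero port enables planner metrics
```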

## Monitoring

### Grafana Dashboard

Deploy the planner dashboard:

```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```

The dashboard shows:

- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, and sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)

### Prometheus Metrics

The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:

- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
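To see the kind of data the planner consumes, you can run similar queries against Prometheus yourself. A sketch only: the metric names below are hypothetical and must be replaced with the series your frontend actually exports:

```promql
# Hypothetical metric names -- substitute the real series from your
# frontend's /metrics output.

# Median TTFT over a 3-minute window (matching --adjustment-interval=180):
histogram_quantile(0.5,
  sum(rate(frontend_time_to_first_token_seconds_bucket[3m])) by (le))

# Request rate per second over the same window:
sum(rate(frontend_requests_total[3m]))
```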