| title |
|---|
Router |
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
To launch the Dynamo frontend with the KV Router:
python -m dynamo.frontend --router-mode kv --http-port 8000For Kubernetes, set DYN_ROUTER_MODE=kv on the Frontend service. Workers automatically report KV cache events — no worker-side configuration changes needed.
| Argument | Default | Description |
|---|---|---|
--router-mode kv |
round_robin |
Enable KV cache-aware routing |
--router-kv-overlap-score-weight |
1.0 |
Balance prefill vs decode optimization (higher = better TTFT) |
--no-router-kv-events |
enabled | Fall back to approximate routing (no event consumption from workers) |
--router-queue-threshold |
disabled | Enable backpressure queue under high concurrency; also enables priority scheduling via nvext.agent_hints.latency_sensitivity |
For all CLI arguments, environment variables, K8s deployment examples, and tuning guidelines, see the Router Guide. For A/B benchmarking, see the KV Router A/B Benchmarking Guide.
Requirements:
- Dynamic endpoints only: KV router requires
register_model()withmodel_input=ModelInput.Tokens. Your backend handler receives pre-tokenized requests withtoken_idsinstead of raw text. - Backend workers must call
register_model()withmodel_input=ModelInput.Tokens(see Backend Guide) - You cannot use
--static-endpointmode with KV routing (use dynamic discovery instead)
Multimodal Support:
- TRT-LLM and vLLM: Multimodal routing supported for images via multimodal hashes
- SGLang: Image routing not yet supported
- Other modalities (audio, video, etc.): Not yet supported
Limitations:
- Static endpoints not supported—KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
For basic model registration without KV routing, use --router-mode round-robin or --router-mode random with both static and dynamic endpoints.
- Router Guide: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- Router Examples: Python API usage, K8s examples, and custom routing patterns
- Standalone Indexer: Run the KV indexer as a separate service for independent scaling
- Router Design: Architecture details, algorithms, and event transport modes