README.md (3 additions, 3 deletions)

@@ -97,13 +97,13 @@ Containers have all dependencies pre-installed. No setup required.
 
 ```bash
 # SGLang
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
 
 # TensorRT-LLM
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
 
 # vLLM
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
 ```
 
 > **Tip:** To run frontend and worker in the same container, either run processes in background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.
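The background-process pattern that the README tip refers to can be sketched in plain shell. This is a minimal, hypothetical illustration only: the function names and messages below are placeholders standing in for the frontend and worker launch commands, not actual Dynamo entrypoints.

```shell
# Hypothetical stand-ins for the frontend and worker processes; the '&' pattern,
# not these commands, is what the tip describes.
start_frontend() { sleep 1; echo "frontend ready"; }
start_worker()   { echo "worker ready"; }

start_frontend &   # run the "frontend" in the background with '&'
start_worker       # keep the "worker" in the foreground of the same shell
wait               # reap the background job before the shell exits
```

Inside the real runtime containers you would substitute the actual frontend and worker launch commands for the placeholders, or skip `&` entirely and attach a second shell with `docker exec -it <container_id> bash`.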
docs/getting-started/introduction.md (12 additions, 12 deletions)
@@ -1,13 +1,13 @@
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Introduction
+title: Introduction to Dynamo
 sidebar-title: Introduction
 ---
 
-# Introduction
+# Introduction to Dynamo
 
-Dynamo is NVIDIA's high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.
+Dynamo is an open-source, high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.
 
 > [!TIP]
 > Looking to get started right away? See the [Quickstart](quickstart.md) to install and run Dynamo in minutes.
@@ -53,12 +53,12 @@ The Dynamo ecosystem includes these additional modular components, and will cont
 | :--- | :--- | :--- |
 |**Scheduling**| Dynamo | Inference serving for GenAI workloads |
 |**Routing**| Router | Smart routing leveraging KV cache hit rate and KV cache load. More algorithms will be added (e.g., agentic routing) |
-|**Data Transfer**| NIXL | Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
+|**Data Transfer**|[NIXL](https://github.com/ai-dynamo/nixl)| Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
 |**Memory**| KVBM (KV Block Manager) | Manage KV cache across memory tiers (G1-G4) with customizable eviction policy |
 |**Scaling / Cloud**| Planner | Automatically tune performance in real time for prefill and decode given SLA constraints (TTFT and TPOT) |
-|| Grove | Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
+||[Grove](https://github.com/ai-dynamo/grove)| Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
 ||[Model Express](https://github.com/ai-dynamo/model-express)| Load model weights fast by caching and transferring them via NIXL to other GPUs. Will also be leveraged for fault tolerance |
-|**Perf**| AIConfigurator | Estimate performance for aggregated vs. disaggregated serving based on model, ISL/OSL, HW, etc. Formerly known as LLMPet |
+|**Perf**|[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator)| Estimate performance for aggregated vs. disaggregated serving based on model, ISL/OSL, HW, etc. Formerly known as LLMPet |
 ||[AIPerf](https://github.com/ai-dynamo/aiperf)| Re-architected GenAI-Perf written in Python for maximum extensibility; supports distributed benchmarking |
 || AITune | Given a model or pipeline, searches for best backend to deploy with (e.g., TensorRT, Torch.compile, etc.) (coming soon) |
 || Flex Tensor | Stream weights to GPUs from host memory to run very large language models in GPUs with limited memory capacity (coming soon) |
@@ -85,7 +85,7 @@ The full list of supported ecosystem components:
 
 ## Performance
 
-Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV CacheAware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
+Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
 
 -[KV cache-aware routing](../design-docs/router-design.md) Smartly routes requests based on worker load and existing cache hits. By reusing precomputed KV pairs, it bypasses the prefill compute, starting the decode phase immediately. [Baseten](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo) applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
 
@@ -94,11 +94,11 @@ Dynamo achieves state-of-the-art LLM performance by composing three core techniq
 -[Disaggregated serving](../design-docs/disagg-serving.md) In the Design Principles section, we introduced the concept of disaggregated serving. Its performance has been showcased by [InferenceX](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs). DeepSeek V3 can be served with ~7x throughput/GPU, with disaggregated serving and large-scale expert parallelism.
 Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.
 
-![dynamo performance plot](../images/kv-arch-performance-v2.png)
+![Dynamo performance plot](../images/kv-arch-performance.png)
 
--**Disaggregated serving + KV cache aware routing** -- KV cacheaware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
--**Disaggregated serving + KV cache offloading** -- KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to reduce TCO.
--**KV cache aware routing + KV cache offloading** -- Offloading increases the total addressable cache size, increasing the KV cache hit rate, which in turn accelerates the TTFT.
+-**Disaggregated Serving + KV Cache-Aware Routing** -- KV cache-aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
+-**Disaggregated Serving + KV Cache Offloading** -- KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to reduce TCO.
+-**KV Cache-Aware Routing + KV Cache Offloading** -- Offloading increases the total addressable cache size, increasing the KV cache hit rate, which in turn accelerates the TTFT.
 
 > [!TIP]
 > Ready to try these techniques? See [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.
@@ -153,7 +153,7 @@ Dynamo provides built-in metrics, distributed tracing, and logging for monitorin
 Explore the following resources to go deeper:
 
 -[Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) -- Compose disaggregated serving, routing, and offloading