
Commit b1818dc

Authored by dagil-nvidia, coderabbitai[bot], and nealvaidya
docs: cherry-pick docs updates to release/1.0.0 (#7330, #7336, #7350, #7352) (#7354)
Signed-off-by: Dan Gil <dagil@nvidia.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Neal Vaidya <nealv@nvidia.com>
1 parent: c886c58 · commit: b1818dc

7 files changed (+286 additions, −129 deletions)

README.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -97,13 +97,13 @@ Containers have all dependencies pre-installed. No setup required.
 
 ```bash
 # SGLang
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
 
 # TensorRT-LLM
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0
 
 # vLLM
-docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
+docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
 ```
 
 > **Tip:** To run frontend and worker in the same container, either run processes in background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.
````
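The `&` pattern from that tip can be sketched with placeholder commands; here `sleep` and `echo` stand in for the actual Dynamo frontend and worker launch commands, which vary by backend and are not shown in this diff:

```shell
# Backgrounding pattern from the tip: launch one process with '&' and keep
# the second in the foreground. 'sleep' and 'echo' are placeholders for the
# real Dynamo frontend and worker processes.
sleep 1 &                 # stand-in for the frontend, backgrounded with '&'
FRONTEND_PID=$!
echo "worker running"     # stand-in for the worker, kept in the foreground
wait "$FRONTEND_PID"      # block until the background "frontend" exits
```

Replace the placeholders with the launch commands for your chosen backend, or use the `docker exec` approach from the tip to get a second shell instead.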
Lines changed: 57 additions & 21 deletions

docs/assets/img/intro-perf.svg

Lines changed: 63 additions & 22 deletions

docs/getting-started/introduction.md

Lines changed: 12 additions & 12 deletions
```diff
@@ -1,13 +1,13 @@
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: Introduction
+title: Introduction to Dynamo
 sidebar-title: Introduction
 ---
 
-# Introduction
+# Introduction to Dynamo
 
-Dynamo is NVIDIA's high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.
+Dynamo is an open-source, high-throughput, low-latency inference framework, designed to serve generative AI workloads in distributed environments. This page gives an overview of Dynamo's design principles, performance benefits, and production-grade features.
 
 > [!TIP]
 > Looking to get started right away? See the [Quickstart](quickstart.md) to install and run Dynamo in minutes.
@@ -53,12 +53,12 @@ The Dynamo ecosystem includes these additional modular components, and will cont
 | :--- | :--- | :--- |
 | **Scheduling** | Dynamo | Inference serving for GenAI workloads |
 | **Routing** | Router | Smart routing leveraging KV cache hit rate and KV cache load. More algorithms will be added (e.g., agentic routing) |
-| **Data Transfer** | NIXL | Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
+| **Data Transfer** | [NIXL](https://github.com/ai-dynamo/nixl) | Point-to-point data transfer between GPUs and tiered storage (G1: GPU, G2: CPU, G3: SSD, G4: remote) |
 | **Memory** | KVBM (KV Block Manager) | Manage KV cache across memory tiers (G1-G4) with customizable eviction policy |
 | **Scaling / Cloud** | Planner | Automatically tune performance in real time for prefill and decode given SLA constraints (TTFT and TPOT) |
-| | Grove | Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
+| | [Grove](https://github.com/ai-dynamo/grove) | Enables gang scheduling and topology awareness required for Kubernetes multi-node disaggregated serving |
 | | [Model Express](https://github.com/ai-dynamo/model-express) | Load model weights fast by caching and transferring them via NIXL to other GPUs. Will also be leveraged for fault tolerance |
-| **Perf** | AIConfigurator | Estimate performance for aggregated vs. disaggregated serving based on model, ISL/OSL, HW, etc. Formerly known as LLMPet |
+| **Perf** | [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator) | Estimate performance for aggregated vs. disaggregated serving based on model, ISL/OSL, HW, etc. Formerly known as LLMPet |
 | | [AIPerf](https://github.com/ai-dynamo/aiperf) | Re-architected GenAI-Perf written in Python for maximum extensibility; supports distributed benchmarking |
 | | AITune | Given a model or pipeline, searches for best backend to deploy with (e.g., TensorRT, Torch.compile, etc.) (coming soon) |
 | | Flex Tensor | Stream weights to GPUs from host memory to run very large language models in GPUs with limited memory capacity (coming soon) |
@@ -85,7 +85,7 @@ The full list of supported ecosystem components:
 
 ## Performance
 
-Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
+Dynamo achieves state-of-the-art LLM performance by composing three core techniques: Disaggregated Serving, KV Cache-Aware Routing, and KV Cache Offloading. These techniques are underpinned by NIXL, a low-latency data transfer layer that enables seamless KV cache movement between nodes.
 
 - [KV cache-aware routing](../design-docs/router-design.md) Smartly routes requests based on worker load and existing cache hits. By reusing precomputed KV pairs, it bypasses the prefill compute, starting the decode phase immediately. [Baseten](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo) applied Dynamo KV cache-aware routing and saw 2x faster TTFT and 1.6x throughput on Qwen3 Coder 480B A35B.
 
@@ -94,11 +94,11 @@ Dynamo achieves state-of-the-art LLM performance by composing three core techniq
 - [Disaggregated serving](../design-docs/disagg-serving.md) In the Design Principles section, we introduced the concept of disaggregated serving. Its performance has been showcased by [InferenceX](https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs). DeepSeek V3 can be served with ~7x throughput/GPU, with disaggregated serving and large-scale expert parallelism.
 Furthermore, when these three techniques are composed together, they yield compounding benefits as shown in the following diagram.
 
-![Performance composability of disaggregated serving, KV cache aware routing, and KV cache offloading](../assets/img/intro-perf.svg)
+![Performance composability of disaggregated serving, KV cache-aware routing, and KV cache offloading](../assets/img/intro-perf.svg)
 
-- **Disaggregated serving + KV cache aware routing** -- KV cache aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
-- **Disaggregated serving + KV cache offloading** -- KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to reduce TCO.
-- **KV cache aware routing + KV cache offloading** -- Offloading increases the total addressable cache size, increasing the KV cache hit rate, which in turn accelerates the TTFT.
+- **Disaggregated Serving + KV Cache-Aware Routing** -- KV cache-aware routing load balances for both compute (on prefill) and memory (on decode), optimizing latency and throughput simultaneously.
+- **Disaggregated Serving + KV Cache Offloading** -- KV cache offloading results in faster TTFT, and the number of prefill workers can be reduced to reduce TCO.
+- **KV Cache-Aware Routing + KV Cache Offloading** -- Offloading increases the total addressable cache size, increasing the KV cache hit rate, which in turn accelerates the TTFT.
 
 > [!TIP]
 > Ready to try these techniques? See [Dynamo recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) for step-by-step deployment examples that compose disaggregated serving, routing, and offloading.
@@ -153,7 +153,7 @@ Dynamo provides built-in metrics, distributed tracing, and logging for monitorin
 Explore the following resources to go deeper:
 
 - [Recipes](https://github.com/ai-dynamo/dynamo/tree/main/recipes) -- Compose disaggregated serving, routing, and offloading
-- [KV Cache Aware Routing](../components/router/router-guide.md) -- Configure smart request routing
+- [KV Cache-Aware Routing](../components/router/router-guide.md) -- Configure smart request routing
 - [KV Cache Offloading](../components/kvbm/kvbm-guide.md) -- Set up multi-tier memory management
 - [Planner](../components/planner/planner-guide.md) -- Configure SLA-based autoscaling
 - [Kubernetes Deployment](../kubernetes/README.md) -- Deploy at scale with Grove
```

docs/index.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -147,10 +147,10 @@ navigation:
       path: fault-tolerance/request-migration.md
     - page: Request Cancellation
       path: fault-tolerance/request-cancellation.md
-    - page: Graceful Shutdown
-      path: fault-tolerance/graceful-shutdown.md
     - page: Request Rejection
       path: fault-tolerance/request-rejection.md
+    - page: Graceful Shutdown
+      path: fault-tolerance/graceful-shutdown.md
     - page: Testing
       path: fault-tolerance/testing.md
     - page: Writing Python Workers
```
