Commit e8c7bbf
docs: refactor Dynamo readme.md and quick_start_local.rst (#5649)
Signed-off-by: Dan Gil <[email protected]>
Co-authored-by: Cursor <[email protected]>
1 parent 7d5ed66 commit e8c7bbf

File tree: 6 files changed, +272 −185 lines

README.md

Lines changed: 144 additions & 113 deletions
@@ -44,115 +44,156 @@ Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provide
- **Accelerated Data Transfer** – Reduces inference response time using NIXL
- **KV Cache Offloading** – Leverages multiple memory hierarchies for higher throughput

Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.

## Backend Feature Support


|   | [SGLang](docs/backends/sglang/README.md) | [TensorRT-LLM](docs/backends/trtllm/README.md) | [vLLM](docs/backends/vllm/README.md) |
|---|:----:|:----------:|:--:|
| **Best For** | High-throughput serving | Maximum performance | Broadest feature coverage |
| [**Disaggregated Serving**](docs/design_docs/disagg_serving.md) ||||
| [**KV-Aware Routing**](docs/router/kv_cache_routing.md) ||||
| [**SLA-Based Planner**](docs/planner/sla_planner.md) ||||
| [**KVBM**](docs/kvbm/kvbm_architecture.md) | 🚧 |||
| [**Multimodal**](docs/multimodal/index.md) ||||
| [**Tool Calling**](docs/agents/tool-calling.md) ||||

> **[Full Feature Matrix →](docs/reference/feature-matrix.md)** — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.

## Dynamo Architecture

<p align="center">
  <img src="./docs/images/frontpage-architecture.png" alt="Dynamo architecture" width="600" />
</p>

> **[Architecture Deep Dive →](docs/design_docs/architecture.md)**

## Latest News

- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
- [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
- [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)

## Get Started

| Path | Use Case | Time | Requirements |
|------|----------|------|--------------|
| [**Local Quick Start**](#local-quick-start) | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| [**Kubernetes Deployment**](#kubernetes-deployment) | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
| [**Building from Source**](#building-from-source) | Contributors and development | ~15 min | Ubuntu, Rust, Python |

Want to help shape the future of distributed LLM inference? See the **[Contributing Guide](CONTRIBUTING.md)**.

# Local Quick Start

The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU; see [docs/reference/support-matrix.md](docs/reference/support-matrix.md).

## Install Dynamo

### Option A: Containers (Recommended)

Containers have all dependencies pre-installed. No setup required.

```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
```

> **Tip:** To run the frontend and a worker in the same container, either run processes in the background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.

See [Release Artifacts](docs/reference/release-artifacts.md#container-images) for available versions.

### Option B: Install from PyPI

The Dynamo team recommends the `uv` Python package manager, though any package manager works.

```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:

**SGLang**

```bash
sudo apt install python3-dev
uv pip install "ai-dynamo[sglang]"
```

> **Note:** For CUDA 13 (B300/GB300), the container is recommended. See the [SGLang install docs](https://docs.sglang.ai/start/install.html) for details.

**TensorRT-LLM**

```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```

> **Note:** TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend the [TensorRT-LLM container](docs/reference/release-artifacts.md#container-images) for broader compatibility.

**vLLM**

```bash
sudo apt install python3-dev libxcb1
uv pip install "ai-dynamo[vllm]"
```

## Run Dynamo

> **Tip (Optional):** Before running Dynamo, verify your system configuration with `python3 deploy/sanity_check.py`

Dynamo provides a simple way to spin up a local set of inference components, including:

- **OpenAI-Compatible Frontend** – High-performance, OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-Aware Router** – Routes and load-balances traffic across a set of workers.
- **Workers** – A set of pre-configured LLM serving engines.

Start the frontend:

> **Tip:** To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`

```bash
# Start an OpenAI-compatible HTTP server with prompt templating, tokenization, and routing.
# For local dev: --store-kv file avoids etcd (workers and frontend must share a disk)
python3 -m dynamo.frontend --http-port 8000 --store-kv file
```

In another terminal (or the same terminal if using background mode), start a worker for your chosen backend:

```bash
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file

# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file

# vLLM (note: uses --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
  --kv-events-config '{"enable_kv_cache_events": false}'
```

> **Note:** For dependency-free local development, disable KV event publishing (avoids NATS):
>
> - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`
> - **SGLang:** No flag needed (KV events disabled by default)
> - **TensorRT-LLM:** No flag needed (KV events disabled by default)
>
> **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected and can be safely ignored.
>
> See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.

#### Send a Request

@@ -172,13 +213,6 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"

Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
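
With `stream` set to `true`, the frontend emits server-sent events, one `data:` line per token chunk, ending with a `[DONE]` marker. As a minimal sketch of consuming such a stream, here is a parser assuming the standard OpenAI chunk shape (`choices[0].delta.content`); the exact payload fields may differ in practice:

```python
import json

def parse_sse_line(line: str):
    """Return the decoded JSON payload of one SSE 'data:' line.

    Returns None for non-data lines and for the terminal '[DONE]' marker.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

# Example chunks as an OpenAI-compatible server would stream them
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(
    event["choices"][0]["delta"].get("content", "")
    for line in stream
    if (event := parse_sse_line(line)) is not None
)
print(text)  # Hello
```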
# Kubernetes Deployment

For production deployments on Kubernetes clusters with multiple GPUs.
@@ -206,60 +240,6 @@ See [recipes/README.md](recipes/README.md) for the full list and deployment inst
- [Amazon EKS](examples/deployments/EKS/)
- [Google GKE](examples/deployments/GKE/)

# Building from Source

For contributors who want to build Dynamo from source rather than installing from PyPI.
@@ -347,13 +327,64 @@ cd $PROJECT_ROOT
uv pip install -e .
```
349329

350-
You should now be able to run `python3 -m dynamo.frontend`.
330+
## 8. Run the Frontend
331+
332+
```bash
333+
python3 -m dynamo.frontend
334+
```
335+
336+
## 9. Configure for Local Development
351337

352-
For local development, pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging section).
338+
- Pass `--store-kv file` to avoid external dependencies (see [Service Discovery and Messaging](#service-discovery-and-messaging))
339+
- Set `DYN_LOG` to adjust the logging level (e.g., `export DYN_LOG=debug`). Uses the same syntax as `RUST_LOG`
353340
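
Since `DYN_LOG` accepts `RUST_LOG`-style filter strings, per-target overrides work as well as a single global level. A couple of illustrative values (the `dynamo_llm` target name below is an assumption for demonstration, not a documented target):

```shell
# Global debug logging
export DYN_LOG=debug

# Default to info, but trace one target (target name is illustrative)
export DYN_LOG=info,dynamo_llm=trace
echo "$DYN_LOG"
```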

> **Note:** VSCode and Cursor users can use the `.devcontainer` folder for a pre-configured dev environment. See the [devcontainer README](.devcontainer/README.md) for details.

# Advanced Topics

## Benchmarking

Dynamo provides comprehensive benchmarking tools:

- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements

## Frontend OpenAPI Specification

The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate it without running the server:

```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```

This writes to `docs/frontends/openapi.json`.
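
The generated spec is plain JSON, so it is easy to inspect programmatically. A small sketch that lists the routes of an OpenAPI 3 document (the document below is a stand-in with illustrative paths, not the actual generated spec):

```python
import json

# Stand-in OpenAPI 3 document; the real spec lives at docs/frontends/openapi.json
SPEC_JSON = """
{
  "openapi": "3.0.0",
  "info": {"title": "dynamo-frontend", "version": "0.0.0"},
  "paths": {
    "/v1/chat/completions": {"post": {}},
    "/v1/models": {"get": {}}
  }
}
"""

spec = json.loads(SPEC_JSON)
for path, operations in sorted(spec["paths"].items()):
    # Print each route with its HTTP methods
    print(path, sorted(operations))
```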

## Service Discovery and Messaging

Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service_discovery.md)) handle service discovery. External services are optional for most deployments:

| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |

> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.

For Slurm or other distributed deployments (and KV-aware routing):

- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs JetStream enabled: `nats-server -js`.

To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`

See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LLM on Slurm](examples/basics/multinode/trtllm/README.md) for deployment examples.

## More News

- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)
- [11/13] [Dynamo Office Hours Playlist](https://www.youtube.com/playlist?list=PL5B692fm6--tgryKu94h2Zb7jTFM3Go4X)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)

<!-- Reference links for Feature Compatibility Matrix -->
[disagg]: docs/design_docs/disagg_serving.md

docs/_includes/install.rst

Lines changed: 0 additions & 44 deletions
This file was deleted.
