> **[Architecture Deep Dive →](docs/design_docs/architecture.md)**

## Latest News

- [12/05] [Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200](https://quantumzeitgeist.com/kimi-k2-nvidia-ai-ai-breakthrough/)
- [12/02] [Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo](https://www.marktechpost.com/2025/12/02/nvidia-and-mistral-ai-bring-10x-faster-inference-for-the-mistral-3-family-on-gb200-nvl72-gpu-systems/)
- [12/01] [InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference](https://www.infoq.com/news/2025/12/nvidia-dynamo-kubernetes/)
- [10/16] [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/)

## Get Started

| Path | Use Case | Time | Requirements |
|------|----------|------|--------------|
| [**Local Quick Start**](#local-quick-start) | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| [**Kubernetes Deployment**](#kubernetes-deployment) | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
| [**Building from Source**](#building-from-source) | Contributors and development | ~15 min | Ubuntu, Rust, Python |

## Contributing

Want to help shape the future of distributed LLM inference? See the **[Contributing Guide](CONTRIBUTING.md)**.

# Local Quick Start

The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU; see [docs/reference/support-matrix.md](docs/reference/support-matrix.md) for the full support matrix.

## Install Dynamo

### Option A: Containers (Recommended)

Containers have all dependencies pre-installed. No setup required.

```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
```

> **Tip:** To run the frontend and a worker in the same container, either run processes in the background with `&` (see below) or open a second terminal and use `docker exec -it <container_id> bash`.

See [Release Artifacts](docs/reference/release-artifacts.md#container-images) for available versions.

### Option B: Install from PyPI

The Dynamo team recommends the `uv` Python package manager, although any package manager works.

```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:

**SGLang**

```bash
sudo apt install python3-dev
uv pip install "ai-dynamo[sglang]"
```

> **Note:** For CUDA 13 (B300/GB300), the container is recommended. See the [SGLang install docs](https://docs.sglang.ai/start/install.html) for details.

**TensorRT-LLM**

> **Note:** TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend using the [TensorRT-LLM container](docs/reference/release-artifacts.md#container-images) for broader compatibility.
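A minimal install sketch, assuming the `trtllm` wheel extra follows the same naming pattern as the other backends and using `pip` as the note above requires:

```bash
# Assumption: the TensorRT-LLM extra mirrors the [sglang]/[vllm] pattern.
sudo apt install python3-dev
pip install "ai-dynamo[trtllm]"
```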

**vLLM**
```bash
sudo apt install python3-dev libxcb1
uv pip install "ai-dynamo[vllm]"
```

## Run Dynamo

> **Tip (Optional):** Before running Dynamo, verify your system configuration with `python3 deploy/sanity_check.py`. This quickly checks system resources, development tools, LLM frameworks, and Dynamo components.

Dynamo provides a simple way to spin up a local set of inference components, including:

- **OpenAI-Compatible Frontend** – A high-performance, OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-Aware Router** – Routes and load-balances traffic to a set of workers.
- **Workers** – A set of pre-configured LLM serving engines.

Start the frontend:

> **Tip:** To run in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`

```bash
# Start an OpenAI-compatible HTTP server with prompt templating, tokenization, and routing.
# For local dev: --store-kv file avoids etcd (workers and frontend must share a disk).
python3 -m dynamo.frontend --store-kv file
```
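With the frontend up, start a worker in a second terminal (or in the background, as in the tip above). The exact command depends on your backend; the sketch below is an assumption based on the backend guides, with an illustrative module path, `--model` flag, and model name rather than commands verbatim from this README:

```bash
# Hypothetical vLLM worker start; see docs/backends/ for the exact flags.
# --store-kv file mirrors the frontend's local-dev setting (shared disk required).
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file
```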
> **Note:** For dependency-free local development, disable KV event publishing (avoids NATS):
>
> - **SGLang:** No flag needed (KV events are disabled by default)
> - **TensorRT-LLM:** No flag needed (KV events are disabled by default)
> - **vLLM:** Add `--kv-events-config '{"enable_kv_cache_events": false}'`, which keeps local prefix caching enabled while disabling event publishing
>
> **TensorRT-LLM only:** The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected and can be safely ignored.
>
> See [Service Discovery and Messaging](#service-discovery-and-messaging) for details.

> **Note:** TensorRT-LLM requires `pip` (not `uv`) due to URL-based dependencies. See the [TRT-LLM guide](docs/backends/trtllm/) for container setup and prerequisites.
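Once the frontend and a worker are running, you can exercise the OpenAI-compatible API. A minimal sketch, assuming the frontend listens on its default local port 8000 (an assumption; adjust to your setup) and serves the illustrative model from the worker sketch above:

```bash
# Send a chat completion request to the OpenAI-compatible endpoint.
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```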
Use `CUDA_VISIBLE_DEVICES` to specify which GPUs to use. Engine-specific options (context length, multi-GPU, etc.) are documented in each backend guide.
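For example, to pin a worker to a single GPU (reusing the hypothetical worker command from the sketch above):

```bash
# Restrict the worker to GPU 0; adjust the worker command for your backend.
CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```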
# Building from Source
For contributors who want to build Dynamo from source rather than installing from PyPI.

```bash
cd $PROJECT_ROOT
uv pip install -e .
```

## 8. Run the Frontend

```bash
python3 -m dynamo.frontend
```

## 9. Configure for Local Development

- Pass `--store-kv file` to avoid external dependencies (see [Service Discovery and Messaging](#service-discovery-and-messaging)); a combined example follows this list
- Set `DYN_LOG` to adjust the logging level (e.g., `export DYN_LOG=debug`); it uses the same syntax as `RUST_LOG`
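Putting the two together, a typical local-dev frontend invocation (a sketch composed only of flags documented above):

```bash
# Debug logging plus file-based KV store (no etcd needed).
DYN_LOG=debug python3 -m dynamo.frontend --store-kv file
```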
> **Note:** VSCode and Cursor users can use the `.devcontainer` folder for a pre-configured dev environment. See the [devcontainer README](.devcontainer/README.md) for details.

# Advanced Topics

## Benchmarking

Dynamo provides comprehensive benchmarking tools:

- **[Benchmarking Guide](docs/benchmarks/benchmarking.md)** – Compare deployment topologies using AIPerf
- **[SLA-Driven Deployments](docs/planner/sla_planner_quickstart.md)** – Optimize deployments to meet SLA requirements

## Frontend OpenAPI Specification

The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate it without running the server:

```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```

This writes the spec to `docs/frontends/openapi.json`.
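With the frontend running, the same spec can be fetched live. A minimal sketch (port 8000 is an assumption; substitute whatever port your frontend binds):

```bash
# Fetch the spec from the running server and pretty-print the first lines.
curl -s localhost:8000/openapi.json | python3 -m json.tool | head -n 20
```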

## Service Discovery and Messaging

Dynamo uses TCP for inter-component communication. On Kubernetes, native resources ([CRDs + EndpointSlices](docs/kubernetes/service_discovery.md)) handle service discovery. External services are optional for most deployments:

| Deployment | etcd | NATS | Notes |
|------------|------|------|-------|
| **Local Development** | ❌ Not required | ❌ Not required | Pass `--store-kv file`; vLLM also needs `--kv-events-config '{"enable_kv_cache_events": false}'` |
| **Kubernetes** | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |

> **Note:** KV-Aware Routing requires NATS for prefix caching coordination.

For Slurm or other distributed deployments (and KV-aware routing):

- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [NATS](https://nats.io/) needs JetStream enabled (e.g., `nats-server -js`).

To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`

See [SGLang on Slurm](examples/backends/sglang/slurm_jobs/README.md) and [TRT-LLM on Slurm](examples/basics/multinode/trtllm/README.md) for deployment examples.

## More News

- [11/20] [Dell integrates PowerScale with Dynamo's NIXL for 19x faster TTFT](https://www.dell.com/en-us/dt/corporate/newsroom/announcements/detailpage.press-releases~usa~2025~11~dell-technologies-and-nvidia-advance-enterprise-ai-innovation.htm)
- [11/20] [WEKA partners with NVIDIA on KV cache storage for Dynamo](https://siliconangle.com/2025/11/20/nvidia-weka-kv-cache-solution-ai-inferencing-sc25/)