This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
We recommend using the latest stable release of Dynamo to avoid breaking changes:
You can find the latest release here and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

- Feature Support Matrix
- Quick Start
- Single Node Examples
- Advanced Examples
- Deploy on Kubernetes
- Configuration
| Feature | vLLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | WIP |
| KVBM | ✅ | |
| LMCache | ✅ | |
| Prompt Embeddings | ✅ | Requires the `--enable-prompt-embeds` flag |
| Feature | vLLM | Notes |
|---|---|---|
| WideEP | ✅ | Support for PPLX / DeepEP not verified |
| DP Rank Routing | ✅ | Supported via external control of DP ranks |
| GB200 Support | 🚧 | Container functional on main |
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```

Note

- etcd is optional but is the default local discovery backend. You can also use `--kv_store file` to use file-system-based discovery.
- NATS is optional; it is only needed when using KV routing with events (the default). You can disable it with the `--no-kv-events` flag for prediction-based routing.
- On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD).
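As a rough illustration, a minimal local run without etcd or NATS might look like the following. This is a sketch only: which component accepts each flag (frontend vs. worker) may differ in your Dynamo version, and the model name is illustrative.

```bash
# Sketch: file-based discovery (no etcd) and prediction-based routing (no NATS)
python3 -m dynamo.frontend --kv_store file --no-kv-events
python3 -m dynamo.vllm --kv_store file --model Qwen/Qwen3-0.6B
```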
We have public images available on NGC Catalog. If you'd like to build your own container from source:
```bash
python container/render.py --framework=vllm --target=runtime --short-output
docker build -t dynamo:vllm-latest -f container/rendered.Dockerfile .
```

Then run the container:

```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```

This includes the specific commit vllm-project/vllm#19790, which enables support for external control of the DP ranks.
Important
Below we provide simple shell scripts that run the components for each configuration. Each shell script runs python3 -m dynamo.frontend to start the ingress and uses python3 -m dynamo.vllm to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
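As a concrete illustration, an aggregated launch boils down to something like the following two commands run in separate terminals (a minimal sketch; the actual launch scripts may pass additional flags, and the model name is illustrative):

```bash
# Terminal 1: start the ingress/frontend
python3 -m dynamo.frontend

# Terminal 2: start a vLLM worker
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
```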
```bash
# requires one gpu
cd examples/backends/vllm
bash launch/agg.sh
```

```bash
# requires two gpus
cd examples/backends/vllm
bash launch/agg_router.sh
```

```bash
# requires two gpus
cd examples/backends/vllm
bash launch/disagg.sh
```

```bash
# requires three gpus
cd examples/backends/vllm
bash launch/disagg_router.sh
```

This example is not meant to be performant but showcases Dynamo routing to data-parallel workers:

```bash
# requires four gpus
cd examples/backends/vllm
bash launch/dep.sh
```

Tip
Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
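For example, with a disaggregated setup already running, a sketch of joining one more prefill worker (flags are described in the Configuration section; the model name is illustrative and must match the running deployment):

```bash
# Join the running deployment as an additional prefill-only worker;
# Dynamo discovers it automatically via the discovery backend (e.g., etcd)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker
```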
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
Run Meta-Llama-3.1-8B-Instruct with Eagle3 as a draft model using aggregated speculative decoding on a single node. This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the vLLM aggregated serving framework for faster inference while maintaining accuracy.
Guide: Speculative Decoding Quickstart
See also: Speculative Decoding Feature Overview for cross-backend documentation.
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see vLLM Kubernetes Deployment Guide
vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
- `--connector`: Specify which kv_transfer_config you want vLLM to use (`nixl`, `lmcache`, `kvbm`, `none`). This is a helper flag which overwrites the engine's KVTransferConfig.
- `--enable-prompt-embeds`: Enable the prompt embeddings feature (opt-in, default: disabled)
  - Required for: accepting pre-computed prompt embeddings via the API
  - Default behavior: prompt embeddings are disabled; requests with `prompt_embeds` will fail
  - Error without the flag: `ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
See args.py for the full list of configuration options and their defaults.
To see which vLLM CLI arguments can be added, run `vllm serve --help`; we use the same argument parser as vLLM.
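Putting the key flags together, a hypothetical disaggregated prefill worker invocation might look like this (model name and port are illustrative; adjust to your deployment):

```bash
python3 -m dynamo.vllm \
  --model Qwen/Qwen3-0.6B \
  --is-prefill-worker \
  --connector nixl \
  --metrics-endpoint-port 9090   # illustrative port
```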
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:

  ```bash
  vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
  ```

See the high-level notes in Router Design on deterministic event IDs.
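To see why `PYTHONHASHSEED` matters: Python randomizes its builtin string hash per process by default, so two workers would disagree on prefix hashes unless the seed is pinned. A quick local demonstration (plain Python, no vLLM required):

```shell
# With a fixed seed, two separate Python processes agree on hash()
a=$(PYTHONHASHSEED=0 python3 -c "print(hash('prefix block'))")
b=$(PYTHONHASHSEED=0 python3 -c "print(hash('prefix block'))")

# Without a seed, hash randomization makes separate processes (almost surely) disagree
c=$(env -u PYTHONHASHSEED python3 -c "print(hash('prefix block'))")
d=$(env -u PYTHONHASHSEED python3 -c "print(hash('prefix block'))")
```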
Dynamo supports request migration to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the Request Migration Architecture documentation for configuration details.
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
| | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ✅ | ✅ |
For more details, see the Request Cancellation Architecture documentation.