We recommend using the latest stable release of Dynamo to avoid breaking changes.
You can find the latest release here and check out the corresponding tag with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```

- Feature Support Matrix
- Dynamo SGLang Integration
- Installation
- Quick Start
- Single Node Examples
- Multi-Node and Advanced Examples
- Deploy on SLURM or Kubernetes
| Feature | SGLang | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP PR |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Multimodal Support | ✅ | |
| KVBM | ❌ | Planned |
Dynamo SGLang integrates SGLang engines into Dynamo's distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang's engine arguments.
Dynamo SGLang uses SGLang's native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
| Argument | Description | Default | SGLang Equivalent |
|---|---|---|---|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See Request Migration Architecture. | `0` (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | None | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | None | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
| `--custom-jinja-template` | Use a custom chat template for the model (takes precedence over the default chat template in the model repo) | None | `--chat-template` |
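As a sketch of how the Dynamo-specific flags combine with ordinary SGLang engine arguments on one command line (the model, endpoint name, and flag values below are illustrative assumptions, not defaults):

```shell
# Hypothetical launch command: Dynamo-specific flags (--endpoint,
# --migration-limit) sit alongside standard SGLang flags (--model-path, --tp).
python -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --tp 1 \
  --trust-remote-code \
  --endpoint dyn://my-namespace.worker.generate \
  --migration-limit 3
```

Because SGLang's own parser handles the unrecognized-by-Dynamo flags, any engine argument SGLang accepts can be appended here unchanged.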
Tokenization behavior depends on whether `--use-sglang-tokenizer` is set:

- Default (`--use-sglang-tokenizer` not set): Dynamo handles tokenization/detokenization via our fast Rust frontend and passes `input_ids` to SGLang
- With `--use-sglang-tokenizer`: SGLang handles tokenization/detokenization, and Dynamo passes raw prompts
> [!NOTE]
> When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
|  | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ✅ |  |
> [!WARNING]
> For more details, see the Request Cancellation Architecture documentation.
We suggest using uv to install the latest release of `ai-dynamo[sglang]`. You can install uv with:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
```bash
# create a virtual env
uv venv --python 3.12 --seed

# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
This requires having Rust installed. We also recommend a proper installation of the CUDA toolkit, as sglang requires `nvcc` to be available.
```bash
# create a virtual env
uv venv --python 3.12 --seed

# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME

# installs sglang supported version along with dynamo
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .

# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
We are in the process of shipping pre-built Docker containers that include installations of DeepEP, DeepGEMM, and NVSHMEM to support WideEP and P/D. For now, you can quickly build the container from source with the following command:
```bash
cd $DYNAMO_ROOT
./container/build.sh \
  --framework SGLANG \
  --tag dynamo-sglang:latest
```

Then run it using:
```bash
docker run \
  --gpus all \
  -it \
  --rm \
  --network host \
  --shm-size=10G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  --cap-add CAP_SYS_PTRACE \
  --ipc host \
  dynamo-sglang:latest
```

Below we provide a guide that lets you run all of our common deployment patterns on a single node.
Start the supporting services using Docker Compose:

```bash
docker compose -f deploy/docker-compose.yml up -d
```

> [!TIP]
> Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can easily take each command and run it in a separate terminal.
Additionally, because we use sglang's argument parser, you can pass any argument that sglang supports to the worker!
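To illustrate the frontend/worker split described above, an aggregated launch script is roughly equivalent to running the two processes yourself in separate terminals. This is a sketch, not the scripts' exact contents; the port and model name are illustrative assumptions:

```shell
# Terminal 1: OpenAI-compatible frontend/processor (Rust), serving on port 8000
python -m dynamo.frontend --http-port 8000

# Terminal 2: SGLang worker (Python engine) that registers with the runtime;
# any additional sglang engine flags can be appended here
python -m dynamo.sglang --model-path Qwen/Qwen3-0.6B
```

Splitting the processes this way is also how you scale out: you can start additional workers in further terminals while keeping a single frontend.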
Aggregated serving:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg.sh
```

Aggregated serving with the KV router:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_router.sh
```

Here's an embeddings example that uses the Qwen/Qwen3-Embedding-4B model:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/agg_embed.sh
```

Send the following request to verify your deployment:
```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'
```

See SGLang Disaggregation to learn more about how sglang and dynamo handle disaggregated serving.
Disaggregated serving:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg.sh
```

Disaggregated serving with the KV router:

```bash
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_router.sh
```

You can use this configuration to test out disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes:

```bash
# note: this requires 4 GPUs
cd $DYNAMO_HOME/examples/backends/sglang
./launch/disagg_dp_attn.sh
```

Send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```

Below we provide a selected list of advanced examples. Please open an issue if you'd like to see a specific example!
We currently provide deployment examples for Kubernetes and SLURM.