Tip
Scope of This Guide
This document guides you through setting up and testing an NVIDIA NeMo Agent Toolkit-compatible Dynamo inference server on a Linux/CUDA machine. By the end of this guide, you will be able to make curl requests to the endpoint and receive inference outputs from the Dynamo server.
For end-to-end integration with NeMo Agent Toolkit workflows, including detailed instructions and architectural considerations, see the Dynamo Integration Examples.
This guide covers setting up, running, and configuring the NVIDIA Dynamo backend for the React Benchmark Agent evaluations.
- Overview
- Prerequisites
- Starting Dynamo
- Building from Source
- Stopping Dynamo
- Testing the Integration
- Monitoring
- Dynamic Prefix Headers
- Configuration Reference
- Troubleshooting
Dynamo is NVIDIA's high-performance LLM serving platform with KV cache optimization. The current integration covers two core aspects. First, we have implemented Dynamo LLM support for NeMo Agent Toolkit inference on Dynamo runtimes. Second, we provide a set of startup scripts for NVIDIA Hopper and Blackwell GPU servers supporting NeMo Agent Toolkit runtimes at scale. The following table describes each script:
| Mode | Script | Description | Best For |
|---|---|---|---|
| Unified | `start_dynamo_unified.sh` | Workers responsible for both prefill and decode | Development, testing |
| Unified + Thompson | `start_dynamo_unified_thompson_hints.sh` | Unified with a predictive KV-aware router | Production, KV optimization |
| Disaggregated | `start_dynamo_disagg.sh` | Separate prefill and decode workers | High-throughput production |
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                         DYNAMO BACKEND ARCHITECTURE                          │
└──────────────────────────────────────────────────────────────────────────────┘

                                CLIENT REQUEST
                             (eval, curl, Python)
                                       │
                                       │ POST /v1/chat/completions
                                       │ Headers:
                                       │   x-prefix-id: react-bench-a1b2c3d4
                                       │   x-prefix-total-requests: 10
                                       │   x-prefix-osl: MEDIUM
                                       │   x-prefix-iat: MEDIUM
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                               DYNAMO FRONTEND                                │
│                                  Port 8099                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                      HTTP API (OpenAI Compatible)                      │  │
│  │  ────────────────────────────────────────────────────────────────────  │  │
│  │  • /v1/chat/completions - Chat completion endpoint                     │  │
│  │  • /v1/models           - List available models                        │  │
│  │  • /health              - Health check                                 │  │
│  │  • Extract x-prefix-* headers for router hints                         │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                      │                                       │
│                                      ▼                                       │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                               PROCESSOR                                │  │
│  │  ────────────────────────────────────────────────────────────────────  │  │
│  │  • Tokenize messages → token_ids                                       │  │
│  │  • Extract prefix hints from headers                                   │  │
│  │  • Format engine request                                               │  │
│  │  • Track prefix state (outstanding requests)                           │  │
│  │  • CSV metrics logging                                                 │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                      │                                       │
│                                      ▼                                       │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                                 ROUTER                                 │  │
│  │  ────────────────────────────────────────────────────────────────────  │  │
│  │                                                                        │  │
│  │  ┌──────────────────────┐    ┌──────────────────────────────────────┐  │  │
│  │  │   Worker Selection   │    │     Thompson Sampling (Optional)     │  │  │
│  │  │   ────────────────   │    │     ────────────────────────────     │  │  │
│  │  │  1. KV cache overlap │    │  • LinTS for continuous params       │  │  │
│  │  │  2. Worker affinity  │    │  • Beta bandits for discrete         │  │  │
│  │  │  3. Load balancing   │    │  • Explores vs exploits workers      │  │  │
│  │  │  4. OSL+IAT hints    │    │  • Learns optimal routing            │  │  │
│  │  └──────────────────────┘    └──────────────────────────────────────┘  │  │
│  │                                                                        │  │
│  │  Routing Decision Factors:                                             │  │
│  │  • overlap_score: KV cache reuse potential                             │  │
│  │  • prefill_cost: Estimated prefill compute                             │  │
│  │  • decode_cost: Based on OSL hint (LOW=1.0, MEDIUM=2.0, HIGH=3.0)      │  │
│  │  • iat_factor: Stickiness based on IAT (LOW=1.5, MEDIUM=1.0, HIGH=2.0) │  │
│  │  • load_modifier: Current worker queue depth                           │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                      │                                       │
└──────────────────────────────────────┼───────────────────────────────────────┘
                                       │
                                       │ Route to selected worker
                                       │
              ┌────────────────────────┴────────────────────────────┐
              │                                                     │
              ▼                                                     ▼
┌─────────────────────────────┐        ┌─────────────────────────────┐
│       UNIFIED WORKER        │   OR   │   DISAGGREGATED WORKERS     │
│    (GPUs 0,1,2,3, TP=4)     │        │                             │
│                             │        │  ┌────────────────────────┐ │
│  ┌───────────────────────┐  │        │  │     PREFILL WORKER     │ │
│  │     SGLang Engine     │  │        │  │    (GPUs 0,1, TP=2)    │ │
│  │   ─────────────────   │  │        │  │  • Initial KV compute  │ │
│  │ • Model: Llama-3.3-70B│  │        │  │  • Sends KV via NIXL   │ │
│  │ • KV Cache Management │  │        │  └───────────┬────────────┘ │
│  │ • Token Generation    │  │        │              │              │
│  │ • Streaming Support   │  │        │              │ NIXL KV      │
│  └───────────────────────┘  │        │              │ Transfer     │
│                             │        │              ▼              │
│   All operations in one     │        │  ┌────────────────────────┐ │
│   worker                    │        │  │      DECODE WORKER     │ │
│                             │        │  │    (GPUs 2,3, TP=2)    │ │
│                             │        │  │  • Token generation    │ │
│                             │        │  │  • Streaming output    │ │
│                             │        │  └────────────────────────┘ │
└─────────────────────────────┘        └─────────────────────────────┘
              │                                      │
              └──────────────────┬───────────────────┘
                                 │
                                 ▼
                      ┌──────────────────────┐
                      │  STREAMING RESPONSE  │
                      │ ──────────────────── │
                      │  {"choices": [...],  │
                      │   "content": "..."}  │
                      └──────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────┐
│                           INFRASTRUCTURE SERVICES                            │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌────────────────────────┐          ┌────────────────────────┐             │
│   │          ETCD          │          │          NATS          │             │
│   │  ────────────────────  │          │  ────────────────────  │             │
│   │  • Worker discovery    │          │  • Message queue       │             │
│   │  • Metadata storage    │          │  • Prefill requests    │             │
│   │  • Health tracking     │          │  • JetStream enabled   │             │
│   │  Port: 2379/2389       │          │  Port: 4222/4232       │             │
│   └────────────────────────┘          └────────────────────────┘             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
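The "Routing Decision Factors" listed in the diagram can be combined into a single per-worker cost. The sketch below is illustrative only: the OSL and IAT constants come from the diagram, but the weighting and the combination formula are assumptions, not the actual router implementation.

```python
# Hypothetical sketch of how the routing factors above might be combined.
# The OSL/IAT constants mirror the diagram; the formula itself is illustrative.
OSL_DECODE_COST = {"LOW": 1.0, "MEDIUM": 2.0, "HIGH": 3.0}
IAT_STICKINESS = {"LOW": 1.5, "MEDIUM": 1.0, "HIGH": 2.0}

def score_worker(overlap_score: float, prefill_tokens: int, osl: str,
                 iat: str, queue_depth: int, is_sticky: bool) -> float:
    """Lower is better: estimated cost of sending this request to one worker."""
    prefill_cost = prefill_tokens * (1.0 - overlap_score)  # KV reuse shrinks prefill
    decode_cost = OSL_DECODE_COST[osl]
    load_modifier = 1.0 + 0.1 * queue_depth                # penalize deep queues
    affinity = IAT_STICKINESS[iat] if is_sticky else 1.0   # reward prefix affinity
    return (prefill_cost + decode_cost) * load_modifier / affinity

def pick_worker(workers: dict, prefill_tokens: int, osl: str, iat: str) -> str:
    """workers: name -> (overlap_score, queue_depth, is_sticky)."""
    return min(workers, key=lambda w: score_worker(
        workers[w][0], prefill_tokens, osl, iat, workers[w][1], workers[w][2]))

workers = {
    "worker-0": (0.8, 3, True),   # high KV overlap, moderately loaded, sticky
    "worker-1": (0.0, 0, False),  # cold cache, idle
}
print(pick_worker(workers, prefill_tokens=100, osl="MEDIUM", iat="LOW"))
# -> worker-0 (KV overlap outweighs its deeper queue)
```

With these numbers, worker-0 scores (20 + 2) × 1.3 / 1.5 ≈ 19.1 against worker-1's 102, so the KV-warm worker wins despite its queue.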
Warning
This example requires a Linux system with an NVIDIA GPU. See the Dynamo Support Matrix for full details.
Supported Platforms:
- Ubuntu 22.04 / 24.04 (x86_64)
- Ubuntu 24.04 (ARM64)
- CentOS Stream 9 (x86_64, experimental)
Not Supported:
- ❌ macOS (Intel or Apple Silicon)
- ❌ Windows
You do not need to install ai-dynamo or ai-dynamo-runtime packages locally. The Dynamo server runs inside pre-built Docker images from NGC (nvcr.io/nvidia/ai-dynamo/sglang-runtime), which include all necessary components. The NeMo Agent Toolkit Dynamo LLM client (_type: dynamo) is a pure HTTP client that works on any platform.
| Component | Minimum | Recommended |
|---|---|---|
| GPU Architecture | NVIDIA Hopper (H100) | B200 for higher throughput |
| GPU Count | 2 GPUs for small models (2 workers) | 8 GPUs for optimal performance |
| GPU Memory | 80GB per GPU (H100) | 192GB per GPU (B200) |
| System RAM | 256GB | 512GB+ |
Note: The Llama-3.3-70B-Instruct model requires approximately 140GB of GPU memory when loaded with TP=4 (tensor parallelism across 4 GPUs). Ensure your GPU configuration has sufficient aggregate memory. If Llama-3.3-70B-Instruct does not fit into your GPU memory, follow the same steps with Llama-3.1-8B-Instruct for QA validation.
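The ~140GB figure is straightforward arithmetic: 70B parameters at 2 bytes each (FP16/BF16) is roughly 140GB of weights, divided across GPUs by tensor parallelism; KV cache and activations add overhead on top. A quick sanity check:

```python
# Back-of-the-envelope weight-memory estimate (weights only; KV cache,
# activations, and framework overhead add to these numbers).
def weight_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9  # GB of model weights

def per_gpu_gb(params_billions: float, tp: int, bytes_per_param: int = 2) -> float:
    return weight_gb(params_billions, bytes_per_param) / tp

print(weight_gb(70))      # 140.0 GB total for Llama-3.3-70B in BF16
print(per_gpu_gb(70, 4))  # 35.0 GB of weights per GPU at TP=4
print(per_gpu_gb(8, 2))   # 8.0 GB of weights per GPU for an 8B model at TP=2
```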
Warning
This example requires a CUDA-compatible device with NVIDIA drivers installed. It cannot be run on systems without NVIDIA GPU hardware. You do not need to install ai-dynamo packages separately; the provided Docker images include them.
- Docker installed and running (version 24.0+), with NVIDIA Container Toolkit
- NVIDIA Driver with CUDA 12.0+ support, with `nvidia-fabricmanager` enabled and matching the `nvidia-smi` version. Verify with:

  ```bash
  docker run --rm --gpus all nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
    bash -c "apt-get update && apt-get install -y python3-pip && pip3 install torch && python3 -c 'import torch; print(torch.cuda.is_available())'"
  ```

  The output should show `True`. If it shows `False` with error 802, ensure `nvidia-fabricmanager` is installed, running, and matches your driver version.
- Hugging Face CLI for model downloads (optional, if the model is not already downloaded)
- Llama-3.3-70B-Instruct model downloaded locally
- Python `uv` environment with Python 3.11-3.13
```bash
cd /path/to/NeMo-Agent-Toolkit
uv venv "${HOME}/.venvs/nat_dynamo_eval" --python 3.13
source "${HOME}/.venvs/nat_dynamo_eval/bin/activate"
# install the NeMo Agent Toolkit
uv pip install -e ".[langchain]"
uv pip install -e examples/dynamo_integration/react_benchmark_agent
```

To activate an existing environment:
source "${HOME}/.venvs/nat_dynamo_eval/bin/activate"Before running the Dynamo scripts, configure the following environment variables. See .env.example for a complete list of all available options.
```bash
cd external/dynamo/
# Copy and customize the example environment file
cp .env.example .env
# Edit with your settings
vi .env
# Source the environment before running scripts
source .env
```

OR set variables directly:
```bash
export HF_HOME=/path/to/local/storage/.cache/huggingface
export HF_TOKEN=my_huggingface_read_token
# Required: Set your model directory path
export DYNAMO_MODEL_DIR=/path/to/your/models/Llama-3.3-70B-Instruct  # or Llama-3.1-8B-Instruct for QA on H100 machines
# Optional: Set repository directory (for Thompson Sampling router)
export DYNAMO_REPO_DIR=/path/to/NeMo-Agent-Toolkit/external/dynamo
# Optional: Configure GPU devices (default: 0,1,2,3)
export DYNAMO_GPU_DEVICES=0,1,2,3
```

```bash
[ -f .env ] && source .env || { echo "Warning: .env not found" >&2; false; }
# Change to the target model directory (create it if still needed)
cd "$(dirname "$DYNAMO_MODEL_DIR")"
# We will download the model weights directly from HuggingFace. See `NOTE` below.
uv pip install huggingface_hub
uv run huggingface-cli login  # Set or enter your HF token.
# OR: run it with python: `python -c "from huggingface_hub import login; login()"`
uv run huggingface-cli download "meta-llama/Llama-3.3-70B-Instruct" --local-dir "$DYNAMO_MODEL_DIR"
# OR: run it with python: `python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.3-70B-Instruct', local_dir='$DYNAMO_MODEL_DIR')"`
```

Note
The Llama-3.3-70B-Instruct model requires approval from Meta. Request access at huggingface.co/meta-llama/Llama-3.3-70B-Instruct before downloading. You will need to create a HuggingFace access token with read access to download the model. On the HuggingFace website, visit "Access Tokens" -> "+ Create access token" to generate a token starting with `hf_`. Enter your token when prompted, and respond "n" when asked "Add token as git credential? (Y/n)". Set `HF_HOME` and `HF_TOKEN` in `.env`.
```bash
# Check NVIDIA driver and GPU availability
nvidia-smi
# Expected output should show:
# - At least 4 GPUs (H100 or B200)
# - CUDA version 12.0+
# - Sufficient free memory per GPU
```

Example output for an 8-GPU system:
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:1B:00.0 Off |                    0 |
| N/A   31C    P0            187W / 1000W |  169082MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA B200                    On  |   00000000:43:00.0 Off |                    0 |
| N/A   31C    P0            187W / 1000W |  169178MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA B200                    On  |   00000000:52:00.0 Off |                    0 |
| N/A   36C    P0            193W / 1000W |  169230MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA B200                    On  |   00000000:61:00.0 Off |                    0 |
| N/A   36C    P0            195W / 1000W |  169230MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA B200                    On  |   00000000:9D:00.0 Off |                    0 |
| N/A   32C    P0            139W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA B200                    On  |   00000000:C3:00.0 Off |                    0 |
| N/A   30C    P0            139W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA B200                    On  |   00000000:D1:00.0 Off |                    0 |
| N/A   34C    P0            141W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA B200                    On  |   00000000:DF:00.0 Off |                    0 |
| N/A   35C    P0            139W / 1000W |       4MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
```
```bash
# Verify Docker is running
docker info
```

The startup scripts can be found in the same directory (`NeMo-Agent-Toolkit/external/dynamo/`) as this README.md.
Single worker handling all operations. Simpler setup, good for development and testing.
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
# Start Dynamo (do NOT use 'source')
bash start_dynamo_unified.sh > startup_output.txt 2>&1
# Wait for startup (watch GPU memory)
watch -n 1 nvidia-smi
# Verify Dynamo is running
curl -sv http://localhost:8099/health
# Expected: "HTTP/1.1 200 OK"
# when testing is complete, shut down the containers with:
bash stop_dynamo.sh
```

Components started:

- `etcd` container (`etcd-dynamo`) on port 2389
- `nats` container (`nats-dynamo`) on port 4232
- Dynamo container (`dynamo-sglang`) with unified worker on GPUs 0,1,2,3 (TP=4)
Startup time: 5-20 minutes for a 70B model, depending on the state of the system cache.
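Rather than eyeballing `watch nvidia-smi`, startup can be awaited by polling the `/health` endpoint until it returns HTTP 200. A minimal stdlib-only Python sketch (the URL and timeouts are examples, not fixed values):

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url: str, timeout_s: float = 1800.0, poll_s: float = 10.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # frontend not up yet; keep polling
        time.sleep(poll_s)
    return False

# Example usage (uncomment once the containers are starting):
# print(wait_for_health("http://localhost:8099/health"))
```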
Unified worker with custom predictive KV-aware router using Thompson Sampling for optimal request routing.
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
# Start Dynamo with Thompson Sampling router
bash start_dynamo_unified_thompson_hints.sh > startup_output.txt 2>&1
# Wait for startup
watch -n 1 nvidia-smi
# Verify
curl -sv http://localhost:8099/health
# when testing is complete, shut down the containers with:
bash stop_dynamo.sh
```

Additional features:
- Custom frontend with prefix hint header support
- Thompson Sampling router (LinTS + Beta bandits)
- KV cache overlap optimization
- Workload-aware routing based on OSL and IAT hints
Custom components location: `generalized/`

- `frontend.py` - Accepts `x-prefix-*` headers
- `processor.py` - Forwards hints to router, CSV metrics logging
- `router.py` - Thompson Sampling, KV overlap calculations
Separate prefill and decode workers for maximum throughput. More complex setup.
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
export DYNAMO_PREFILL_GPUS=0,1
export DYNAMO_DECODE_GPUS=2,3
# Start Dynamo disaggregated
bash start_dynamo_disagg.sh > startup_output.txt 2>&1
# Wait for startup (both workers need to initialize)
watch -n 1 nvidia-smi
# Verify
curl -sv http://localhost:8099/health
# when testing is complete, shut down the containers with:
bash stop_dynamo.sh
```

Components started:

- `etcd` container on port 2379
- `nats` container on port 4222
- `prefill` worker on GPUs 0,1 (TP=2)
- `decode` worker on GPUs 2,3 (TP=2)
- Dynamo Frontend on port 8099
Startup time: ~5 minutes (both workers must initialize)
Note: Disaggregated mode uses NIXL for KV cache transfer between workers.
Instead of using pre-built NGC containers, you can build Dynamo runtime images directly from the dynamo main branch. This is useful for testing unreleased features or customizing the build.
The startup scripts (start_dynamo_optimized_thompson_hints_vllm.sh and start_dynamo_optimized_thompson_hints_sglang.sh) support source-built images through two .env variables:
- `DYNAMO_FROM_SOURCE=true` — enables source-build mode; forces use of `processor_multilru.py` and `router_multilru.py`
- `DYNAMO_IMAGE` — the Docker image tag to build and use (for example, `dynamo-sglang-source:main`)
Set these in your .env file:
```bash
DYNAMO_FROM_SOURCE=true
DYNAMO_IMAGE="dynamo-sglang-source:main"  # or dynamo-vllm-source:main for vLLM
```

The build requires the following system packages on Ubuntu:
```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config \
  libclang-dev protobuf-compiler python3-dev cmake
```

Run the following commands from the root of the cloned dynamo repository:
```bash
cd /path/to/dynamo
# Render the SGLang Dockerfile from templates
python container/render.py --framework=sglang --target=runtime --output-short-filename
# Build the image (takes 30–90 minutes on first build; subsequent builds use cache)
docker build -t dynamo-sglang-source:main -f container/rendered.Dockerfile .
```

```bash
cd /path/to/dynamo
# Render the vLLM Dockerfile from templates
python container/render.py --framework=vllm --target=runtime --output-short-filename
# Build the image
docker build -t dynamo-vllm-source:main -f container/rendered.Dockerfile .
```

Once the image is built, run the startup script as normal — it automatically picks up `DYNAMO_FROM_SOURCE` and `DYNAMO_IMAGE` from `.env`:
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
# SGLang
bash start_dynamo_optimized_thompson_hints_sglang.sh > startup_output.txt 2>&1
# vLLM
bash start_dynamo_optimized_thompson_hints_vllm.sh > startup_output.txt 2>&1
```

If the image specified by `DYNAMO_IMAGE` does not exist, the script will print the exact build commands and exit with an error.
Note: The dynamo `main` branch targets a different SGLang/vLLM version than the pre-built NGC containers. Verify the bundled framework version after building with `docker run --rm <image> python -c "import sglang; print(sglang.__version__)"`.
After starting Dynamo with any of the above options, verify the integration is working.
Note
Commands in this section require the virtual environment to be active. See uv Python Environment.
Run simple workflows to test basic connectivity and prefix header support:
```bash
cd /path/to/NeMo-Agent-Toolkit
# Test basic Dynamo connectivity
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_e2e_test.yml \
  --input "What time is it?"
# Test Dynamo with dynamic prefix headers (for Predictive KV-Aware Cache router)
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml \
  --input "What time is it?"
```

For comprehensive validation, run the integration test script:
Note
Requires the virtual environment to be active. See uv Python Environment.
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
bash test_dynamo_integration.sh
```

Environment variables (optional):

- `DYNAMO_BACKEND` - Backend type: `sglang` (`vllm` and `tensorRT` support still need to be developed)
- `DYNAMO_MODEL` - Model name (default: `llama-3.3-70b`)
- `DYNAMO_PORT` - Frontend port (default: `8099`)
Tests performed:
- NeMo Agent Toolkit environment is active
- Configuration files exist
- Dynamo frontend is responding on the configured port
- Basic chat completion request works
- Workflow with basic config runs successfully
- Workflow with prefix hints runs successfully
Expected output (all tests passing):
```
==========================================
Testing react_benchmark_agent with Dynamo
==========================================
Backend: sglang
Model: llama-3.3-70b
Port: 8099
==========================================
0. Checking if NAT environment is active...
✓ NAT command found
1. Checking if configuration files exist...
✓ Configuration files found
2. Checking if Dynamo frontend is running on port 8099...
✓ Dynamo frontend is running
3. Testing basic Dynamo endpoint...
✓ Dynamo endpoint is working
4. Testing NAT workflow with Dynamo (basic config)...
✓ Basic config test completed successfully
5. Testing NAT workflow with Dynamo (with prefix hints)...
✓ Prefix hints test completed successfully
==========================================
Test Summary
==========================================
Total tests: 6
Passed: 6
Failed: 0
✓ All tests passed!
```
What the test validates:
- The environment is activated
- Configuration files exist
- Dynamo frontend is running on port 8099
- Dynamo endpoint responds correctly
- Workflow executes with basic config
- Workflow executes with prefix hints
If any tests fail, the script provides guidance on how to fix the issue.
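The same health and chat-completion checks can also be scripted directly. A minimal stdlib-only Python sketch; the helper name is hypothetical, while the endpoint, request body, and model name match the configuration used in this guide:

```python
import json
import urllib.error
import urllib.request
from typing import Optional

def chat_once(base: str = "http://localhost:8099",
              model: str = "llama-3.3-70b") -> Optional[str]:
    """Send one non-streaming chat completion; return the reply text, or None on failure."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
    }).encode()
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            data = json.load(resp)
        return data["choices"][0]["message"]["content"]
    except (urllib.error.URLError, OSError, KeyError):
        return None

# Example usage (requires a running Dynamo frontend):
# print(chat_once() or "request failed; is Dynamo running on port 8099?")
```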
A single script stops all Dynamo components regardless of which mode was started:
```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
bash stop_dynamo.sh
```

What it stops:

- Dynamo container (`dynamo-sglang` or `dynamo-sglang-thompson`)
- `etcd` container (`etcd-dynamo`)
- `nats` container (`nats-dynamo`)
Output:
```
=========================================================
Stopping Dynamo SGLang FULL STACK
=========================================================
Stopping Dynamo container (standard)...
✓ Dynamo container stopped and removed
Stopping ETCD container...
✓ ETCD container stopped and removed
Stopping NATS container...
✓ NATS container stopped and removed
=========================================================
✓ All components stopped!
=========================================================
```
Note
Commands in this section require the virtual environment to be active. See uv Python Environment.
```bash
cd /path/to/NeMo-Agent-Toolkit
# Basic Dynamo test
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_e2e_test.yml \
  --input "What time is it?"
# With prefix headers (for Thompson Sampling router)
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml \
  --input "What time is it?"
```

```bash
# Basic chat completion
curl http://localhost:8099/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'

# Streaming test
curl http://localhost:8099/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50,
    "stream": true
  }'
```

```bash
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
./monitor_dynamo.sh
```

Menu options:
- View Frontend logs
- View Processor logs
- View Router logs
- View all component logs
- View container logs
- Test health endpoint
- Test basic inference
- Check GPU usage
- Check process status
```bash
# View container logs
docker logs -f dynamo-sglang
# View `etcd` logs
docker logs -f etcd-dynamo
# View `nats` logs
docker logs -f nats-dynamo
# GPU utilization
watch -n 2 nvidia-smi
# Check running containers
docker ps --format "table {{.Names}}\t{{.Status}}"
```

When using the Thompson Sampling router (`start_dynamo_unified_thompson_hints.sh`), dynamic prefix headers enable optimal KV cache management and request routing.
Prefix headers help the router:
- Identify related requests for KV cache reuse
- Make routing decisions based on workload characteristics
- Track prefix state for optimal worker selection
- Improve throughput through intelligent batching
Prefix headers do not carry KV cache overlap information. The router computes KV cache overlap scores by querying the backend through `dynamo.llm.KvIndexer`.
If overlap scores are unavailable, the router cannot account for KV cache matches when routing and will behave like a non-KV-aware router for that signal.
This can happen in the following configuration:
- You are using a Dynamo image or build that does not include the `dynamo.llm` KV routing classes. In this case, the router logs a warning that `dynamo.llm` is not available and overlap scores will be empty.
To confirm overlap scores are missing, check router_metrics.csv and verify that overlap_chosen is always 0.000000.
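That verification can be scripted. A small sketch (the helper function is hypothetical) that assumes the `router_metrics.csv` column layout documented later in this guide:

```python
import csv

def has_kv_overlap(csv_path: str) -> bool:
    """True if any routing decision in router_metrics.csv recorded a nonzero overlap score."""
    with open(csv_path, newline="") as f:
        return any(float(row["overlap_chosen"]) > 0.0 for row in csv.DictReader(f))

# Example usage: copy the CSV out of the container first, e.g.
#   docker exec dynamo-sglang cat /workspace/metrics/router_metrics.csv > router_metrics.csv
# then:
#   print(has_kv_overlap("router_metrics.csv"))
```

If this returns `False` over a realistic workload, overlap scores are missing and the router is falling back to non-KV-aware behavior for that signal.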
Use the dynamo LLM type in your eval config. Prefix headers are sent by default:
```yaml
llms:
  dynamo_llm:
    _type: dynamo
    model_name: llama-3.3-70b
    base_url: http://localhost:8099/v1
    api_key: dummy
    # Prefix headers are enabled by default with template "nat-dynamo-{uuid}"
    # Optional: customize the template or routing hints
    # prefix_template: "react-benchmark-{uuid}"  # Custom template
    # prefix_template: null  # Set to null to disable prefix headers entirely
    prefix_total_requests: 10  # Expected requests per prefix
    prefix_osl: MEDIUM  # Output Sequence Length: LOW | MEDIUM | HIGH
    prefix_iat: MEDIUM  # Inter-Arrival Time: LOW | MEDIUM | HIGH
```

Note: The `dynamo` LLM type automatically sends prefix headers using the default template `nat-dynamo-{uuid}`. To disable prefix headers entirely, set `prefix_template: null` in your config.
| Header | Description | Values |
|---|---|---|
| `x-prefix-id` | Unique identifier for request group | UUID-based string (`null` to disable all extra headers) |
| `x-prefix-total-requests` | Expected total requests for this prefix | Integer (1 for independent queries) |
| `x-prefix-osl` | Output Sequence Length hint | `LOW` (~50 tokens), `MEDIUM` (~200), `HIGH` (~500+) |
| `x-prefix-iat` | Inter-Arrival Time hint | `LOW` (rapid), `MEDIUM` (normal), `HIGH` (long delays) |
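On the client side, a template such as `nat-dynamo-{uuid}` is expanded into a concrete `x-prefix-id` and sent alongside the hint headers. A sketch of what the outgoing headers might look like; the helper below is illustrative, not the toolkit's actual implementation:

```python
import uuid

def build_prefix_headers(template: str = "nat-dynamo-{uuid}",
                         total_requests: int = 1,
                         osl: str = "MEDIUM",
                         iat: str = "MEDIUM") -> dict:
    """Expand the prefix template and assemble the x-prefix-* hint headers."""
    assert osl in {"LOW", "MEDIUM", "HIGH"} and iat in {"LOW", "MEDIUM", "HIGH"}
    return {
        "x-prefix-id": template.format(uuid=uuid.uuid4().hex[:16]),
        "x-prefix-total-requests": str(total_requests),
        "x-prefix-osl": osl,
        "x-prefix-iat": iat,
    }

headers = build_prefix_headers("react-benchmark-{uuid}",
                               total_requests=5, osl="LOW", iat="LOW")
print(headers["x-prefix-id"])  # e.g. react-benchmark-a1b2c3d4e5f6g7h8
```

All requests in one group (a conversation, an agent's tool-call sequence) would reuse the same expanded `x-prefix-id`, which is what lets the router recognize them as related.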
Each question is independent and uses the default prefix template:

```yaml
llms:
  eval_llm:
    _type: dynamo
    # prefix_template defaults to "nat-dynamo-{uuid}"
    prefix_total_requests: 1
    prefix_osl: MEDIUM
    prefix_iat: LOW  # Eval runs many queries quickly
```

Related requests should share a prefix:
```yaml
llms:
  chat_llm:
    _type: dynamo
    prefix_template: "chat-{uuid}"  # Optional: custom template
    prefix_total_requests: 8  # Average conversation length
    prefix_osl: MEDIUM
    prefix_iat: HIGH  # Users take time to type
```

ReAct agents make multiple related calls:
```yaml
llms:
  agent_llm:
    _type: dynamo
    prefix_template: "agent-{uuid}"  # Optional: custom template
    prefix_total_requests: 5  # Typical tool call sequence
    prefix_osl: LOW  # Tool calls produce short responses
    prefix_iat: LOW  # Agent runs tool calls rapidly
```

- NeMo Agent Toolkit configurations use `_type: dynamo` (prefix headers enabled by default)
- Dynamo LLM Provider generates a unique UUID per request using the template
- Headers injected into the HTTP request:

  ```
  x-prefix-id: react-benchmark-a1b2c3d4e5f6g7h8
  x-prefix-total-requests: 1
  x-prefix-osl: MEDIUM
  x-prefix-iat: MEDIUM
  ```

- Dynamo Frontend extracts headers
- Processor tracks prefix state
- Router makes routing decisions based on:
- KV cache overlap with existing prefixes
- Worker affinity for related requests
- Load balancing across workers
- Workload hints (OSL and IAT)
The startup scripts support configuration through environment variables. Set these before running the scripts:
| Variable | Description | Default |
|---|---|---|
| `DYNAMO_MODEL_DIR` | Local path to the model directory | (required) |
| `DYNAMO_REPO_DIR` | Path to NeMo-Agent-Toolkit repository | Auto-detected |
| `DYNAMO_GPU_DEVICES` | Comma-separated GPU device IDs | `0,1,2,3` |
| `DYNAMO_HTTP_PORT` | Frontend HTTP port | `8099` |
| `DYNAMO_ETCD_PORT` | `etcd` client port | `2389` |
| `DYNAMO_NATS_PORT` | `nats` messaging port | `4232` |
| `DYNAMO_METRICS_URL` | Prometheus metrics endpoint URL for the router | `http://localhost:9090/metrics` |
| `ROUTER_METRICS_CSV` | Path to CSV file for router decision logging | `router_metrics.csv` |
Example configuration:
```bash
# Configure environment before running scripts
export DYNAMO_MODEL_DIR=/path/to/models/Llama-3.3-70B-Instruct
export DYNAMO_GPU_DEVICES=0,1,2,3
export DYNAMO_HTTP_PORT=8099
# Then start Dynamo
bash start_dynamo_unified.sh
```

Each startup script also has configurable variables at the top that can be edited directly:
```bash
# start_dynamo_unified.sh
CONTAINER_NAME="dynamo-sglang"
WORKER_GPUS="${DYNAMO_GPU_DEVICES:-0,1,2,3}"  # Override with env var or edit default
TP_SIZE=4
HTTP_PORT="${DYNAMO_HTTP_PORT:-8099}"
MODEL="/workspace/models/Llama-3.3-70B-Instruct"
SERVED_MODEL_NAME="llama-3.3-70b"
IMAGE="nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
SHM_SIZE="16g"
# Infrastructure ports (non-default to avoid conflicts)
ETCD_CLIENT_PORT="${DYNAMO_ETCD_PORT:-2389}"
NATS_PORT="${DYNAMO_NATS_PORT:-4232}"
# Local paths - MUST be set via environment variable or edited here
LOCAL_MODEL_DIR="${DYNAMO_MODEL_DIR:?Error: DYNAMO_MODEL_DIR environment variable must be set}"
```

Option 1: Use environment variable (recommended):
```bash
export DYNAMO_GPU_DEVICES=0,1,2,3
bash start_dynamo_unified.sh
```

Option 2: Edit the script directly:
```bash
# In the script, change:
WORKER_GPUS="0,1,2,3"
# The docker run command will use:
--gpus '"device=0,1,2,3"'
```

For a different model, update both the model directory and served name:
```bash
# Set environment variable for model path
export DYNAMO_MODEL_DIR="${HOME}/models/Llama-3.1-8B-Instruct"
# Edit script variables for model metadata
MODEL="/workspace/models/Llama-3.1-8B-Instruct"
SERVED_MODEL_NAME="llama-3.1-8b"
TP_SIZE=2  # Smaller models may need fewer GPUs
```

Option 1: Use environment variables:
```bash
export DYNAMO_HTTP_PORT=8080
export DYNAMO_ETCD_PORT=2379
export DYNAMO_NATS_PORT=4222
bash start_dynamo_unified.sh
```

Option 2: Edit the script directly:
```bash
HTTP_PORT=8080
ETCD_CLIENT_PORT=2379
NATS_PORT=4222
```

The Thompson Sampling router (`start_dynamo_unified_thompson_hints.sh`) produces three CSV files for monitoring and analysis. These files are located in `/workspace/metrics/` inside the container.
```bash
# From the host
docker exec dynamo-sglang cat /workspace/metrics/router_metrics.csv
docker exec dynamo-sglang cat /workspace/metrics/processor_requests.csv
docker exec dynamo-sglang cat /workspace/metrics/frontend_throughput.csv

# From inside the container
docker exec -it dynamo-sglang bash
cat /workspace/metrics/router_metrics.csv
```

`router_metrics.csv` logs every routing decision made by the Thompson Sampling router.
Columns:
| Column | Description |
|---|---|
| `ts_epoch_ms` | Timestamp in milliseconds since epoch |
| `tokens_len` | Number of tokens in the request |
| `prefix_id` | Unique prefix identifier (auto-generated or from header) |
| `reuse_after` | Remaining reuse budget after this request |
| `chosen_worker` | Integer ID of the selected worker |
| `overlap_chosen` | KV cache overlap score (0.0-1.0) |
| `decode_cost` | Estimated decode cost |
| `prefill_cost` | Estimated prefill cost |
| `iat_level` | Inter-arrival time hint (`LOW`, `MEDIUM`, or `HIGH`) |
| `stickiness` | Worker affinity score |
| `load_mod` | Load modifier applied |
Example output:
```
ts_epoch_ms,tokens_len,prefix_id,reuse_after,chosen_worker,overlap_chosen,decode_cost,prefill_cost,iat_level,stickiness,load_mod
1767923263058,38,auto-9e05dbb0682f458a89b82f64bb328011,0,7587892060544177931,0.000000,2.000000,0.037109,MEDIUM,0.000,1.000000
```

`processor_requests.csv` logs latency metrics for each processed request.
Columns:
| Column | Description |
|---|---|
| `num_tokens` | Number of output tokens generated |
| `latency_ms` | Total request latency in milliseconds |
| `latency_ms_per_token` | Average latency per token |
Example output:
```
num_tokens,latency_ms,latency_ms_per_token
10,70152.021,7015.202100
```

`frontend_throughput.csv` logs throughput metrics at regular intervals (default: every 5 seconds).
Columns:
| Column | Description |
|---|---|
| `ts_epoch_ms` | Timestamp in milliseconds since epoch |
| `requests` | Number of requests completed in this interval |
| `interval_s` | Length of the measurement interval in seconds |
| `req_per_sec` | Computed requests per second |
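The derived columns in these CSVs are simple ratios: `latency_ms_per_token = latency_ms / num_tokens` and `req_per_sec = requests / interval_s`. A quick check against the sample rows shown in this guide:

```python
# Sanity-check the derived CSV columns against the sample rows in this guide.
def latency_per_token(latency_ms: float, num_tokens: int) -> float:
    return latency_ms / num_tokens

def req_per_sec(requests: int, interval_s: float) -> float:
    return requests / interval_s

print(latency_per_token(70152.021, 10))  # 7015.2021 (processor_requests.csv row)
print(req_per_sec(1, 5.0))               # 0.2 (frontend_throughput.csv row)
```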
Example output:
```
ts_epoch_ms,requests,interval_s,req_per_sec
1767923267849,0,5.000,0.000000
1767923272850,0,5.000,0.000000
1767923337856,1,5.000,0.200000
1767923342856,0,5.000,0.000000
```

Check logs:

```bash
docker logs dynamo-sglang
```

Common causes:
- GPU not available
- Model path incorrect
- Port already in use
```bash
# Check if container is running
docker ps --format '{{.Names}}'
# Check what's listening on port 8099
ss -tlnp | grep 8099
```

```bash
# Check `etcd` health
curl http://localhost:2389/health
# Check `etcd` logs
docker logs etcd-dynamo
```

```bash
# Check `nats` is running
docker ps | grep nats-dynamo
# Check `nats` logs
docker logs nats-dynamo
```

Symptom: `KeyError: 'token_ids'` or tokenizer errors
Fix: Clear etcd data and restart
```bash
bash stop_dynamo.sh
# Wait a few seconds
bash start_dynamo_unified.sh
```

Symptom: Takes 3+ minutes to start
Causes:
- 70B model takes ~90-120 seconds normally
- Cold cache may require model download
- Insufficient GPU memory causes swapping
Monitoring:
```bash
# Watch GPU memory during startup
watch -n 1 nvidia-smi
```

Known Issue: Disaggregated mode may have issues with streaming requests.

Workaround: Use unified mode for streaming, or use non-streaming requests:

```json
{"stream": false}
```

```
external/dynamo/                                   # Dynamo backend
│
├── 📄 README.md                                   # This file - Dynamo setup guide
├── 📄 .env.example                                # Example environment variables
├── 🔧 start_dynamo_unified.sh                     # Start Dynamo (unified mode)
├── 🔧 start_dynamo_unified_thompson_hints.sh      # Start with Thompson router
├── 🔧 start_dynamo_disagg.sh                      # Start Dynamo (disaggregated)
├── 🔧 stop_dynamo.sh                              # Stop all Dynamo services
├── 🔧 test_dynamo_integration.sh                  # Integration tests
├── 🔧 monitor_dynamo.sh                           # Monitor running services
│
└── 📁 generalized/                                # Custom router components
    ├── frontend.py                                # Prefix header extraction
    ├── processor.py                               # Request processing + metrics
    └── router.py                                  # Thompson Sampling router
```
| Command | Description |
|---|---|
| `bash start_dynamo_unified.sh` | Start unified mode |
| `bash start_dynamo_unified_thompson_hints.sh` | Start with Thompson router |
| `bash start_dynamo_disagg.sh` | Start disaggregated mode |
| `bash stop_dynamo.sh` | Stop all services |
| `./test_dynamo_integration.sh` | Run integration tests |
| `./monitor_dynamo.sh` | Interactive monitoring |
| `curl localhost:8099/health` | Health check |
| `docker logs -f dynamo-sglang` | View logs |
| `nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_e2e_test.yml --input "..."` | Quick NeMo Agent Toolkit validation |
| `nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml --input "..."` | Test with prefix headers |
| Container | Description |
|---|---|
| `dynamo-sglang` | Standard Dynamo worker |
| `etcd-dynamo` | Service discovery and metadata |
| `nats-dynamo` | Message queue for prefill requests |
- React Benchmark Agent - Complete evaluation guide
- Architecture - System diagrams
Now that you have a running Dynamo server and can make curl requests to the endpoint, you're ready to integrate with NeMo Agent Toolkit workflows.
Tip
Ready for Full Integration?
Visit the Dynamo Integration Examples for:
- End-to-end workflow integration with NeMo Agent Toolkit
- Benchmark agent configurations and evaluation harnesses
- Performance analysis scripts and visualization tools
- Architectural deep-dives on toolkit-Dynamo integration patterns