The scripts in the examples/<backend>/launch folder, such as agg.sh, demonstrate how to serve your models locally. The corresponding YAML files, such as agg.yaml, show how to create an equivalent Kubernetes deployment for your inference graph. This guide explains how to create your own deployment files.
Before choosing a template, understand the different architecture patterns:
**Aggregated Serving**
Pattern: Prefill and decode on the same GPU in a single process.
Best suited for:
- Small to medium models (under 70B parameters)
- Development and testing
- Low to moderate traffic
- Cases where simplicity is prioritized over maximum throughput
Tradeoffs:
- Simpler setup and debugging
- Lower operational complexity
- GPU utilization may not be optimal (prefill and decode compete for resources)
- Lower throughput ceiling compared to disaggregated
Example: agg.yaml
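For orientation, here is a minimal sketch of what an aggregated deployment might look like; the namespace, image, and model are placeholders, and the authoritative field layout is the shipped agg.yaml (the full template structure is shown later in this guide):

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: agg-example                # hypothetical name
spec:
  services:
    Frontend:
      dynamoNamespace: agg-example
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: your-image        # placeholder
    Worker:
      dynamoNamespace: agg-example
      componentType: worker
      replicas: 1
      resources:
        requests:
          gpu: "1"                 # prefill and decode share this GPU
      extraPodSpec:
        mainContainer:
          image: your-image        # placeholder
          command: ["/bin/sh", "-c"]
          args:
            - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL
```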
**Aggregated Serving with Router**
Pattern: Load balancer routing across multiple aggregated worker instances.
Best suited for:
- Medium traffic requiring high availability
- Horizontal scaling needs
- Load balancing without the complexity of full disaggregation
Tradeoffs:
- Better scalability than plain aggregated
- High availability through multiple replicas
- Still has GPU underutilization issues of aggregated serving
- More complex than plain aggregated but simpler than disaggregated
Example: agg_router.yaml
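Relative to the plain aggregated sketch above, the router variant mostly changes the frontend flags and the worker replica count. A sketch of the delta, assuming KV routing is enabled through the frontend's `--router-mode kv` flag as described later in this guide:

```yaml
Frontend:
  componentType: frontend
  extraPodSpec:
    mainContainer:
      args:
        - --router-mode
        - kv              # KV-cache-aware routing across workers
Worker:
  componentType: worker
  replicas: 3             # multiple aggregated workers behind the router
```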
**Disaggregated Serving**
Pattern: Separate prefill and decode workers with specialized optimization.
Best suited for:
- Production-style deployments
- High throughput requirements
- Large models (70B+ parameters)
- Maximum GPU utilization needed
Tradeoffs:
- Maximum performance and throughput
- Better GPU utilization (prefill and decode specialized)
- Independent scaling of prefill and decode
- More complex setup and debugging
- Requires understanding of prefill/decode separation
Example: disagg_router.yaml
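A disaggregated deployment splits the worker into separate decode and prefill services. A minimal sketch of the services section, using the `--is-prefill-worker` flag shown later in this guide (service names and flags for your engine may differ; check disagg_router.yaml):

```yaml
services:
  Frontend:
    componentType: frontend
    replicas: 1
  DecodeWorker:
    componentType: worker
    replicas: 1
    resources:
      requests:
        gpu: "1"
    extraPodSpec:
      mainContainer:
        args:
          - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL
  PrefillWorker:
    componentType: worker
    replicas: 1
    resources:
      requests:
        gpu: "1"
    extraPodSpec:
      mainContainer:
        args:
          - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --is-prefill-worker
```

Because prefill and decode are separate services, each can be scaled independently by changing its replicas.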
Select the architecture pattern that best fits your use case and use its YAML file as your template. For example, when using the vLLM backend:
- Development / Testing: use agg.yaml as the base configuration.
- Production with load balancing: use agg_router.yaml to enable scalable, load-balanced inference.
- High performance / disaggregated deployment: use disagg_router.yaml for maximum throughput and modular scalability.
You can run the Frontend on one machine, for example a CPU node, and the workers on different machines (GPU nodes); a sketch of a CPU-only Frontend block follows the list below. The Frontend serves as a framework-agnostic HTTP entry point and typically needs few changes.
It serves the following roles:
- OpenAI-Compatible HTTP Server
  - Provides the `/v1/chat/completions` endpoint
  - Handles HTTP request/response formatting
  - Supports streaming responses
  - Validates incoming requests
- Service Discovery and Routing
  - Auto-discovers backend workers via etcd
  - Routes requests to the appropriate Processor/Worker components
  - Handles load balancing between multiple workers
- Request Preprocessing
  - Initial request validation
  - Model name verification
  - Request format standardization
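As noted above, the Frontend needs no GPU, so its service block can omit GPU requests entirely and be scheduled onto a CPU node. A sketch, where the nodeSelector label is a hypothetical example of how you might pin it:

```yaml
Frontend:
  dynamoNamespace: your-namespace
  componentType: frontend
  replicas: 1
  resources:
    requests:
      cpu: "2"            # CPU-only: no gpu request
      memory: "4Gi"
  extraPodSpec:
    nodeSelector:
      node-type: cpu      # hypothetical label for your CPU node pool
    mainContainer:
      image: your-image
      args:
        - python3 -m dynamo.frontend --http-port 8000
```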
You should then pick a worker and specialize the config. For example:

```yaml
VllmWorker:          # vLLM-specific config
  enforce-eager: true
  enable-prefix-caching: true
```

```yaml
SglangWorker:        # SGLang-specific config
  router-mode: kv
  disagg-mode: true
```

```yaml
TrtllmWorker:        # TensorRT-LLM-specific config
  engine-config: ./engine.yaml
  kv-cache-transfer: ucx
```

Here's a template structure based on the examples:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  replicas: N
  envFromSecret: your-secrets  # e.g., hf-token-secret
  # Health checks for worker initialization
  readinessProbe:
    exec:
      command: ["/bin/sh", "-c", 'grep "Worker.*initialized" /tmp/worker.log']
  resources:
    requests:
      gpu: "1"  # GPU allocation
  extraPodSpec:
    mainContainer:
      image: your-image
      command:
        - /bin/sh
        - -c
      args:
        - python -m dynamo.YOUR_INFERENCE_ENGINE --model YOUR_MODEL --your-flags
```

Consult the corresponding sh file. Each of the python commands used to launch a component goes into your YAML spec under `extraPodSpec:` -> `mainContainer:` -> `args:`.
The frontend is launched with `python3 -m dynamo.frontend [--http-port 8000] [--router-mode kv]`.
Each worker launches a `python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags` command.
If you are a Dynamo contributor, see the dynamo run guide for details on how to run this command.
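As a concrete (hypothetical) example of this mapping for the vLLM backend, a worker line in a launch script becomes the container args, one service per launched component:

```yaml
# In the launch script (e.g. agg.sh), the worker might be started with:
#   python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
# In the deployment YAML, that same command becomes the container args:
extraPodSpec:
  mainContainer:
    command:
      - /bin/sh
      - -c
    args:
      - python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enforce-eager
```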
The worker command goes under args:

```yaml
args:
  - "python -m dynamo.YOUR_INFERENCE_BACKEND --model YOUR_MODEL --your-flags"
```

Set resource requests per worker:

```yaml
resources:
  requests:
    cpu: "N"
    memory: "NGi"
    gpu: "N"
```

Scale with replicas:

```yaml
replicas: N  # Number of worker instances
```

Enable KV-cache routing:

```yaml
args:
  - --router-mode
  - kv  # Enable KV-cache routing
```

Mark disaggregated prefill workers:

```yaml
args:
  - --is-prefill-worker  # For disaggregated prefill workers
```

By default, the Dynamo operator automatically discovers and injects image pull secrets based on container registry host matching. The operator scans Docker config secrets within the same namespace, matches their registry hostnames against the container image URLs, and automatically injects the appropriate secrets into the pod's imagePullSecrets.
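For the host matching to succeed, a standard Docker config secret must exist in the same namespace. A sketch of such a secret (the name and registry are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-registry-secret
  namespace: your-namespace
type: kubernetes.io/dockerconfigjson
data:
  # Base64-encoded Docker config whose "auths" entries contain the
  # registry host (e.g. nvcr.io) used in your container image URLs
  .dockerconfigjson: <base64-encoded-docker-config>
```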
Disabling Automatic Discovery: To disable this behavior for a component and manually control image pull secrets:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
```

When disabled, you can manually specify secrets as you would for a normal pod spec:
```yaml
YourWorker:
  dynamoNamespace: your-namespace
  componentType: worker
  annotations:
    nvidia.com/disable-image-pull-secret-discovery: "true"
  extraPodSpec:
    imagePullSecrets:
      - name: my-registry-secret
      - name: another-secret
    mainContainer:
      image: your-image
```

This automatic discovery eliminates the need to manually configure image pull secrets for each deployment.
After your base model deployment is running, you can deploy LoRA adapters using the DynamoModel custom resource. This allows you to fine-tune and extend your models without modifying the base deployment.
To add a LoRA adapter to your deployment, first link the base model using modelRef in your worker configuration:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Worker:
      modelRef:
        name: Qwen/Qwen3-0.6B  # Base model identifier
      componentType: worker
      # ... rest of worker config
```

Then create a DynamoModel resource for your LoRA:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: my-custom-lora
  baseModelName: Qwen/Qwen3-0.6B  # Must match modelRef.name above
  modelType: lora
  source:
    uri: s3://my-bucket/loras/my-lora
```

For complete details on managing models and LoRA adapters, see: 📖 Managing Models with DynamoModel Guide