Release Notes
Dynamo v1.0.0 is the first major release of the open-source distributed inference platform. This release delivers production-grade disaggregated serving with comprehensive multimodal and omni-model support, KV cache optimizations, improved handling of agentic workloads, Kubernetes-native deployment at scale, and a stabilized public API.
Summary
Multimodal & Diffusion
Dynamo now serves a range of generative modalities—text, image, and video—across all three major inference frameworks. Text-to-image generation is available through both vLLM Omni and SGLang image diffusion pipelines, and text-to-video through SGLang, vLLM Omni, and TensorRT-LLM Wan T2V, with experimental MJPEG streaming for real-time video output. Encoder disaggregation matured with a new EncoderCacheManager and content-addressed hashing, enabling multimodal encoder outputs to be cached and reused across workers. Embedding transfer between workers uses NIXL to minimize latency, and multimodal-aware KV cache routing places requests based on media content for better cache hit rates.
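The content-addressed approach can be pictured with a minimal sketch (illustrative only, not Dynamo's actual EncoderCacheManager API): the cache key is a digest of the raw media bytes, so identical images map to the same cached encoder output regardless of which request or worker carried them.

```python
import hashlib

# Minimal sketch of content-addressed encoder caching (illustrative, not
# Dynamo's EncoderCacheManager): identical media bytes hash to the same key,
# so a repeated image reuses the cached embedding instead of re-encoding.
class EncoderCache:
    def __init__(self):
        self._cache = {}

    @staticmethod
    def key_for(media_bytes: bytes) -> str:
        # Content-addressed: the key depends only on the bytes themselves.
        return hashlib.sha256(media_bytes).hexdigest()

    def get_or_encode(self, media_bytes, encode_fn):
        key = self.key_for(media_bytes)
        if key not in self._cache:
            self._cache[key] = encode_fn(media_bytes)  # expensive encoder run
        return self._cache[key]
```

A second request carrying the same image is then a dictionary lookup rather than another encoder forward pass.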
Agents
Dynamo added building blocks for agentic workloads: agent hints at the API layer, priority scheduling, and (experimental) KV cache retention and lifecycle awareness for long agent sessions. Reasoning content management arrived for DeepSeek v3.2, GLM-4.7, and Kimi-2.5—including interleaved thinking support, where reasoning and tool calls alternate within a single response. New tool call parsers for GLM-4.7, MiniMax-M2, and Kimi K2/K2.5 broaden the set of models that can drive tool-use workflows. Agentic frameworks that target the OpenAI or Anthropic APIs can now connect to Dynamo directly via the new /v1/responses and /v1/messages endpoints, removing the need for adapter layers. Guided decoding now enforces JSON schema constraints on model output across vLLM and TensorRT-LLM, ensuring tool calls and function arguments are always valid structured data.
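Agent hints ride on the OpenAI-compatible request body. A hypothetical payload is sketched below; the exact field names under `nvext.agent_hints` are illustrative assumptions, not the published schema, so consult the Dynamo docs for the real shape.

```python
import json

# Hypothetical request body showing the general shape of the nvext.agent_hints
# extension. The inner field names ("priority", "retain_kv_cache") are
# assumptions for illustration, not the documented schema.
request = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Run the tests and summarize failures."}],
    "nvext": {
        "agent_hints": {
            "priority": "high",       # scheduling priority hint (assumed name)
            "retain_kv_cache": True,  # keep session KV blocks warm (assumed name)
        }
    },
}
body = json.dumps(request)
```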
Unified Configuration & Public API Stabilization
All backends (SGLang, TensorRT-LLM, vLLM) and core components (Frontend, Router, Planner) migrated from fragmented argparse flags to a typed, modular configuration system with validated base classes. The public Python API was streamlined—deprecated types like Component, Namespace, and CancellationToken were removed, and endpoint methods were consolidated. These changes make the SDK smaller, more consistent, and easier to maintain.
See Breaking Changes for migration details.
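The typed configuration approach can be sketched with stdlib dataclasses (a minimal illustration; Dynamo's actual base classes, field names, and validation rules differ): invalid values fail at construction time instead of surfacing later as a misparsed CLI flag.

```python
from dataclasses import dataclass

# Illustrative sketch of a typed, validated config class (not Dynamo's actual
# base classes). Field names here are placeholders for illustration.
@dataclass
class RouterConfig:
    block_size: int = 16
    kv_overlap_score_weight: float = 1.0
    kv_events: bool = True

    def __post_init__(self):
        # Validation runs at construction time, so a bad value fails fast.
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")
        if self.kv_overlap_score_weight < 0:
            raise ValueError("kv_overlap_score_weight must be non-negative")
```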
Kubernetes Production Readiness
Dynamo Operator matured with a v1beta1 DynamoGraphDeploymentRequest API (Preview in Dynamo v1.0.0), config versioning via ConfigMap injection, GPU auto-discovery migrated from Profiler to Operator, rolling updates for DGD worker deployments, and simplified CRD management. The EPP component introduced a decomposed pipeline for Inference Gateway-based routing with pod-level traffic management. LoRA support expanded with routing-aware adapter placement, memory-aware allocation, and multimodal LoRA with Kubernetes deployment examples. Multiple new Kubernetes deployment recipes were added, including Kimi-K2.5, Qwen3-VL-30B-A3B-FP8, and Nemotron-3-Super-FP8.
Performance & Reliability
Dynamo Snapshot (Preview in Dynamo v1.0.0) enables fast GPU worker recovery via a portable DaemonSet using CRIU and cuda-checkpoint, now extended to SGLang. The Dynamo Planner adds a load-based scaling approach and a new GlobalPlanner mode (Preview in Dynamo v1.0.0) that provides cross-deployment autoscaling for multiple models or deployments backing an endpoint. Observability was overhauled with standardized dynamo_router_* metrics, engine-level Prometheus metrics, OTel tracing for routing, and more robust Grafana dashboards.
Under the Hood
Two posts on the Dynamo Dev Blog give a closer look at some of the problems we've worked on:
- Flash Indexer: Inter-Galactic KV Routing traces six iterations of data structure design—from a Python dictionary to a concurrent positional index with jump search. The result: the Dynamo Router sustains 170M ops/s—42x faster than what we shipped in Dynamo v0.1.0 and enough to handle planetary-scale inference workloads (we think).
- Full-Stack Optimizations for Agentic Inference tackles the visibility gap between agent harnesses and inference stacks. Claude Code and Codex know what's urgent—but the inference engines handling the workloads didn't, until now. The new `nvext.agent_hints` API lets harnesses pass scheduling priority, cache retention, and speculative prefill hints directly to the engine.
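For readers unfamiliar with the jump-search step mentioned in the Flash Indexer post, here is the textbook single-threaded version over a sorted list. It is only a conceptual stand-in: the real index is concurrent and positional, and this sketch makes no claims about its internals.

```python
import math

# Textbook jump search over a sorted list: advance in sqrt(n)-sized blocks,
# then linear-scan the one block that can contain the target.
def jump_search(arr, target):
    n = len(arr)
    if n == 0:
        return -1
    step = max(1, math.isqrt(n))
    prev = 0
    # Jump ahead until the end of the current block reaches the target.
    while prev < n and arr[min(prev + step, n) - 1] < target:
        prev += step
    # Linear scan within the candidate block.
    for i in range(prev, min(prev + step, n)):
        if arr[i] == target:
            return i
    return -1
```

Jump search does O(sqrt(n)) comparisons, which trades a worse asymptotic bound than binary search for cheaper, more predictable memory access patterns.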
Open-Source Contributions
Between v0.9.0 and v1.0.0, we merged over 700 commits from over 90 contributors — 34 first-time contributors and 19 external contributors from 12 organizations.
First-Time External Contributors
- @devivasudevan (Microsoft) contributed a PR that adds Azure AKS storage guidance for Dynamo caches (#5581).
- @maljazaery (Microsoft) contributed a PR that clarifies DGDSA creation for services is disabled by default (#6389).
- @dsocek (Intel) contributed a PR that improves multimodal disaggregation reliability (#5895).
- @muskansh-google (Google) contributed a PR that updates build commands for the Dynamo + SGLang container (#5908).
- @InfraWhisperer (F5) contributed a PR that fixes a frontend crash when using the TRT-LLM runtime image (#6481).
- @Kaonael (Gcore) contributed a PR that adds a status state enum to DynamoGraphDeployment for improved lifecycle tracking (#6324).
- @Ryan-Amirthan (Fern) contributed a PR that adds standard NVIDIA Fern styling assets to the documentation site (#6148).
- @bledden (Facilitair) contributed a PR that forwards `stream_options` through the multimodal request pipeline (#6474).
- @advpropsys (WhiteCircle.ai) contributed a PR that reduces the NATS consumer inactive threshold from 1 hour to 2 minutes to prevent stale connections (#5861).
- @luc-hiverge (Hiverge) contributed a PR that fixes first token creation signal timing by emitting the signal after sleeping (#5681).
- @orangeng contributed a PR that fixes the service name in port-forward documentation (#5527).
- @huitianbai contributed a PR that limits bootstrap room ID range to 0–2^63-1 to prevent overflow (#6277).
First-Time NVIDIA Contributors:
- @knowicki-nvidia contributed a PR that adds image diffusion and text-to-image support for the SGLang backend (#5609).
- @akshatha-k contributed a PR that restructures KVBM documentation into a three-tier format (#5905).
- @alexanderbilk contributed a PR that adds a Prometheus port for NIXL telemetry metrics (#5567).
- @rwipfelnv contributed a PR that adds Grafana dashboard and monitoring setup for observability (#4639).
- @mikwieczorek contributed a PR that fixes TRT-LLM recipe component type from "main" to "worker" (#5788).
- @jpohl-nv contributed a PR that adds experimental MJPEG video streaming via `/v1/videos/stream` (#6487).
- @rafiw contributed a PR that adds Triton path environment variables to the vLLM runtime Dockerfile (#6401).
Returning External Contributors: @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @ls-2018, @AmeenP (PrimeIntellect), @kerthcet (InftyAI/Hiverge).
If you would like to get involved, please see our Contribution Guide.
Breaking Changes
ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.
CLI Flags and Environment Variables
- KV Router Flags Renamed (#6361): All KV router CLI flags and env vars now use the `--router-*` / `DYN_ROUTER_*` prefix.

| Old Flag / Env Var | New Flag / Env Var |
|---|---|
| `--kv-events` / `DYN_KV_EVENTS` | `--router-kv-events` / `DYN_ROUTER_USE_KV_EVENTS` |
| `--kv-overlap-score-weight` / `DYN_KV_OVERLAP_SCORE_WEIGHT` | `--router-kv-overlap-score-weight` / `DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT` |
| `--assume-kv-reuse` / `DYN_ASSUME_KV_REUSE` | `--router-assume-kv-reuse` / `DYN_ROUTER_ASSUME_KV_REUSE` |
| `--durable-kv-events` / `DYN_DURABLE_KV_EVENTS` | `--router-durable-kv-events` / `DYN_ROUTER_DURABLE_KV_EVENTS` |
| `--track-active-blocks` / `DYN_TRACK_ACTIVE_BLOCKS` | `--router-track-active-blocks` / `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` |
| `--track-output-blocks` | `--router-track-output-blocks` |
| `--router-ttl` / `DYN_ROUTER_TTL` | `--router-ttl-secs` / `DYN_ROUTER_TTL_SECS` |

Migrate: Update all CLI invocations, env vars, and deployment YAMLs to use the new names.
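If you manage many deployment manifests, the env var half of this rename is mechanical. A small hypothetical helper (not part of Dynamo) that applies the table above to an environment dict:

```python
# Illustrative only: rewrites the renamed KV router environment variables
# (per the table above) to their DYN_ROUTER_* names. The mapping mirrors the
# release notes; the helper itself is not part of Dynamo.
RENAMED_ENV_VARS = {
    "DYN_KV_EVENTS": "DYN_ROUTER_USE_KV_EVENTS",
    "DYN_KV_OVERLAP_SCORE_WEIGHT": "DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
    "DYN_ASSUME_KV_REUSE": "DYN_ROUTER_ASSUME_KV_REUSE",
    "DYN_DURABLE_KV_EVENTS": "DYN_ROUTER_DURABLE_KV_EVENTS",
    "DYN_TRACK_ACTIVE_BLOCKS": "DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
    "DYN_ROUTER_TTL": "DYN_ROUTER_TTL_SECS",
}

def migrate_env(env: dict) -> dict:
    """Return a copy of `env` with old KV router variable names rewritten."""
    return {RENAMED_ENV_VARS.get(key, key): value for key, value in env.items()}
```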
- Disagg Flag Inverted (#6515): `--enforce-disagg` replaced by `--decode-fallback` with inverted semantics — disaggregated mode is now enforced by default. Migrate: Replace `--enforce-disagg` with `--decode-fallback`. If you need fallback to aggregated mode, explicitly pass `--decode-fallback` or `DYN_DECODE_FALLBACK=true`. In the EPP plugin, update from `DYN_ENFORCE_DISAGG` to `DYN_DECODE_FALLBACK` with an inverted boolean.
- Migration Limit Moved to Frontend (#5918): The `--migration-limit` CLI flag has been removed from all backend workers (vLLM, SGLang, TRT-LLM) and is now set on the Frontend only. Migrate: Remove `--migration-limit` from backend launch commands; pass it to the Frontend instead.
- Connector Flag Replaced (#6450): The `--connector` flag is removed. Disaggregated prefill workers now require an explicit `--kv-transfer-config` with a JSON value. Migrate: Replace `--connector nixl` with `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`. Update all deployment YAMLs and launch scripts accordingly.
- KV Events Now Opt-In (#6404): KV cache events are no longer auto-created when prefix caching is enabled. Users must explicitly opt in via `--kv-events-config`. Migrate: Add `--kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:20080","enable_kv_cache_events":true}'` to worker launch commands. Replace the `DYN_VLLM_KV_EVENT_PORT` env var with the CLI flag.
- Local Indexer Now Default (#5941, #6073): Default event transport changed from JetStream to NATS Core/Event Plane with Local Indexer. The `--enable-local-indexer` flag is removed. Migrate: If you relied on JetStream persistence, add `--durable-kv-events` on both the frontend and all workers. Remove any `--enable-local-indexer` flags.
- Omni Flags Prefixed (#6476): 14 diffusion/omni CLI flags renamed with an `--omni-` prefix (e.g., `--enforce-eager` → `--omni-enforce-eager`). Migrate: Update CLI invocations to use the `--omni-`-prefixed names.
- Multimodal Worker Flag Removed (#6060): `--multimodal-encode-prefill-worker` removed from the vLLM backend. Migrate: Use `--multimodal-encode-worker`, `--multimodal-worker`, or `--multimodal-decode-worker` instead.
- Output Modalities Required (#6270): vLLM omni mode no longer auto-registers image endpoints; you must pass `--output-modalities image` explicitly.
- Media URL Flags Unified (#6391): SGLang/TRT-LLM flags `--image-diffusion-fs-url`, `--video-generation-fs-url`, and `--output-dir` replaced by `--media-fs-url` and `--media-base-url`.
- Discovery Backend Simplified (#6167): `DYN_DISCOVERY_BACKEND` now accepts `kubernetes`, `etcd`, `file`, and `mem` directly. Remove `DYN_KV_STORE`; replace `--store-kv` with `--discovery-backend`.
- Planner CLI Replaced by Config File (#6356): All individual Planner CLI flags removed in favor of `--config <path>` pointing to a JSON/YAML configuration file.
- dynamo-run Removed (#6203): The `dynamo-run` CLI tool and all its flags have been removed. Migrate to the Python-based deployment approach.
Renamed environment variables:

| Old | New |
|---|---|
| `DYNAMO_FATBIN_PATH` | `DYN_FATBIN_PATH` |
| `ENABLE_KVBM_RECORD` | `DYN_KVBM_ENABLE_RECORD` |
| `SPLIT_ENCODE` | `DYN_SPLIT_ENCODE` |
| `DYNAMO_BUSY_THRESHOLD` | `DYN_BUSY_THRESHOLD` |
| `DYNAMO_*` (EPP vars) | `DYN_*` |
Prometheus Metrics
- KVStats Metrics Removed (#5704): `dynamo_component_kvstats_*` metrics removed. Use `dynamo_frontend_inter_token_latency_seconds` for Decode autoscaling instead of `kvstats_gpu_cache_usage_percent`.
- Router Metrics Namespace (#6227): `dynamo_frontend_worker_active_*` → `dynamo_router_worker_active_*`, `dynamo_component_router_*` → `dynamo_router_*`. A new `router_id` label was added to all Router metrics.
- Frontend Request Counter Label (#5568): `dynamo_frontend_requests_total` now includes an `error_type` label. Update PromQL queries to account for the new label.
- SGLang Metric Prefix (#5701): SGLang metrics now use the native `sglang:` prefix (colon) instead of `sglang_` (underscore).
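Dashboards and alerts that reference the old names need a one-time rewrite. A hypothetical helper (not shipped with Dynamo) that applies the Router namespace renames to a PromQL query string:

```python
import re

# Illustrative helper (not part of Dynamo) that rewrites the renamed Router
# metric prefixes from this release inside a PromQL query string.
METRIC_RENAMES = [
    (r"\bdynamo_frontend_worker_active_", "dynamo_router_worker_active_"),
    (r"\bdynamo_component_router_", "dynamo_router_"),
]

def migrate_promql(query: str) -> str:
    for pattern, replacement in METRIC_RENAMES:
        query = re.sub(pattern, replacement, query)
    return query
```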
Kubernetes
- etcd Subchart Disabled (#6329): Bundled etcd is now disabled by default. Set `global.etcd.install: true` if your deployment depends on it.
- Webhook Key Removed (#6441): `webhook.enabled` removed from Helm values. Remove it from custom values files.
- Helm Values Restructured for Snapshot (#5946): `storage.signalHostPath`, `daemonset.criu.*`, and `daemonset.containerRuntimeSocket` replaced by the `config.checkpoint.*` and `config.agent.*` hierarchy.
- DGDR Planner Schema (#6463): `FeaturesSpec.planner` in the DGDR CRD changed from a typed `PlannerSpec` to the PlannerConfig JSON schema. Review DGDR manifests that set `features.planner`.
- EPP Discovery Timeout (#5770): `DYN_DISCOVERY_TIMEOUT_SEC` no longer works. Use StartupProbe `failureThreshold` × `periodSeconds` instead.
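For example, to approximate a 600-second discovery window (formerly `DYN_DISCOVERY_TIMEOUT_SEC=600`), size the startup probe so that `failureThreshold × periodSeconds ≈ 600`. The probe path and port below are placeholders, not the EPP container's actual values:

```yaml
# Illustrative: ~600 s window = failureThreshold (60) x periodSeconds (10)
startupProbe:
  httpGet:
    path: /health   # placeholder; use your EPP container's probe path
    port: 8080      # placeholder; use your EPP container's port
  failureThreshold: 60
  periodSeconds: 10
```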
Python SDK
- Component/Namespace/CancellationToken Removed (#6403, #6386, #6405): `Component`, `Namespace`, and `CancellationToken` classes removed from the Python API. Migrate: Replace `runtime.namespace('ns').component('comp').endpoint('ep')` with `runtime.endpoint('ns.comp.ep')`. Replace `token.cancel()` with `HttpService.shutdown()`. Pass `DistributedRuntime` directly to `service.run()` methods.
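The consolidated dotted address can be illustrated with a tiny parser (a hypothetical helper, not part of the SDK) for the `'namespace.component.endpoint'` string that `runtime.endpoint()` now takes:

```python
# Hypothetical helper (not part of the Dynamo SDK) that splits the new dotted
# endpoint path 'namespace.component.endpoint' into its three parts.
def split_endpoint_path(path: str) -> tuple:
    parts = path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'namespace.component.endpoint', got {path!r}")
    return tuple(parts)
```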
API Renames and Moves:
| Old | New | PR |
|---|---|---|
| `client2(router_mode)` | `client(router_mode=router_mode)` | #6158 |
| `register_llm` / `unregister_llm` / `fetch_llm` | `register_model` / `unregister_model` / `fetch_model` | #6268 |
| `ModelDeploymentCard` in `dynamo.runtime` | Moved to `dynamo._internal` | #6378 |
| `EncoderCacheManager` | `MultimodalEmbeddingCacheManager` in `dynamo.common.memory` | #5962 |
| `KvPushRouter` / `ZmqKvEventPublisherConfig` | `KvRouter`; pass `zmq_endpoint`/`zmq_topic` directly to `KvEventPublisher()` | #6238 |
| `ZmqKvEventPublisher` | `KvEventPublisher(component, zmq_config=config)` | #6016 |
| `DYNAMO_ARGS` from `dynamo.sglang.args` | `DynamoSGLangArgGroup` from `dynamo.sglang.backend_args` | #6280 |
| `Config` from `dynamo.trtllm.utils.trtllm_utils` / `create_worker(...)` | `Config` in `dynamo.trtllm.args` / `create_llm_worker(...)` | #6297 |
Behavior Changes:
- Frontend Config Refactored (#6201): Frontend CLI now rejects unknown args unless `--chat-processor vllm` is set.
- ModelManager Checksum Enforcement (#6054): Mismatched MDC checksums across WorkerSets now raise `ChecksumMismatch` instead of being silently accepted.
- Tool Call Parser Separation (#5849): `--tool-call-parser` alone no longer uses Dynamo's parser. Use `--dyn-tool-call-parser` for Dynamo's pipeline.
- Custom Backend Metrics Removed (#5893): `custom_backend_metrics_endpoint` and `custom_backend_metrics_polling_interval` removed from `LocalModel` and frontend config.
Deprecated Components & Features
- Deprecated Component Removals: Removed the `dynamo-run` CLI and `mistral-rs` engine (#6203), the standalone FastAPI Router (#5845), the `media-nixl` feature (#5940), and `llava-hf` recipes (#6961).
Notable Behavioral Changes
- Local Indexers On By Default (#5941): KV event transport now defaults to NATS Core/Event Plane with Local Indexer instead of JetStream. Pass `--durable-kv-events` on both frontend and workers to restore JetStream behavior.
- GPU Memory Utilization (#5755): `gpu-memory-utilization` adjusted for the vLLM runtime to improve out-of-the-box performance.
- Operator Env Vars Documented (#6548): All environment variables injected by the Operator are now documented.
Deprecated Assets
- `dynamo-crds` Helm Chart: The standalone `dynamo-crds` Helm chart is deprecated. CRDs are now embedded in the Dynamo Operator image and applied automatically via an init container on the operator Deployment (#6466, #6780). Users should uninstall the `dynamo-crds` Helm release; the operator manages CRD lifecycle directly.
Future Deprecations
The following features still work but will be removed in a future release, most targeted for Dynamo v1.1.0.
- `v1alpha1` DGDR API (#6352): The `v1alpha1` DynamoGraphDeploymentRequest API will be removed in a future release. Migrate to `v1beta1`; automatic conversion maintains backward compatibility during the transition.
- enableGpuDiscovery CRD Field (#6224): The `enableGpuDiscovery` CRD field no longer has any effect and will be removed in a future release. GPU discovery now runs automatically.
- ComponentName Field (#6110): The `ComponentName` field on `ServiceReplicaStatus` will be removed in a future release. Migrate to the new `ComponentNames` list field.
- Router Legacy Flag Names (#6346): Router CLI flags without the `--router-` prefix (e.g., `--block-size`, `--kv-events`) will be removed in a future release. Migrate to the prefixed versions (`--router-block-size`, `--router-kv-events`).
- vLLM KV Auto-Enable (#6404): vLLM's auto-enabling of KV events when prefix caching is active will be removed in a future release. Use `--kv-events-config` explicitly instead.
- Prefill/Decode Worker Flags (#6483): The `--is-prefill-worker` and `--is-decode-worker` boolean flags for the vLLM backend will be removed in a future release. Migrate to `--disaggregation-mode`.
- Durable KV Events (#6477): The `--router-durable-kv-events` CLI flag will be removed in a future release. Migrate to the event-plane subscriber (local_indexer mode).
Features & Improvements
Multimodal & Diffusion
- Encoder Cache Infrastructure: Implemented EncoderCacheManager with async support for caching multimodal encoder outputs (#5632, #5676) and content-addressed hashing for TensorRT-LLM (#5715).
- TensorRT-LLM Encoder Cache: Integrated encoder cache into TensorRT-LLM PrefillHandler (#5714), EPD workflow (#5780), and E/PD disaggregated workflow (#5815) for cross-worker multimodal reuse.
- vLLM Embedding Cache: Added embedding cache to PD workers (#6029, #6061) and aggregated vLLM nodes (#6153) for reusing multimodal embeddings across requests.
- Text-to-Image Generation: Added text-to-image support via vLLM Omni pipeline (#5608, #5912) and SGLang image diffusion (#5609).
- Text-to-Video Generation: Added text-to-video support via SGLang T2V (#5793), vLLM Omni pipeline (#6104), and TensorRT-LLM Wan T2V (#5926), with experimental MJPEG video streaming via `/v1/videos/stream` (#6487).
- Multimodal Embedding Transfer: Added embedding transfer sender and receiver for cross-worker multimodal data movement (#6098), adopted transfer classes for the EPD pipeline (#6223), and optimized by keeping embeddings on GPU in the Embedding Sender (#6535).
- Multimodal-Aware KV Cache Routing: Implemented multimodal-aware request routing for vLLM (#6235) and end-to-end multimodal KV cache routing for TensorRT-LLM (#5480), optimizing request placement based on media content.
- TensorRT-LLM Multimodal Preprocessor: Added TensorRT-LLM multimodal preprocessor with backend media decoding (#5910).
- vLLM Frontend Media Decoding: Enabled vLLM backend with frontend media decoding for end-to-end multimodal serving (#5781).
- Batch Image Processing: Added batch image processing in encode worker and Qwen3 model support (#6021).
- SGLang MMEncoder in EPD: Integrated SGLang MMEncoder for multimodal EPD encode worker pipeline (#6162).
- Multimodal Model Support: Improved multimodal disaggregation reliability with Qwen2.5 VL 32B support (#5895) and added Qwen3-VL-30B-A3B support for EPD pipeline (#6533).
- NIXL WRITE Embedding Transfer: Added NIXL WRITE initiation for cross-node multimodal embedding transfer (#6776).
- vLLM Omni Container: Installed vllm-omni in vLLM container for visual generation support (#6458).
Frontend & Agents
- Reasoning and Tool Call Parsers: Added reasoning content management for DeepSeek v3.2, GLM-4.7, and Kimi-2.5 (#6107), interleaved thinking support (#6422), and new tool/reasoning parsers for GLM-4.7 (#5897), MiniMax-M2 (#6294), and Kimi K2/K2.5 (#6407).
- Responses API Compliance: Implemented Responses API compliance with upstream type alignment for spec conformance (#6089).
- Anthropic Messages Endpoint: Added an Anthropic Messages API endpoint (`/v1/messages`) for cross-provider compatibility (#6231).
- Tiktoken Support: Added Tiktoken tokenizer support for models requiring Tiktoken encoding (#6460).
- vLLM Chat Path Optimization: Reduced Python-side overhead in the vLLM chat path for lower latency (#6437).
- vLLM Pre/Post Processing: Adopted vLLM for pre- and post-processing in the Frontend for consistency (#5544).
- Dynamic gRPC Startup: Made gRPC startup dynamic for high ISL/OSL scenarios in the gRPC Frontend (#5536).
Kubernetes Deployment
DynamoGraphDeployment Request
- DGDR Deployment Guide: Added comprehensive Kubernetes deployment guide for DynamoGraphDeploymentRequest (DGDR) workflows covering the golden path from model selection through profiling and autoscaling (#7304).
- DGDR API Maturation: Added structured `.status.state` enums for DGD (#6324) and DGDR (#6396), added `observedGeneration` for reconciliation tracking (#6398), introduced the `v1beta1` DGDR API with automatic conversion from `v1alpha1` (#6352), and adopted `v1beta1` in the controller (#6498).
- Model/WorkerSet Architecture: Introduced hierarchical Model/WorkerSet architecture for multi-namespace support (#6054).
- Rolling Updates: Implemented managed rolling updates for DGD worker deployments (#6110).
- DGD Print Columns: Added print columns with ready condition for `v1alpha1` API types like DGD (#5542).
- Optional DGDR Image Field: Made the image field optional in DGDRs for flexible container configuration (#6557).
- Operator Version in DGD: Included Operator version in DGD for version tracking in cluster state (#6121).
- AIC DGD Generation: Enabled AIC DGD generation call for automated infrastructure configuration (#6216).
- Profiler Job Overrides: Added profiler job overrides for customizable profiling runs (#6641).
Dynamo Snapshot
- Dynamo Snapshot: Introduced Dynamo Snapshot for fast GPU worker recovery (#4978, #7068), refactored configuration with `/dev/shm` support and a mount-policy rewrite (#5946), added external restore with signal-based IPC (#6286), and extended it to the SGLang backend (#6594).
Gateway API Inference Endpoint (GAIE)
- EPP Integration: Added the EPP component for Kubernetes Gateway API-based inference routing (#5611), implemented the decomposed pipeline for flexible routing stages (#5446), added a startup probe for reliable liveness detection (#5770), and enabled the EPP `pods` interface for pod-level traffic management (#6302).
Dynamo Operator
- Operator Management Improvements: Implemented config versioning via ConfigMap injection (#6464), simplified CRD management (#6466), reduced Helm chart dependencies (#7048), and replaced kube-rbac-proxy with controller-runtime authorization (#7069).
- GPU Discovery Migration: Migrated GPU discovery from Dynamo Profiler to Operator with automatic injection (#6224).
- Namespace-Scoped GPU Discovery: Added optional GPU discovery for namespace-scoped Operators (#6343).
- Tolerations and Affinity Support: Added tolerations and affinity support for all platform Helm chart components (#5561).
- Rolling Updates Documentation: Added documentation for Operator rolling updates (#6541).
Scheduling
Router
- Data-Parallel Routing: Added per-DP-rank gap detection (#5873), TensorRT-LLM DP rank routing (#5936), and RNG tiebreaking for DP routing targets (#6253) for improved data-parallel load distribution.
- Router Priority Queue: Implemented request priority queue in the Router (#6010) and plumbed priority through SGLang and vLLM handlers for end-to-end support (#6348).
- Global Router: Added global Router for hierarchical Planner topology (#5697).
- Global Router + vLLM Example: Added DGD example for global Router + vLLM deployment (#5760).
- Expert Routing Info: Enabled returning routed experts info through SGLang for expert-parallel routing visibility (#6137).
- Prefill Tokens Threshold: Added prefill tokens threshold based on max batched tokens fraction for adaptive batching (#5867).
- Default Event Threads: Defaulted `router_event_threads` to 4 for improved Router throughput (#6724).
Planner
- Planner Autoscaling: Added GlobalPlanner component for centralized cross-cluster scaling (#5702), implemented load-based scaling in SLA Planner (#6145), added throughput metrics source for disaggregated scaling decisions (#6500), and moved core logic from DPP to AIC with static profiling support (#6285).
- Planner P/D Separation: Separated Planner into independent prefill/decode Planners (#5622) and automated resource allocation by deriving GPU counts (#5919) and worker counts (#5934) from DGD status.
- Planner Config Migration: Migrated Planner from argparse CLI to config file for unified configuration (#6356).
- Planner Schema in DGDR: Added Planner schema to DGDR and Profiler input for configuration consistency (#6463).
- Profiler Model Validation: Removed default model name in Profiler and added validation for served model name or path (#5950).
KV Block Manager
- Speculative Prefill: Implemented speculative prefill for proactive KV cache population (#6230).
- Flash Indexer Optimizations: Optimized flash indexer performance for faster KV cache prefix lookups (#6305).
- Standalone KV Indexer: Added standalone KV indexer with query endpoint for decoupled prefix matching (#6446).
- KVBM Priority Offload: Implemented priority-based KV cache offload filtering (#5563) and optimized by reading the priority env var once at init (#5798).
- KVBM Logical Abstraction: Introduced KVBM-logical abstraction layer for flexible KV block management (#6033).
- Nested KV Index Mapper: Implemented nested mapper for KV indexing to support hierarchical prefix matching (#5785).
- KVBM Memory Enhancements: Added KVBM memory management enhancements for improved allocation and lifecycle (#5532).
- Default KVBM Enablement: Enabled lib/memory, media-nixl, and KVBM by default for out-of-the-box disaggregated serving (#5602).
- KVBM Kernels Crate: Added the `kvbm-kernels` crate and upgraded cudarc to 0.19 for GPU kernel support (#6309).
- NVTX Annotations: Added NVTX annotations to KVBM for GPU profiling visibility (#6334).
- Default KV Events Config: Defaulted `kv-events-config` to empty to align with vLLM defaults (#6404).
- KV Hit Rate Histogram: Exposed predicted KV hit rate as a Prometheus histogram for cache efficiency monitoring (#6507).
- Mocker KV Cache Tracing: Added optional KV cache allocation/eviction tracing (#6052, #6207), KV transfer latency simulation for disaggregated benchmarks (#6504), and ZMQ-based KV event publishing (#6528) to the mocker.
LoRA Support
- LoRA Routing and Allocation: Added LoRA-aware routing hints and tracking (#5875), memory-aware load estimation (#5880), HRW-based optimal adapter allocation (#5992), and LoRA-aware KV cache events (#6517).
- Multimodal LoRA: Extended LoRA support to multimodal workloads with protocol-level model identification (#6382), request handling for multimodal workers (#6399), and deployment examples for local (#6400) and Kubernetes (#6452).
Infrastructure Modernization
- Unified Configuration System: Introduced a unified configuration system with typed base classes (#5975) and migrated vLLM (#6075) and Frontend CLI (#6201) to the new system.
- Configuration System Migration: Migrated SGLang (#6280), TensorRT-LLM (#6297), global Router (#6342), and Router (#6346) to the unified configuration system.
- Go-to-Definition Support: Enabled go-to-definition for `dynamo.runtime`, `dynamo.nixl`, and external dependencies (#6026).
- Standardized Error Type: Introduced a standardized Dynamo error type for consistent error handling across the stack (#6303).
- AIPerf Client Rate Control: Added `--request-rate` and `--request-rate-mode` flags to the AIPerf client for flexible load testing (#6585).
- Disaggregation Mode Enum: Added a `--disaggregation-mode` enum to the vLLM backend for explicit mode selection (#6483).
- vLLM Endpoint Flag: Added `--endpoint` flag support to `dynamo.vllm` for flexible serving configuration (#6360).
Performance
- Mocker Performance: Improved mocker with model pre-fetching, staggered launches, and timing accuracy (#5871, #5808, #6100), and modularized the crate into common/scheduler/kv_manager/cache modules (#6440).
SGLang
- SGLang GPU Memory Service: Integrated SGLang with GPU Memory Service for unified memory management (#5664).
- SGLang Request Migration: Implemented request migration for SGLang to support live request handoff (#5659).
- SGLang Weight Update Endpoints: Added SGLang `/engine` weight update endpoints for online model updates (#6094).
TensorRT-LLM
- TensorRT-LLM Guided Decoding: Added guided decoding backend config and choice support for TensorRT-LLM (#5762).
- CUDA IPC for TensorRT-LLM: Introduced CUDA IPC for TensorRT-LLM PrefillHandler enabling zero-copy cross-process transfers (#5773).
- NixlConnector Config: Added `--kv-transfer-config NixlConnector` to disaggregated scripts and recipes (#6560).
vLLM
- vLLM Multi-Node Multiprocessing: Adopted vLLM multiprocessing in multi-node scenarios for improved parallelism (#6191).
- Headless Multi-Node Mode: Added `--headless` mode for multi-node TP/PP in `dynamo.vllm` for worker-only deployments (#6204).
- ModelExpress P2P Weight Transfer: Enabled ModelExpress P2P weight transfer in the Dynamo vLLM worker for faster model loading (#6186).
Fault Tolerance & Observability
- Router Metrics and Tracing: Added per-worker load monitoring (#5842), centralized Router-level request tracking (#6146), standardized all Router metrics under the `dynamo_router_*` namespace (#6227), and added OTel tracing for routing overheads (#6194).
- Engine Prometheus Metrics: Exposed Python-level engine metrics via LLMComponentMetrics (#5817), added auto/custom label injection (#5989), introduced tokenizer (#6092) and detokenization (#6160) latency metrics, and exposed TensorRT-LLM kv_cache metrics (#6469).
- NIXL Telemetry Port: Added NIXL Telemetry Prometheus port for transfer library monitoring (#5567).
- Error Type Metric Label: Added an `error_type` label to request metrics for fine-grained error classification (#5568).
- Grafana Dashboard: Added a Grafana dashboard and monitoring setup for comprehensive observability (#4639).
- NIXL Sanity Check: Added NIXL availability check to sanity_check for environment validation (#6087).
- Graceful Shutdown Draining: Enabled backends to accept new requests during shutdown grace period for graceful draining (#6093).
Recipes
- GB200 Disagg Recipe: Added GB200 GPT-oss disaggregated serving recipe for next-gen hardware support (#4954).
- DeepSeek V3.2 Recipe: Added DeepSeek V3.2 TensorRT-LLM recipe for optimized serving (#6969).
- Qwen3-VL-30B Recipe: Added Qwen3-VL-30B recipe for aggregated and encoder cache deployment with vLLM (#7191).
Bug Fixes
Multimodal
- Multimodal Disaggregated Serving: Fixed multiple reliability issues in multimodal prefill/decode disaggregated serving and restored EPD pipeline on single-GPU (#5951, #6753, #6978).
- Multimodal Input Processing: Fixed multimodal input loader blocking the async event loop, PSD file crash in the image pipeline, and vLLM OmniModel image processing performance (#5945, #6212, #6451).
- Multimodal API and Stream Handling: Fixed `stream_options` forwarding through the multimodal request pipeline, CLI flag collisions with `--omni-` prefixes, and `normalize_finish_reason` on the OmniHandler (#6474, #6476, #6896).
- Multimodal Cross-Node Transfer: Fixed the encode + prefill/decode flow in TensorRT-LLM for multimodal embedding transfer (#6790).
- Multimodal Video and Audio: Fixed vLLM chat processor to correctly handle video and audio inputs and resolved invalid UUID errors from empty multimodal inputs (#6708, #6904).
- Multimodal Router Performance: Fixed duplicate image downloads and unnecessary image processing in the multimodal Router for vLLM, reducing latency for repeated media content (#7172).
- Multimodal Pipeline Fixes: Fixed multiple minor issues in the vLLM multimodal pipeline, worker service registration collisions, Llama 4 aggregated multimodal launch script, and LLaVA model EPD support (#5748, #5986, #6103, #6765).
Frontend & Agents
- LoRA Endpoint Reliability: Fixed LoRA load/unload endpoints silently swallowing errors and extended S3 download timeouts to prevent failures with large adapter files (#5626, #6986).
- Request Sampling Parameters: Fixed request sampling parameters not being forwarded to backend workers, causing generation settings to be silently ignored (#5797).
- Reasoning Token Handling: Fixed reasoning parser propagation from worker runtime config, interleaved reasoning content ordering, and reasoning content being dropped when a tool-call starts mid-stream (#6300, #6442, #7051).
- Chat Template and Model Fixes: Fixed DeepSeek V3.2 chat template for function calling and structured output, Nemotron Nano model to use the correct reasoning parser (#6034, #6288), and added `force_nonempty_content` for Nemotron models (#7225).
- Frontend Stability: Fixed HTTP request cancellation using a temporary token instead of the real cancellation token, and fixed a Frontend crash when running with the TensorRT-LLM runtime image (#6344, #6481).
- Model Endpoint Correctness: Fixed `/v1/models` endpoint exposing inactive models and model name resolution to prefer `--served-model-name` (#5881, #7021).
- Responses API Compatibility: Fixed Responses API rejecting valid assistant `output_text` messages that lacked `id`/`status` fields (#7049).
- vLLM Processor Compatibility: Fixed vLLM processor compatibility with vLLM 0.16 API changes and incorrect output when `stream_interval` is greater than 1 (#6873, #6874).
- Prompt Length Validation: Fixed missing validation for prompts exceeding `max_seq_len`, now returning HTTP 400 instead of silently failing (#6997).
- Guided Grammar Depth Limit: Fixed guided grammar to reject schemas with excessive nesting depth, preventing potential resource exhaustion (#7135).
Kubernetes Deployment
DynamoGraphDeploymentRequest
- DGD/DGDR Configuration: Fixed DGD cross-selection, fallback for missing `subComponentType`, service name length validation for DNS compliance, name sanitization for DNS-1035, DGDR prefix for naive fallback (#5449, #6113, #6317, #7062, #6679), and stripped `apiVersion`/`kind`/`metadata` from `overrides.dgd` before merging (#7121).
- Operator Override Ordering: Fixed DGD overrides to apply before running interpolation, ensuring tolerations propagate correctly (#7226).
Dynamo Snapshot
- Snapshot Checkpoint/Restore: Fixed Snapshot checkpoint failure handling to use SIGKILL, multi-GPU UUID mapping, restore to correctly pass the checkpoint path (#6478, #6492, #7018), and snapshot children before process group kill to prevent GPU memory leaks (#7232).
Dynamo Operator
- Helm Chart Reliability: Disabled etcd subchart by default, restored Helm docs autogeneration, and reverted a template change that caused deployment failures (#5739, #6329, #6459).
- Operator Stability: Fixed restart state tracking for parallel restarts, `DynamoComponentReady` condition updates, `imagePullPolicy` application, etcd cleanup logic, and consolidated discovery backend configuration (#4821, #5051, #5949, #6263, #6167).
- Multi-Node Deployment Fixes: Fixed SSH setup for TensorRT-LLM multi-node workers, unquoted mpirun and Ray leader arguments that caused multi-node failures, and added nodeSelector support (#6225, #6248, #6711).
- Operator GPU Discovery and Tolerations: Fixed GPU discovery preflight job, correct storage of GPU-equipped nodes, propagation of tolerations with auto-discovered GPU limits, and PVC block emission in configmap (#6640, #6714, #6979, #6755).
- Operator CRD and API Configuration: Fixed CRD validation for nil/empty containers, `AutoApply` field type for proper nil handling, webhook version matching for `v1alpha1` DGDR, annotation propagation, EPP config plugin weight support (#6255, #6712, #6808, #6718, #6783), and allowed `x-kubernetes-preserve-unknown-fields` in CRD validation (#7128).
Scheduling
Router
- Router Startup Race Condition: Fixed race condition between worker discovery and runtime config discovery in the KV Router that caused routing failures on startup (#5924).
- Router Stream Panics: Fixed stream handling in the Router that caused panics when polling after stream termination (#5872).
- Router Data-Parallel Routing: Fixed Router to correctly pass the data-parallel rank into the vLLM engine and corrected KV Router discovery name derivation (#6014, #6475).
- Router Scheduling Backpressure: Fixed scheduling by folding it into the queue so backpressure propagates correctly (#6470).
- Router Metrics Collection: Fixed `RouterRequestMetrics` availability to ensure Router metrics are always collected (#6558).
Planner
- Profiler Timeout and Crash Fixes: Fixed profiler deployment timeout handling for large MoE models and config generation to strip None arguments that caused crashes (#6086, #6887).
- Profiler DGDR Validation: Fixed DGDR validator and DGD generation in the profiler, improved service name logging (#6876, #6112), and fixed profiling condition updates to populate results and clear phase after completion (#7195).
- Planner CLI Configuration: Fixed `disagg_planner.yaml` and Planner test configs to use the updated CLI format (#6775, #7041, #7042).
- Planner Backend Resolution: Fixed propagation of resolved backend and skipped interpolation for aggregated deployments (#7142).
- Profiler TTFT/ITL Default Handling: Fixed Profiler validation error by using `model_fields_set` to distinguish TTFT/ITL default usage (#6827).
KV Block Manager
- KV Cache Sleep/Wake Stability: Fixed KV cache block allocation signal after sleep/wake cycles and CUDA synchronization race conditions during GPU memory transitions (#5681, #5759).
- KV Event Propagation and Block Management: Fixed KV event propagation for data-parallel multi-node deployments and KVBM to read block size from vLLM at runtime instead of using a hardcoded value (#5589, #5713, #5851).
- KV Cache Memory Leak: Fixed memory leak where KV cache blocks were not freed on stream drop (#6246).
- GMS Reliability: Fixed GMS CLI startup failure, removed unnecessary CUDA synchronize calls that degraded performance, and fixed GMS socket UUID resolution via the CUDA driver API (#5749, #6362, #6914).
- KVBM CUDA Device Handling: Fixed PinnedAllocator to use the correct `device_id`, KVBM to respect `CUDA_VISIBLE_DEVICES` for NUMA binding, `device_blocks` double-counting in the TensorRT-LLM connector, and added authorization guards to memory occupation control endpoints (#6877, #6950, #6406, #7023).
Performance
SGLang
- SGLang Metrics and Monitoring: Fixed metrics prefix format from `sglang_` to `sglang:` and `TokenizerMetricsCollector` lazy-import to avoid collector registration errors (#5701, #6269).
- SGLang Configuration Fixes: Fixed tool-call-parser flags to prevent configuration conflicts and DeepSeek-R1 recipe with watchdog timeout to prevent hangs (#5849, #6076).
- SGLang Decode Handler: Fixed decode handler to ignore empty non-final stream chunks (#6304).
- SGLang Build and API: Fixed container build conflict by removing `python3-blinker` and corrected multimodal item keys in the SGLang API (#5995, #5981).
TensorRT-LLM
- TensorRT-LLM Stability: Fixed decode worker stability by temporarily disabling request cancellation and eliminated crashes caused by unsafe `abort()` calls (#5764, #5827).
- TensorRT-LLM Multimodal Support: Fixed multimodal flag being silently ignored, multimodal hash support for the TRT-LLM 1.3 `apply_mm_hashes` API, skipped encoder LLM creation for unsupported models (#6468, #6907, #6918), and fixed the multimodal preprocessor after the initial approach was reverted (#6920, #6993).
- TensorRT-LLM Guided Decoding: Fixed handler to properly convert guided decoding dictionaries to `GuidedDecodingParams` (#6127).
- TensorRT-LLM Multi-Node Deployment: Fixed multi-node worker SSH crash in non-root containers and removed deprecated `beam_width` parameter from health check (#6772, #6890).
vLLM
- vLLM Worker Stability: Fixed worker graceful shutdown to prevent orphaned processes, decode worker logging format that caused CrashLoopBackOff, and worker registration for external/hybrid load balancing (#5818, #6267, #6833).
- vLLM Disaggregated Serving: Fixed disaggregated serving by adding missing `--is-decode-worker` and `--kv-transfer-config` flags (#5843, #6554).
- vLLM Launch Script Fixes: Fixed DeepSeek-R1 recipe checkpoint path, removed an unnecessary bash wrapper, and corrected launch scripts for disaggregated and speculative decoding (#5721, #6035, #6562).
- vLLM Stream Handling: Fixed sampling parameter parsing in the EPD flow (#5813).
- vLLM Performance Configuration: Fixed Docker image to use the CUDA sampler for better performance and corrected engine stats logging (#5613, #6566).
- vLLM Multi-Worker Port Collisions: Fixed HTTP port collisions when multiple workers share a process (#7185).
Build & Container
- Runtime Image Fixes: Fixed missing native libraries (nvlink, UCX, NIXL, CRT, Triton paths), corrected image tags across SGLang, TensorRT-LLM, and vLLM Dockerfiles (#6503, #6521, #6958, #6983, #6401), and updated UCX reference for performance (#7218).
- TensorRT-LLM Dependency Fixes: Fixed missing `msgpack` dependency and pinned `pydantic-settings` below 2.13.0 for compatibility (#5799, #6339).
- Build System Fixes: Fixed cross-platform NUMA module compilation, `ai-dynamo-runtime` wheel packaging to exclude NIXL shared libraries, CI container `GIT_COMMIT_SHA` population, and disabled `media-ffmpeg` feature by default (#6354, #6881, #7016, #6574).
Other
- Core Infrastructure Fixes: Fixed ZMQ transport receive timeout to prevent hangs, Prometheus metric collisions via multi-registry scrape, stale NATS consumers, multi-node Slurm launch arguments, performance degradation from excessive logging in the EPD pipeline, and tool call validation (#5685, #5678, #5948, #5861, #6742, #5504).
Documentation
New Content
- AKS Storage Guidance: Added Azure AKS storage guidance for Dynamo caches (#5581).
- TensorRT-LLM Known Issues: Added known issues section for TensorRT-LLM backend (#5801).
- Mocker Documentation: Added mocker component documentation (#5832).
- GPU Memory Service: Added overview documentation for GPU Memory Service (#5920).
- Disaggregated Serving Guide: Added disaggregated serving guide (#6024).
- Quick Start Sections: Added quick start sections to KVBM and Router guides (#6043).
- KVBM Disaggregated Setup: Added instructions for TensorRT-LLM KVBM disaggregated setup (#6055).
- Architecture Docs: Added Discovery Plane documentation and refactored Event Plane with D2 diagrams (#6229).
- Inference Gateway: Added inference gateway documentation page (#6319).
- Agent Docs: Added agent readme and documentation (#6320).
- Frontend Configuration: Documented Frontend requirement for model config file access (#6327).
- Speculative Prefill Demo: Added multiturn_bench README with speculative prefill demo (#6502).
- Dev Containers Troubleshooting: Documented Docker 29.x Dev Containers hang root cause and fix (#6505).
- KV Indexer Docs: Added standalone KV indexer documentation (#6511).
- Embedding Cache: Documented embedding cache support in vLLM and TensorRT-LLM (#6555).
- SGLang Observability: Expanded SGLang observability guide with tracing and dashboards (#6556).
- DGDR `v1beta1`: Documented the `v1beta1` DynamoGraphDeploymentRequest API (#6713).
- vLLM Multimodal Router: Added docs for vLLM multimodal Router (#6568).
- Nemotron-3-Super-FP8 Recipes: Added Nemotron-3-Super-FP8 deployment recipes for SGLang aggregated, SGLang disaggregated, and TensorRT-LLM disaggregated with model download manifests (#7254).
- FastVideo Example and Guide: Added FastVideo text-to-video example with deployment guide and sidebar reorganization (#7283).
- Getting Started Introduction: Added introduction page to the Getting Started section with platform overview (#7292).
New Release Artifacts
- `snapshot-agent` Container: New container image for the Dynamo Snapshot agent. Runs as a privileged DaemonSet that uses CRIU and cuda-checkpoint to snapshot and restore GPU worker processes, enabling fast recovery without model reload. Pairs with the `snapshot` Helm chart for deployment (Preview in v1.0.0).
- `snapshot` Helm Chart: New Helm chart for deploying the Dynamo Snapshot DaemonSet and its supporting resources (ConfigMaps, RBAC, signal host paths). Manages the lifecycle of the snapshot-agent across cluster nodes (Preview in v1.0.0).
- `dynamo-mocker` Crate: New Rust crate that simulates inference engine behavior (token generation timing, KV cache allocation, and transfer latency) without requiring GPU hardware. Used for benchmarking Router and Planner behavior, testing disaggregated pipelines, and validating scaling policies.
- `dynamo-kv-router` Crate: New standalone Rust crate for KV-aware request routing. Extracts the Router's prefix-matching, load-balancing, and KV cache event processing into a reusable library for disaggregated serving deployments.
For the full list of Dynamo v1.0.0 release artifacts, see: Release Artifacts.
Version Upgrades
Major Dependencies
- SGLang 0.5.9: Upgraded SGLang to 0.5.9 with updated documentation (#6518).
- TensorRT-LLM 1.3.0rc5.post1: Upgraded TensorRT-LLM from 1.2.0rc6.post2 to 1.3.0rc5.post1 through intermediate release candidates (#5700, #6402, #6579), including major stability improvements and bug fixes (#6495).
- vLLM 0.16.0: Upgraded vLLM from 0.12 to 0.16.0 through intermediate releases (0.14.1, 0.15.1), including compilation config updates for each version (#5691, #5819, #6102, #6652).
- NIXL 0.10.1: Upgraded NIXL from 0.9.x to 0.10.1 with transfer library improvements (#6701, #6832).
- AI Configurator 0.7.0: Upgraded AI Configurator to 0.7.0 (#6494, #6634, #6791, #6975, #7071, #7050).
- AIPerf 0.6.0.post1: Upgraded AIPerf to 0.6.0.post1 with first-class integration as Dynamo's benchmarking framework and added guides for benchmarking and Router A/B testing (#5982, #7138, #7155, #7203).
Other Dependencies
- Grove 0.1.0-alpha.6: Updated grove dependency to 0.1.0-alpha.6 for Snapshot integration (#6015).
- Minor Dependency Upgrades: Bumped Rust oneshot to 0.1.13 (#5694), updated AWS SDK (#6878), and Go OTEL SDK to v1.40.0 (#6906).
For a list of dependencies for Dynamo v1.0.0 and past releases, see our Support Matrix.
Known Issues
DynamoGraphDeploymentRequest (Preview in v1.0.0)
Planner With Empty Defaults Fails on Non-AIC-Supported Model/Hardware
Applying a DGDR with `features.planner: {}` (empty defaults) on a model/hardware combination not supported by AIConfigurator causes the profiling job to fail with `ValueError: Throughput-based planner scaling requires AIC support`. The default planner config assumes throughput scaling with rapid in-depth sweeping, which requires AIC support. The Dynamo profiler validation raises a hard error before AIC is called, even though AIC PR #516 added the backend-side fix.
Workaround: Set `features.planner: {pre_deployment_sweeping_mode: thorough}` to bypass the AIC support gate check.
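For reference, a minimal DGDR manifest applying this workaround might look like the sketch below. The `apiVersion`, metadata, and exact nesting around `features.planner` are illustrative assumptions; only the `pre_deployment_sweeping_mode: thorough` setting comes from the workaround itself.

```yaml
# Sketch only: apiVersion/kind/metadata and spec nesting are assumed.
apiVersion: nvidia.com/v1beta1          # assumed group/version
kind: DynamoGraphDeploymentRequest
metadata:
  name: example-dgdr
spec:
  features:
    planner:
      pre_deployment_sweeping_mode: thorough   # bypasses the AIC support gate
```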
Profiler Rejects Valid SLA Combination
Specifying both `optimizationType` and `ttft`/`itl` SLA targets on a DynamoGraphDeploymentRequest triggers a Pydantic validation error because the schema treats them as mutually exclusive. The `optimizationType` field is not yet implemented in Dynamo 1.0.0, and any CRDs or manifests that reference it will fail validation. Users who upgrade from earlier versions with existing DGDR specs that include `optimizationType` alongside latency targets will see immediate admission errors.
Workaround: Remove the `optimizationType` field from SLA specifications. Use only `e2eLatency` or the `ttft`/`itl` pair (which must be specified together); these two modes are mutually exclusive.
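As a sketch, the two valid SLA shapes might look like the fragment below. The field names (`e2eLatency`, `ttft`, `itl`) come from the workaround above, but their placement within the DGDR spec is an assumption and the millisecond values are arbitrary examples.

```yaml
# Option A: end-to-end latency target only (value is an arbitrary example).
sla:
  e2eLatency: 2000
# Option B: TTFT/ITL pair, which must be specified together.
# sla:
#   ttft: 300
#   itl: 20
```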
Interpolation Does Not Propagate Tolerations
Tolerations defined in `overrides.dgd` on a DynamoGraphDeploymentRequest are not propagated to candidate DynamoGraphDeployments created during the interpolation phase of profiling. This causes worker pods to remain in `Pending` state on clusters with tainted nodes, because the generated deployments lack the required tolerations to schedule onto those nodes. PR #7226 moved override application to before the interpolation step, but the fix is incomplete for all override paths and has been reopened. A complete fix is pending for a patch release.
Workaround: Manually add the required tolerations directly to each generated `DynamoGraphDeployment` after interpolation completes, or remove taints from target nodes during profiling.
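A toleration copied onto a generated deployment might look like the fragment below; the taint key and effect are placeholders for whatever taint your GPU nodes actually carry.

```yaml
# Placeholder taint key/effect: match these to your cluster's node taints.
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

This can be added with `kubectl edit` on each candidate DynamoGraphDeployment once interpolation completes.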
Thorough Profiler Generates Infeasible TP=1 for MoE Models
The profiler's memory estimation does not account for WideEP communication buffers used by Mixture-of-Experts models, causing it to generate TP=1 configurations that are guaranteed to OOM at runtime. When the thorough profiler enumerates candidate configurations, it underestimates peak memory for MoE architectures, and the resulting deployment crashes immediately upon loading the model.
Workaround: Manually reduce `kv_cache_ratio` to approximately 0.75 in the profiler configuration to reserve headroom for WideEP buffers, or exclude TP=1 from the candidate search space by setting a minimum tensor parallelism degree.
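A profiler configuration fragment applying this workaround might look like the sketch below. The `kv_cache_ratio` value comes from the workaround above, while the knob for excluding TP=1 is hypothetical and should be replaced with your profiler's actual minimum-parallelism option.

```yaml
kv_cache_ratio: 0.75          # reserve headroom for WideEP buffers
min_tensor_parallelism: 2     # hypothetical name: excludes TP=1 from the sweep
```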
Infeasible SLA Targets Silently Accepted
When a user specifies SLA targets (TTFT, ITL, or E2E latency) that cannot be met by any profiled configuration, the profiler logs a warning but does not surface it as a Kubernetes condition on the DynamoGraphDeploymentRequest status. Operators monitoring the DGDR via kubectl or cluster dashboards will see no indication that the requested SLAs are unachievable, leading to deployments that run but never meet their performance objectives. This issue has been moved to the backlog and will not be fixed in 1.0.0.
Workaround: After profiling completes, manually inspect profiler pod logs for warnings containing "infeasible" or "no valid configuration" to verify that the requested SLA targets are achievable.
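The log check can be scripted as below. The `printf` lines merely simulate profiler output so the filter can be shown end to end; in practice you would pipe `kubectl logs <profiler-pod>` into the same `grep`.

```shell
# In a real cluster:
#   kubectl logs <profiler-pod> | grep -iE 'infeasible|no valid configuration'
# Simulated profiler log lines, for illustration only:
printf '%s\n' \
  'INFO  profiling sweep complete' \
  'WARNING requested TTFT/ITL targets are infeasible for all profiled configs' \
  | grep -iE 'infeasible|no valid configuration'
```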
Multimodal
TRT-LLM Disaggregated Multimodal Raises AttributeError
Running the disaggregated embeddings/prefill/decode pipeline (`diagg_e_pd.sh`) with TRT-LLM on multimodal models raises `AttributeError: 'NoneType' object has no attribute 'keys'` during input preprocessing. The root cause is that TRT-LLM does not support the token IDs and multimodal embeddings path in its LLM API; the preprocessor must fall back to passing a text prompt via `default_multimodal_input_loader` for the embeddings case. A fix was merged (#6840) and cherry-picked as #6920 in RC6, but the fix regressed and the issue persists in the v1.0.0 release.
Workaround: Use aggregated mode instead of disaggregated embeddings/prefill/decode for TRT-LLM multimodal workloads. A corrected fix is planned for a follow-up patch release.
Wan2.1 Video Diffusion Requires Manual imageio Install
Deploying Wan-AI/Wan2.1-T2V-1.3B-Diffusers for text-to-video generation fails with `ModuleNotFoundError: No module named 'imageio'`. The `imageio` package is intentionally excluded from the TRT-LLM runtime container to reduce image size, as video generation is an experimental feature. This is documented in `docs/backends/trtllm/trtllm-video-diffusion.md`.
Workaround: Install the package manually inside the container: `pip install imageio imageio-ffmpeg`.
Embeddings Cache with TensorRT-LLM and enable_block_reuse
Deploying a TensorRT-LLM multimodal workflow with Embeddings Cache and `enable_block_reuse: true` is not supported due to limitations in the backend. This will be supported in upcoming releases.
Workaround: Use Embeddings Cache with `enable_block_reuse: false`. All existing recipes, benchmarks, and guides already reflect this configuration.
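In a TensorRT-LLM engine configuration this setting typically lives under the KV cache options; the nesting below follows the TRT-LLM LLM-API convention and should be checked against the recipe you are using.

```yaml
kv_cache_config:
  enable_block_reuse: false   # required when the Embeddings Cache is enabled
```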
Dynamo Snapshot
Snapshot Restore Fails on AKS for vLLM
Snapshot restore of vLLM workers on AKS does not fully reinitialize model state. A single restored worker appears healthy and passes readiness checks but returns empty responses with no generated tokens. Restoring multiple workers simultaneously can hang, causing inference requests to time out. This issue has only been observed on AKS.
Workaround: No workaround available. Fix planned for a follow-up patch release.
KVBM
Pinned Memory Allocation Failure on Blackwell GPUs
KVBM initialization may fail on Blackwell GPUs (GB200, B100, B200) with `CUDA_ERROR_INVALID_VALUE` when allocating pinned host memory. The root cause is that the PinnedAllocator was hardcoded to `device_id` 0 instead of using the actual device ID, which causes NUMA binding to select the wrong memory node. A partial fix (#6809) corrects the device ID in the allocator, but some Blackwell configurations may still encounter initialization failures depending on the NUMA topology.
Workaround: Ensure `CUDA_VISIBLE_DEVICES` is set to expose only the intended GPUs, and verify that the NUMA node assignment matches the GPU topology.
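A quick sanity check might look like the snippet below; the GPU indices are examples, and `nvidia-smi topo -m` is the usual way to inspect GPU/CPU/NUMA affinity on the node.

```shell
# Expose only the intended GPUs (indices are examples).
export CUDA_VISIBLE_DEVICES=0,1
# Inspect GPU/CPU/NUMA affinity on the node (requires nvidia-smi):
#   nvidia-smi topo -m
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```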
Performance Degradation When KVBM Is Enabled
Enabling KVBM may degrade inference performance compared to running without it — observed in vLLM disaggregated mode and TensorRT-LLM aggregated mode. KVBM is now enabled by default (#5602), so users may see lower throughput out of the box. The overhead comes from KV cache block management and transfer coordination, which adds latency to each request even when KV cache reuse rates are low.
Workaround: Disable KVBM by unsetting `DYN_KVBM_ENABLE` if KV cache sharing is not needed for your workload.
SGLang
HiCache NIXL Storage Backend Crash on Init
SGLang HiCache with `--hicache-storage-backend nixl` crashes during scheduler initialization with `TypeError: expected str, bytes or os.PathLike object, not MHATokenToKVPoolHost`. The HiCacheNixl backend passes the memory pool host object where a file path string is expected. This is an upstream SGLang bug, fixed in sgl-project/sglang#19517 but not yet included in the SGLang version pinned by Dynamo 1.0.0.
Workaround: Use a different HiCache storage backend (e.g., `disk`). HiCache works correctly with non-NIXL backends.
SGLang DSR1 Recipe Model Loading from PVC Failure
Deploying the SGLang DSR1 recipe or using it as a base config in the SLA profiler may fail because the model-download script downloads the model into a non-standard HuggingFace directory that ModelExpress cannot load, causing prefill and decode workers to enter CrashLoopBackOff.
Workaround: (1) Download the HF model into a standard HF directory and set `HF_HOME` to the PVC-mounted path, (2) update `--model-path` to point at the directory containing the downloaded HF cache (not supported for the SLA profiler), or (3) provide `HF_TOKEN` so the model can be downloaded directly.
TensorRT-LLM
Qwen3‑235B‑A22B‑FP8 fails with CuTe Experimental NotImplementedError on Blackwell
Deploying the qwen3-235b-a22b-fp8 recipes (both agg and disagg) on GB200/Blackwell fails at runtime with `NotImplementedError: CuTe Experimental module is only supported on Cuda toolkit 13.1 and above!`
Root cause: This is a packaging mismatch in the container image. The `nvidia-cutlass-dsl==4.3.4` wheel baked into the image is the CUDA < 13.1 variant that stubs out `cutlass.cute.experimental` by unconditionally raising `NotImplementedError`, while the image itself ships CUDA 13.1 and TensorRT-LLM's Blackwell FP8 GEMM path (`cute_dsl_fp8_gemm_blackwell`) requires `cute.experimental` to be functional.
Looking Ahead
Dynamo v1.1.0 is targeted for April 29, 2026. Here's a small preview of what's already taking shape:
Longer Contexts, Lower Cost
FlexKV manages KV cache across HBM, host memory, and SSD so long-context and high-concurrency workloads don't hit GPU memory limits. Instead of dropping requests when HBM fills up, the system spills KV blocks to cheaper storage tiers and pulls them back on demand.
Resilient Routing at Scale
The KV indexer gains P2P state recovery and automatic ZMQ gap replay, keeping prefix matching correct through node failures without manual intervention. Multi-model and multi-tenant isolation ensures that shared clusters route requests to the right cache even when multiple models share the same infrastructure.
Unified Observability
Forward pass metrics on the event plane and Loki log aggregation with unified OTLP ingestion bring metrics, traces, and logs into a single pipeline. Operators debugging latency or throughput issues in disaggregated deployments no longer need to correlate data across separate tools.
Full Changelog: v0.9.1...v1.0.0