Dynamo Release v1.0.0

@dagil-nvidia released this 13 Mar 20:57 · b1818dc
Release Notes

Dynamo v1.0.0 is the first major release of the open-source distributed inference platform. This release delivers production-grade disaggregated serving with comprehensive multimodal and omni-model support, KV cache optimizations, improved handling of agentic workloads, Kubernetes-native deployment at scale, and a stabilized public API.

Summary

Multimodal & Diffusion

Dynamo now serves a range of generative modalities—text, image, and video—across all three major inference frameworks. Text-to-image generation is available through both vLLM Omni and SGLang image diffusion pipelines, and text-to-video through SGLang, vLLM Omni, and TensorRT-LLM Wan T2V, with experimental MJPEG streaming for real-time video output. Encoder disaggregation matured with a new EncoderCacheManager and content-addressed hashing, enabling multimodal encoder outputs to be cached and reused across workers. Embedding transfer between workers uses NIXL to minimize latency, and multimodal-aware KV cache routing places requests based on media content for better cache hit rates.

Agents

Dynamo added building blocks for agentic workloads: agent hints at the API layer, priority scheduling, and (experimental) KV cache retention and lifecycle awareness for long agent sessions. Dynamo expanded its agentic capabilities with reasoning content management for DeepSeek v3.2, GLM-4.7, and Kimi-2.5—including interleaved thinking support, where reasoning and tool calls alternate within a single response. New tool call parsers for GLM-4.7, MiniMax-M2, and Kimi K2/K2.5 broaden the set of models that can drive tool-use workflows. Agentic frameworks that target OpenAI or Anthropic can now connect to Dynamo directly via new /v1/responses and /v1/messages endpoints, removing the need for adapter layers. Guided decoding now enforces JSON schema constraints on model output across vLLM and TensorRT-LLM, ensuring tool calls and function arguments are always valid structured data.
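To make the guided-decoding point concrete, here is a sketch of an OpenAI-style chat request constrained by a JSON schema. The `response_format`/`json_schema` field names follow the common OpenAI convention; the model name and schema are illustrative, not taken from Dynamo's documentation.

```python
import json

# Illustrative payload: ask the engine to constrain output to a JSON schema.
# Field layout follows the OpenAI `response_format` convention; consult the
# Dynamo docs for the exact supported shape per backend.
payload = {
    "model": "my-model",  # hypothetical model name
    "messages": [{"role": "user", "content": "Extract the city and country."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "location",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

body = json.dumps(payload)
print(json.loads(body)["response_format"]["type"])  # json_schema
```

With such a constraint in place, tool-call arguments come back as schema-valid JSON rather than free-form text that the caller must re-validate.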

Unified Configuration & Public API Stabilization

All backends (SGLang, TensorRT-LLM, vLLM) and core components (Frontend, Router, Planner) migrated from fragmented argparse flags to a typed, modular configuration system with validated base classes. The public Python API was streamlined—deprecated types like Component, Namespace, and CancellationToken were removed, and endpoint methods were consolidated. These changes make the SDK smaller, more consistent, and easier to maintain.
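The shape of such a typed configuration system can be sketched with a small dataclass that validates its fields on construction. The class and field names below are purely hypothetical illustrations of the pattern, not Dynamo's actual config classes.

```python
from dataclasses import dataclass

# Hypothetical sketch of a typed, validated config base class in the style
# described above; names are illustrative, not Dynamo's API.
@dataclass
class RouterConfig:
    block_size: int = 16
    kv_overlap_score_weight: float = 1.0

    def __post_init__(self):
        # Validation runs at construction time instead of deep inside argparse.
        if self.block_size <= 0:
            raise ValueError("block_size must be positive")
        if self.kv_overlap_score_weight < 0:
            raise ValueError("kv_overlap_score_weight must be non-negative")

cfg = RouterConfig(block_size=32)
print(cfg.block_size)  # 32
```

The benefit over scattered argparse flags is that invalid combinations fail fast with a typed error, and the same config object can be built from CLI args, YAML, or code.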

See Breaking Changes for migration details.

Kubernetes Production Readiness

Dynamo Operator matured with a v1beta1 DynamoGraphDeploymentRequest API (Preview in Dynamo v1.0.0), config versioning via ConfigMap injection, GPU auto-discovery migrated from Profiler to Operator, rolling updates for DGD worker deployments, and simplified CRD management. The EPP component introduced a decomposed pipeline to support Inference Gateway-based routing with pod-level traffic management. LoRA support expanded with routing-aware adapter placement, memory-aware allocation, and multimodal LoRA with Kubernetes deployment examples. Multiple new Kubernetes deployment recipes were added, including Kimi-K2.5, Qwen3-VL-30B-A3B-FP8, and Nemotron-3-Super-FP8.

Performance & Reliability

Dynamo Snapshot (Preview in Dynamo v1.0.0) enables fast GPU worker recovery via a portable DaemonSet using CRIU and cuda-checkpoint, now extended to SGLang. The Dynamo Planner adds load-based scaling and a new GlobalPlanner mode (Preview in Dynamo v1.0.0) that provides cross-deployment autoscaling for multiple models or deployments backing an endpoint. Observability was overhauled with standardized dynamo_router_* metrics, engine-level Prometheus metrics, OTel tracing for routing, and more robust Grafana dashboards.

Under the Hood

Two posts on the Dynamo Dev Blog give a closer look at some of the problems we've worked on:

  1. Flash Indexer: Inter-Galactic KV Routing traces six iterations of data structure design—from a Python dictionary to a concurrent positional index with jump search. The result: the Dynamo Router sustains 170M ops/s—42x faster than what we shipped in Dynamo v0.1.0 and enough to handle planetary-scale inference workloads (we think).
  2. Full-Stack Optimizations for Agentic Inference tackles the visibility gap between agent harnesses and inference stacks. Claude Code and Codex know what's urgent—but the inference engines handling the workloads didn't, until now. The new nvext.agent_hints API lets harnesses pass scheduling priority, cache retention, and speculative prefill hints directly to the engine.
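The release names the `nvext.agent_hints` extension for passing scheduling priority, cache retention, and speculative prefill hints; a request carrying such hints might look like the sketch below. The individual hint field names are assumptions for illustration only — check the API reference for the real schema.

```python
import json

# Hypothetical sketch of a request carrying nvext.agent_hints. The nesting
# under "nvext" matches the extension name in the blog post; the hint field
# names below are invented for illustration.
request = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Run the next tool step."}],
    "nvext": {
        "agent_hints": {
            "priority": 10,               # hypothetical: scheduling priority
            "retain_kv_cache": True,      # hypothetical: keep session KV blocks
            "speculative_prefill": True,  # hypothetical: prefill ahead of need
        }
    },
}
print(json.dumps(request["nvext"]["agent_hints"], sort_keys=True))
```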

Open-Source Contributions

Between v0.9.0 and v1.0.0, we merged over 700 commits from over 90 contributors — 34 first-time contributors and 19 external contributors from 12 organizations.

First-Time External Contributors

  • @devivasudevan (Microsoft) contributed a PR that adds Azure AKS storage guidance for Dynamo caches (#5581).
  • @maljazaery (Microsoft) contributed a PR that clarifies DGDSA creation for services is disabled by default (#6389).
  • @dsocek (Intel) contributed a PR that improves multimodal disaggregation reliability (#5895).
  • @muskansh-google (Google) contributed a PR that updates build commands for the Dynamo + SGLang container (#5908).
  • @InfraWhisperer (F5) contributed a PR that fixes a frontend crash when using the TRT-LLM runtime image (#6481).
  • @Kaonael (Gcore) contributed a PR that adds a status state enum to DynamoGraphDeployment for improved lifecycle tracking (#6324).
  • @Ryan-Amirthan (Fern) contributed a PR that adds standard NVIDIA Fern styling assets to the documentation site (#6148).
  • @bledden (Facilitair) contributed a PR that forwards stream_options through the multimodal request pipeline (#6474).
  • @advpropsys (WhiteCircle.ai) contributed a PR that reduces NATS consumer inactive threshold from 1 hour to 2 minutes to prevent stale connections (#5861).
  • @luc-hiverge (Hiverge) contributed a PR that fixes first token creation signal timing by emitting the signal after sleeping (#5681).
  • @orangeng contributed a PR that fixes the service name in port-forward documentation (#5527).
  • @huitianbai contributed a PR that limits bootstrap room ID range to 0–2^63-1 to prevent overflow (#6277).

First-Time NVIDIA Contributors

  • @knowicki-nvidia contributed a PR that adds image diffusion and text-to-image support for the SGLang backend (#5609).
  • @akshatha-k contributed a PR that restructures KVBM documentation into a three-tier format (#5905).
  • @alexanderbilk contributed a PR that adds a Prometheus port for NIXL telemetry metrics (#5567).
  • @rwipfelnv contributed a PR that adds Grafana dashboard and monitoring setup for observability (#4639).
  • @mikwieczorek contributed a PR that fixes TRT-LLM recipe component type from "main" to "worker" (#5788).
  • @jpohl-nv contributed a PR that adds experimental MJPEG video streaming via /v1/videos/stream (#6487).
  • @rafiw contributed a PR that adds Triton path environment variables to the vLLM runtime Dockerfile (#6401).

Returning External Contributors: @michaelfeil (Baseten), @vladnosiv (Yandex.Cloud), @Jont828 (Microsoft), @ashnamehrotra (Microsoft), @ls-2018, @AmeenP (PrimeIntellect), @kerthcet (InftyAI/Hiverge).

If you would like to get involved, please see our Contribution Guide.

Breaking Changes

ACTION REQUIRED: The following changes require updates to your code, configuration, or deployment manifests before upgrading.

CLI Flags and Environment Variables

  • KV Router Flags Renamed (#6361): All KV router CLI flags and env vars now use the --router-* / DYN_ROUTER_* prefix.
Old Flag / Env Var → New Flag / Env Var:

--kv-events / DYN_KV_EVENTS → --router-kv-events / DYN_ROUTER_USE_KV_EVENTS
--kv-overlap-score-weight / DYN_KV_OVERLAP_SCORE_WEIGHT → --router-kv-overlap-score-weight / DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT
--assume-kv-reuse / DYN_ASSUME_KV_REUSE → --router-assume-kv-reuse / DYN_ROUTER_ASSUME_KV_REUSE
--durable-kv-events / DYN_DURABLE_KV_EVENTS → --router-durable-kv-events / DYN_ROUTER_DURABLE_KV_EVENTS
--track-active-blocks / DYN_TRACK_ACTIVE_BLOCKS → --router-track-active-blocks / DYN_ROUTER_TRACK_ACTIVE_BLOCKS
--track-output-blocks → --router-track-output-blocks
--router-ttl / DYN_ROUTER_TTL → --router-ttl-secs / DYN_ROUTER_TTL_SECS

Migrate: Update all CLI invocations, env vars, and deployment YAMLs to use the new names.
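For deployments that template env vars programmatically, the renames above can be applied mechanically. The sketch below is a throwaway migration helper; the mapping is taken directly from the table, and the helper function itself is not part of Dynamo.

```python
# Throwaway helper: rewrite old router env var names to their new
# DYN_ROUTER_* forms. The mapping mirrors the rename table above.
RENAMES = {
    "DYN_KV_EVENTS": "DYN_ROUTER_USE_KV_EVENTS",
    "DYN_KV_OVERLAP_SCORE_WEIGHT": "DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT",
    "DYN_ASSUME_KV_REUSE": "DYN_ROUTER_ASSUME_KV_REUSE",
    "DYN_DURABLE_KV_EVENTS": "DYN_ROUTER_DURABLE_KV_EVENTS",
    "DYN_TRACK_ACTIVE_BLOCKS": "DYN_ROUTER_TRACK_ACTIVE_BLOCKS",
    "DYN_ROUTER_TTL": "DYN_ROUTER_TTL_SECS",
}

def migrate_env(env: dict) -> dict:
    """Return a copy of `env` with renamed keys; unrelated keys pass through."""
    return {RENAMES.get(k, k): v for k, v in env.items()}

old = {"DYN_KV_EVENTS": "true", "UNRELATED": "1"}
print(migrate_env(old))  # {'DYN_ROUTER_USE_KV_EVENTS': 'true', 'UNRELATED': '1'}
```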

  • Disagg Flag Inverted (#6515): --enforce-disagg replaced by --decode-fallback with inverted semantics — disaggregated mode is now enforced by default.

    Migrate: Replace --enforce-disagg with --decode-fallback. If you need fallback to aggregated mode, explicitly pass --decode-fallback or DYN_DECODE_FALLBACK=true. In the EPP plugin, update from DYN_ENFORCE_DISAGG to DYN_DECODE_FALLBACK with inverted boolean.

  • Migration Limit Moved to Frontend (#5918): The --migration-limit CLI flag has been removed from all backend workers (vLLM, SGLang, TRT-LLM) and is now set on the Frontend only.

    Migrate: Remove --migration-limit from backend launch commands; pass it to the Frontend instead.

  • Connector Flag Replaced (#6450): The --connector flag is removed. Disaggregated prefill workers now require explicit --kv-transfer-config with a JSON value.

    Migrate: Replace --connector nixl with --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'. Update all deployment YAMLs and launch scripts accordingly.
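Because the new flag takes a JSON value, a malformed string fails at worker startup; it can be worth sanity-checking the value before baking it into launch scripts. The value below is the one from the migration note; the check itself is just ordinary JSON parsing.

```python
import json

# Sanity-check the JSON passed to --kv-transfer-config before putting it
# into a launch script or deployment YAML.
value = '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
cfg = json.loads(value)  # raises ValueError if the string is malformed
assert cfg["kv_connector"] == "NixlConnector"
print(cfg)
```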

  • KV Events Now Opt-In (#6404): KV cache events are no longer auto-created when prefix caching is enabled. Users must explicitly opt in via --kv-events-config.

    Migrate: Add --kv-events-config '{"publisher":"zmq","endpoint":"tcp://*:20080","enable_kv_cache_events":true}' to worker launch commands. Replace DYN_VLLM_KV_EVENT_PORT env var with the CLI flag.

  • Local Indexer Now Default (#5941, #6073): Default event transport changed from JetStream to NATS Core/Event Plane with Local Indexer. The --enable-local-indexer flag is removed.

    Migrate: If you relied on JetStream persistence, add --durable-kv-events on both frontend and all workers. Remove any --enable-local-indexer flags.

  • Omni Flags Prefixed (#6476): 14 diffusion/omni CLI flags renamed with --omni- prefix (e.g., --enforce-eager → --omni-enforce-eager).

    Migrate: Update CLI invocations to use the --omni- prefixed names.

  • Multimodal Worker Flag Removed (#6060): --multimodal-encode-prefill-worker removed from vLLM backend.

    Migrate: Use --multimodal-encode-worker, --multimodal-worker, or --multimodal-decode-worker instead.

  • Output Modalities Required (#6270): vLLM omni mode no longer auto-registers image endpoints; you must pass --output-modalities image explicitly.

  • Media URL Flags Unified (#6391): SGLang/TRT-LLM flags --image-diffusion-fs-url, --video-generation-fs-url, and --output-dir replaced by --media-fs-url and --media-base-url.

  • Discovery Backend Simplified (#6167): DYN_DISCOVERY_BACKEND now accepts kubernetes, etcd, file, mem directly. Remove DYN_KV_STORE; replace --store-kv with --discovery-backend.

  • Planner CLI Replaced by Config File (#6356): All individual Planner CLI flags removed in favor of --config <path> pointing to a JSON/YAML configuration file.

  • dynamo-run Removed (#6203): The dynamo-run CLI tool and all its flags have been removed. Migrate to the Python-based deployment approach.

  • Env Var Renames (#6358, #5882):

Old → New:

DYNAMO_FATBIN_PATH → DYN_FATBIN_PATH
ENABLE_KVBM_RECORD → DYN_KVBM_ENABLE_RECORD
SPLIT_ENCODE → DYN_SPLIT_ENCODE
DYNAMO_BUSY_THRESHOLD → DYN_BUSY_THRESHOLD
DYNAMO_* (EPP vars) → DYN_*

Prometheus Metrics

  • KVStats Metrics Removed (#5704): dynamo_component_kvstats_* metrics removed. Use dynamo_frontend_inter_token_latency_seconds for Decode autoscaling instead of kvstats_gpu_cache_usage_percent.
  • Router Metrics Namespace (#6227): dynamo_frontend_worker_active_* → dynamo_router_worker_active_*, dynamo_component_router_* → dynamo_router_*. New router_id label added to all Router metrics.
  • Frontend Request Counter Label (#5568): dynamo_frontend_requests_total now includes an error_type label. Update PromQL queries to account for the new label.
  • SGLang Metric Prefix (#5701): SGLang metrics now use the native sglang: prefix (colon) instead of sglang_ (underscore).
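As one example of adapting to the new error_type label on the frontend request counter: a query that previously summed the metric as a single series can aggregate the label away to recover the old total. This is a generic PromQL pattern applied to the metric name from the note above, not a query shipped with Dynamo.

```promql
# Total request rate regardless of error classification,
# aggregating away the new error_type label
sum without (error_type) (rate(dynamo_frontend_requests_total[5m]))
```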

Kubernetes

  • etcd Subchart Disabled (#6329): Bundled etcd is now disabled by default. Set global.etcd.install: true if your deployment depends on it.
  • Webhook Key Removed (#6441): webhook.enabled removed from Helm values. Remove it from custom values files.
  • Helm Values Restructured for Snapshot (#5946): storage.signalHostPath, daemonset.criu.*, and daemonset.containerRuntimeSocket replaced by the config.checkpoint.* and config.agent.* hierarchy.
  • DGDR Planner Schema (#6463): FeaturesSpec.planner in DGDR CRD changed from a typed PlannerSpec to the PlannerConfig JSON schema. Review DGDR manifests that set features.planner.
  • EPP Discovery Timeout (#5770): DYN_DISCOVERY_TIMEOUT_SEC no longer works. Use StartupProbe failureThreshold × periodSeconds instead.
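Since the discovery timeout is now implied by the StartupProbe, the effective window is simply failureThreshold × periodSeconds. The probe values below are hypothetical; choose them so the product matches your previous DYN_DISCOVERY_TIMEOUT_SEC.

```python
# Effective discovery window = failureThreshold * periodSeconds.
# Values are hypothetical examples, not recommended defaults.
failure_threshold = 30  # StartupProbe failureThreshold
period_seconds = 10     # StartupProbe periodSeconds
effective_timeout_sec = failure_threshold * period_seconds
print(effective_timeout_sec)  # 300
```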

Python SDK

  • Component/Namespace/CancellationToken Removed (#6403, #6386, #6405): Component, Namespace, and CancellationToken classes removed from the Python API.

    Migrate: Replace runtime.namespace('ns').component('comp').endpoint('ep') with runtime.endpoint('ns.comp.ep'). Replace token.cancel() with HttpService.shutdown(). Pass DistributedRuntime directly to service .run() methods.
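The chained builder is thus replaced by a single dot-separated address, as in runtime.endpoint('ns.comp.ep'). A toy helper for assembling such addresses from the old three parts is shown below; it is illustrative only and assumes segments themselves may not contain dots.

```python
# Toy helper: build the dotted endpoint address used by the new API from
# the namespace/component/endpoint triple of the old builder chain.
def endpoint_path(namespace: str, component: str, endpoint: str) -> str:
    for part in (namespace, component, endpoint):
        # Assumption: segments are non-empty and dot-free, since "." delimits.
        if not part or "." in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return f"{namespace}.{component}.{endpoint}"

print(endpoint_path("ns", "comp", "ep"))  # ns.comp.ep
```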

API Renames and Moves:

  • client2(router_mode) → client(router_mode=router_mode) (#6158)
  • register_llm / unregister_llm / fetch_llm → register_model / unregister_model / fetch_model (#6268)
  • ModelDeploymentCard in dynamo.runtime → moved to dynamo._internal (#6378)
  • EncoderCacheManager → MultimodalEmbeddingCacheManager in dynamo.common.memory (#5962)
  • KvPushRouter / ZmqKvEventPublisherConfig → KvRouter; pass zmq_endpoint/zmq_topic directly to KvEventPublisher() (#6238)
  • ZmqKvEventPublisher → KvEventPublisher(component, zmq_config=config) (#6016)
  • DYNAMO_ARGS from dynamo.sglang.args → DynamoSGLangArgGroup from dynamo.sglang.backend_args (#6280)
  • Config from dynamo.trtllm.utils.trtllm_utils / create_worker(...) → Config in dynamo.trtllm.args / create_llm_worker(...) (#6297)

Behavior Changes:

  • Frontend Config Refactored (#6201): Frontend CLI now rejects unknown args unless --chat-processor vllm is set.
  • ModelManager Checksum Enforcement (#6054): Mismatched MDC checksums across WorkerSets now raise ChecksumMismatch instead of being silently accepted.
  • Tool Call Parser Separation (#5849): --tool-call-parser alone no longer uses Dynamo's parser. Use --dyn-tool-call-parser for Dynamo's pipeline.
  • Custom Backend Metrics Removed (#5893): custom_backend_metrics_endpoint and custom_backend_metrics_polling_interval removed from LocalModel and frontend config.

Deprecated Components & Features

  • Deprecated Component Removals: Removed dynamo-run and mistral-rs engine (#6203), standalone FastAPI Router (#5845), media-nixl feature (#5940), and llava-hf recipes (#6961).

Notable Behavioral Changes

  • Local Indexers On By Default (#5941): KV event transport now defaults to NATS Core/Event Plane with Local Indexer instead of JetStream. Pass --durable-kv-events on both frontend and workers to restore JetStream behavior.
  • GPU Memory Utilization (#5755): gpu-memory-utilization adjusted for vLLM runtime to improve out-of-the-box performance.
  • Operator Env Vars Documented (#6548): All environment variables injected by the Operator are now documented.

Deprecated Assets

  • dynamo-crds Helm Chart: The standalone dynamo-crds Helm chart is deprecated. CRDs are now embedded in the Dynamo Operator image and applied automatically via an init container on the operator Deployment (#6466, #6780). Users should uninstall the dynamo-crds Helm release; the operator manages CRD lifecycle directly.

Future Deprecations

The following features still work but will be removed in a future release, most targeted for Dynamo v1.1.0.

  • v1alpha1 DGDR API (#6352): The v1alpha1 DynamoGraphDeploymentRequest API will be removed in a future release. Migrate to v1beta1; automatic conversion maintains backward compatibility during the transition.
  • enableGpuDiscovery CRD Field (#6224): The enableGpuDiscovery CRD field no longer has any effect and will be removed in a future release. GPU discovery now runs automatically.
  • ComponentName Field (#6110): The ComponentName field on ServiceReplicaStatus will be removed in a future release. Migrate to the new ComponentNames list field.
  • Router Legacy Flag Names (#6346): Router CLI flags without the --router- prefix (e.g., --block-size, --kv-events) will be removed in a future release. Migrate to the prefixed versions (--router-block-size, --router-kv-events).
  • vLLM KV Auto-Enable (#6404): vLLM's auto-enabling of KV events when prefix caching is active will be removed in a future release. Use --kv-events-config explicitly instead.
  • Prefill/Decode Worker Flags (#6483): The --is-prefill-worker and --is-decode-worker boolean flags for the vLLM backend will be removed in a future release. Migrate to --disaggregation-mode.
  • Durable KV Events (#6477): The --router-durable-kv-events CLI flag will be removed in a future release. Migrate to the event-plane subscriber (local_indexer mode).

Features & Improvements

Multimodal & Diffusion

  • Encoder Cache Infrastructure: Implemented EncoderCacheManager with async support for caching multimodal encoder outputs (#5632, #5676) and content-addressed hashing for TensorRT-LLM (#5715).
  • TensorRT-LLM Encoder Cache: Integrated encoder cache into TensorRT-LLM PrefillHandler (#5714), EPD workflow (#5780), and E/PD disaggregated workflow (#5815) for cross-worker multimodal reuse.
  • vLLM Embedding Cache: Added embedding cache to PD workers (#6029, #6061) and aggregated vLLM nodes (#6153) for reusing multimodal embeddings across requests.
  • Text-to-Image Generation: Added text-to-image support via vLLM Omni pipeline (#5608, #5912) and SGLang image diffusion (#5609).
  • Text-to-Video Generation: Added text-to-video support via SGLang T2V (#5793), vLLM Omni pipeline (#6104), and TensorRT-LLM Wan T2V (#5926), with experimental MJPEG video streaming via /v1/videos/stream (#6487).
  • Multimodal Embedding Transfer: Added embedding transfer sender and receiver for cross-worker multimodal data movement (#6098), adopted transfer classes for the EPD pipeline (#6223), and optimized by keeping embeddings on GPU in the Embedding Sender (#6535).
  • Multimodal-Aware KV Cache Routing: Implemented multimodal-aware request routing for vLLM (#6235) and end-to-end multimodal KV cache routing for TensorRT-LLM (#5480), optimizing request placement based on media content.
  • TensorRT-LLM Multimodal Preprocessor: Added TensorRT-LLM multimodal preprocessor with backend media decoding (#5910).
  • vLLM Frontend Media Decoding: Enabled vLLM backend with frontend media decoding for end-to-end multimodal serving (#5781).
  • Batch Image Processing: Added batch image processing in encode worker and Qwen3 model support (#6021).
  • SGLang MMEncoder in EPD: Integrated SGLang MMEncoder for multimodal EPD encode worker pipeline (#6162).
  • Multimodal Model Support: Improved multimodal disaggregation reliability with Qwen2.5 VL 32B support (#5895) and added Qwen3-VL-30B-A3B support for EPD pipeline (#6533).
  • NIXL WRITE Embedding Transfer: Added NIXL WRITE initiation for cross-node multimodal embedding transfer (#6776).
  • vLLM Omni Container: Installed vllm-omni in vLLM container for visual generation support (#6458).

Frontend & Agents

  • Reasoning and Tool Call Parsers: Added reasoning content management for DeepSeek v3.2, GLM-4.7, and Kimi-2.5 (#6107), interleaved thinking support (#6422), and new tool/reasoning parsers for GLM-4.7 (#5897), MiniMax-M2 (#6294), and Kimi K2/K2.5 (#6407).
  • Responses API Compliance: Implemented Responses API compliance with upstream type alignment for spec conformance (#6089).
  • Anthropic Messages Endpoint: Added Anthropic Messages API endpoint (/v1/messages) for cross-provider compatibility (#6231).
  • Tiktoken Support: Added Tiktoken tokenizer support for models requiring Tiktoken encoding (#6460).
  • vLLM Chat Path Optimization: Reduced Python-side overhead in the vLLM chat path for lower latency (#6437).
  • vLLM Pre/Post Processing: Adopted vLLM for pre- and post-processing in the Frontend for consistency (#5544).
  • Dynamic gRPC Startup: Made gRPC startup dynamic for high ISL/OSL scenarios in the gRPC Frontend (#5536).

Kubernetes Deployment

DynamoGraphDeployment Request

  • DGDR Deployment Guide: Added comprehensive Kubernetes deployment guide for DynamoGraphDeploymentRequest (DGDR) workflows covering the golden path from model selection through profiling and autoscaling (#7304).
  • DGDR API Maturation: Added structured .status.state enums for DGD (#6324) and DGDR (#6396), observedGeneration for reconciliation tracking (#6398), introduced the v1beta1 DGDR API with automatic conversion from v1alpha1 (#6352), and adopted v1beta1 in the controller (#6498).
  • Model/WorkerSet Architecture: Introduced hierarchical Model/WorkerSet architecture for multi-namespace support (#6054).
  • Rolling Updates: Implemented managed rolling updates for DGD worker deployments (#6110).
  • DGD Print Columns: Added print columns with ready condition for v1alpha1 API types like DGD (#5542).
  • Optional DGDR Image Field: Made the image field optional in DGDRs for flexible container configuration (#6557).
  • Operator Version in DGD: Included Operator version in DGD for version tracking in cluster state (#6121).
  • AIC DGD Generation: Enabled AIC DGD generation call for automated infrastructure configuration (#6216).
  • Profiler Job Overrides: Added profiler job overrides for customizable profiling runs (#6641).

Dynamo Snapshot

  • Dynamo Snapshot: Introduced Dynamo Snapshot for fast GPU worker recovery (#4978, #7068), refactored configuration with /dev/shm support and mount-policy rewrite (#5946), added external restore with signal-based IPC (#6286), and extended to the SGLang backend (#6594).

Gateway API Inference Endpoint (GAIE)

  • EPP Integration: Added the EPP component for Kubernetes Gateway API-based inference routing (#5611), implemented the decomposed pipeline for flexible routing stages (#5446), added startup probe for reliable liveness detection (#5770), and enabled the EPP pods interface for pod-level traffic management (#6302).

Dynamo Operator

  • Operator Management Improvements: Implemented config versioning via ConfigMap injection (#6464), simplified CRD management (#6466), reduced Helm chart dependencies (#7048), and replaced kube-rbac-proxy with controller-runtime authorization (#7069).
  • GPU Discovery Migration: Migrated GPU discovery from Dynamo Profiler to Operator with automatic injection (#6224).
  • Namespace-Scoped GPU Discovery: Added optional GPU discovery for namespace-scoped Operators (#6343).
  • Tolerations and Affinity Support: Added tolerations and affinity support for all platform Helm chart components (#5561).
  • Rolling Updates Documentation: Added documentation for Operator rolling updates (#6541).

Scheduling

Router

  • Data-Parallel Routing: Added per-DP-rank gap detection (#5873), TensorRT-LLM DP rank routing (#5936), and RNG tiebreaking for DP routing targets (#6253) for improved data-parallel load distribution.
  • Router Priority Queue: Implemented request priority queue in the Router (#6010) and plumbed priority through SGLang and vLLM handlers for end-to-end support (#6348).
  • Global Router: Added global Router for hierarchical Planner topology (#5697).
  • Global Router + vLLM Example: Added DGD example for global Router + vLLM deployment (#5760).
  • Expert Routing Info: Enabled returning routed experts info through SGLang for expert-parallel routing visibility (#6137).
  • Prefill Tokens Threshold: Added prefill tokens threshold based on max batched tokens fraction for adaptive batching (#5867).
  • Default Event Threads: Defaulted router_event_threads to 4 for improved Router throughput (#6724).

Planner

  • Planner Autoscaling: Added GlobalPlanner component for centralized cross-cluster scaling (#5702), implemented load-based scaling in SLA Planner (#6145), added throughput metrics source for disaggregated scaling decisions (#6500), and moved core logic from DPP to AIC with static profiling support (#6285).
  • Planner P/D Separation: Separated Planner into independent prefill/decode Planners (#5622) and automated resource allocation by deriving GPU counts (#5919) and worker counts (#5934) from DGD status.
  • Planner Config Migration: Migrated Planner from argparse CLI to config file for unified configuration (#6356).
  • Planner Schema in DGDR: Added Planner schema to DGDR and Profiler input for configuration consistency (#6463).
  • Profiler Model Validation: Removed default model name in Profiler and added validation for served model name or path (#5950).

KV Block Manager

  • Speculative Prefill: Implemented speculative prefill for proactive KV cache population (#6230).
  • Flash Indexer Optimizations: Optimized flash indexer performance for faster KV cache prefix lookups (#6305).
  • Standalone KV Indexer: Added standalone KV indexer with query endpoint for decoupled prefix matching (#6446).
  • KVBM Priority Offload: Implemented priority-based KV cache offload filtering (#5563) and optimized by reading the priority env var once at init (#5798).
  • KVBM Logical Abstraction: Introduced KVBM-logical abstraction layer for flexible KV block management (#6033).
  • Nested KV Index Mapper: Implemented nested mapper for KV indexing to support hierarchical prefix matching (#5785).
  • KVBM Memory Enhancements: Added KVBM memory management enhancements for improved allocation and lifecycle (#5532).
  • Default KVBM Enablement: Enabled lib/memory, media-nixl, and KVBM by default for out-of-the-box disaggregated serving (#5602).
  • KVBM Kernels Crate: Added kvbm-kernels crate and upgraded cudarc to 0.19 for GPU kernel support (#6309).
  • NVTX Annotations: Added NVTX annotations to KVBM for GPU profiling visibility (#6334).
  • Default KV Events Config: Defaulted kv-events-config to empty to align with vLLM defaults (#6404).
  • KV Hit Rate Histogram: Exposed predicted KV hit rate as Prometheus histogram for cache efficiency monitoring (#6507).
  • Mocker KV Cache Tracing: Added optional KV cache allocation/eviction tracing (#6052, #6207), KV transfer latency simulation for disaggregated benchmarks (#6504), and ZMQ-based KV event publishing (#6528) to the mocker.

LoRA Support

  • LoRA Routing and Allocation: Added LoRA-aware routing hints and tracking (#5875), memory-aware load estimation (#5880), HRW-based optimal adapter allocation (#5992), and LoRA-aware KV cache events (#6517).
  • Multimodal LoRA: Extended LoRA support to multimodal workloads with protocol-level model identification (#6382), request handling for multimodal workers (#6399), and deployment examples for local (#6400) and Kubernetes (#6452).

Infrastructure Modernization

  • Unified Configuration System: Introduced a unified configuration system with typed base classes (#5975) and migrated vLLM (#6075) and Frontend CLI (#6201) to the new system.
  • Configuration System Migration: Migrated SGLang (#6280), TensorRT-LLM (#6297), global Router (#6342), and Router (#6346) to the unified configuration system.
  • Go-to-Definition Support: Enabled go-to-definition for dynamo.runtime, dynamo.nixl, and external dependencies (#6026).
  • Standardized Error Type: Introduced standardized Dynamo error type for consistent error handling across the stack (#6303).
  • AIPerf Client Rate Control: Added --request-rate and --request-rate-mode flags to the AIPerf client for flexible load testing (#6585).
  • Disaggregation Mode Enum: Added --disaggregation-mode enum to vLLM backend for explicit mode selection (#6483).
  • vLLM Endpoint Flag: Added --endpoint flag support to dynamo.vllm for flexible serving configuration (#6360).

Performance

  • Mocker Performance: Improved mocker with model pre-fetching, staggered launches, and timing accuracy (#5871, #5808, #6100), and modularized the crate into common/scheduler/kv_manager/cache modules (#6440).

SGLang

  • SGLang GPU Memory Service: Integrated SGLang with GPU Memory Service for unified memory management (#5664).
  • SGLang Request Migration: Implemented request migration for SGLang to support live request handoff (#5659).
  • SGLang Weight Update Endpoints: Added SGLang /engine weight update endpoints for online model updates (#6094).

TensorRT-LLM

  • TensorRT-LLM Guided Decoding: Added guided decoding backend config and choice support for TensorRT-LLM (#5762).
  • CUDA IPC for TensorRT-LLM: Introduced CUDA IPC for TensorRT-LLM PrefillHandler enabling zero-copy cross-process transfers (#5773).
  • NixlConnector Config: Added --kv-transfer-config NixlConnector to disaggregated scripts and recipes (#6560).

vLLM

  • vLLM Multi-Node Multiprocessing: Adopted vLLM multiprocessing in multi-node scenarios for improved parallelism (#6191).
  • Headless Multi-Node Mode: Added --headless mode for multi-node TP/PP in dynamo.vllm for worker-only deployments (#6204).
  • ModelExpress P2P Weight Transfer: Enabled ModelExpress P2P weight transfer in Dynamo vLLM worker for faster model loading (#6186).

Fault Tolerance & Observability

  • Router Metrics and Tracing: Added per-worker load monitoring (#5842), centralized Router-level request tracking (#6146), standardized all Router metrics under the dynamo_router_* namespace (#6227), and added OTel tracing for routing overheads (#6194).
  • Engine Prometheus Metrics: Exposed Python-level engine metrics via LLMComponentMetrics (#5817), added auto/custom label injection (#5989), introduced tokenizer (#6092) and detokenization (#6160) latency metrics, and exposed TensorRT-LLM kv_cache metrics (#6469).
  • NIXL Telemetry Port: Added NIXL Telemetry Prometheus port for transfer library monitoring (#5567).
  • Error Type Metric Label: Added error_type label to request metrics for fine-grained error classification (#5568).
  • Grafana Dashboard: Added Grafana dashboard and monitoring setup for comprehensive observability (#4639).
  • NIXL Sanity Check: Added NIXL availability check to sanity_check for environment validation (#6087).
  • Graceful Shutdown Draining: Enabled backends to accept new requests during shutdown grace period for graceful draining (#6093).

Recipes

  • GB200 Disagg Recipe: Added GB200 GPT-oss disaggregated serving recipe for next-gen hardware support (#4954).
  • DeepSeek V3.2 Recipe: Added DeepSeek V3.2 TensorRT-LLM recipe for optimized serving (#6969).
  • Qwen3-VL-30B Recipe: Added Qwen3-VL-30B recipe for aggregated and encoder cache deployment with vLLM (#7191).

Bug Fixes

Multimodal

  • Multimodal Disaggregated Serving: Fixed multiple reliability issues in multimodal prefill/decode disaggregated serving and restored EPD pipeline on single-GPU (#5951, #6753, #6978).
  • Multimodal Input Processing: Fixed multimodal input loader blocking the async event loop, PSD file crash in the image pipeline, and vLLM OmniModel image processing performance (#5945, #6212, #6451).
  • Multimodal API and Stream Handling: Fixed stream_options forwarding through the multimodal request pipeline, CLI flag collisions with --omni- prefixes, and normalize_finish_reason on the OmniHandler (#6474, #6476, #6896).
  • Multimodal Cross-Node Transfer: Fixed encode + prefill/decode flow in TensorRT-LLM for multimodal embedding transfer (#6790).
  • Multimodal Video and Audio: Fixed vLLM chat processor to correctly handle video and audio inputs and resolved invalid UUID errors from empty multimodal inputs (#6708, #6904).
  • Multimodal Router Performance: Fixed duplicate image downloads and unnecessary image processing in the multimodal Router for vLLM, reducing latency for repeated media content (#7172).
  • Multimodal Pipeline Fixes: Fixed multiple minor issues in the vLLM multimodal pipeline, worker service registration collisions, Llama 4 aggregated multimodal launch script, and LLaVA model EPD support (#5748, #5986, #6103, #6765).

Frontend & Agents

  • LoRA Endpoint Reliability: Fixed LoRA load/unload endpoints silently swallowing errors and extended S3 download timeouts to prevent failures with large adapter files (#5626, #6986).
  • Request Sampling Parameters: Fixed request sampling parameters not being forwarded to backend workers, causing generation settings to be silently ignored (#5797).
  • Reasoning Token Handling: Fixed reasoning parser propagation from worker runtime config, interleaved reasoning content ordering, and reasoning content being dropped when a tool-call starts mid-stream (#6300, #6442, #7051).
  • Chat Template and Model Fixes: Fixed the DeepSeek V3.2 chat template for function calling and structured output, fixed the Nemotron Nano model to use the correct reasoning parser (#6034, #6288), and added force_nonempty_content for Nemotron models (#7225).
  • Frontend Stability: Fixed HTTP request cancellation using a temporary token instead of the real cancellation token, and fixed a Frontend crash when running with the TensorRT-LLM runtime image (#6344, #6481).
  • Model Endpoint Correctness: Fixed /v1/models endpoint exposing inactive models and model name resolution to prefer --served-model-name (#5881, #7021).
  • Responses API Compatibility: Fixed Responses API rejecting valid assistant output_text messages that lacked id/status fields (#7049).
  • vLLM Processor Compatibility: Fixed vLLM processor compatibility with vLLM 0.16 API changes and incorrect output when stream_interval is greater than 1 (#6873, #6874).
  • Prompt Length Validation: Fixed missing validation for prompts exceeding max_seq_len, now returning HTTP 400 instead of silently failing (#6997).
  • Guided Grammar Depth Limit: Fixed guided grammar to reject schemas with excessive nesting depth, preventing potential resource exhaustion (#7135).

Kubernetes Deployment

DynamoGraphDeploymentRequest

  • DGD/DGDR Configuration: Fixed DGD cross-selection, fallback for missing subComponentType, service name length validation for DNS compliance, name sanitization for DNS-1035, DGDR prefix for naive fallback (#5449, #6113, #6317, #7062, #6679), and stripped apiVersion/kind/metadata from overrides.dgd before merging (#7121).
  • Operator Override Ordering: Fixed DGD overrides to apply before running interpolation, ensuring tolerations propagate correctly (#7226).

Dynamo Snapshot

  • Snapshot Checkpoint/Restore: Fixed Snapshot checkpoint failure handling to use SIGKILL, multi-GPU UUID mapping, restore to correctly pass the checkpoint path (#6478, #6492, #7018), and snapshot children before process group kill to prevent GPU memory leaks (#7232).

Dynamo Operator

  • Helm Chart Reliability: Disabled etcd subchart by default, restored Helm docs autogeneration, and reverted a template change that caused deployment failures (#5739, #6329, #6459).
  • Operator Stability: Fixed restart state tracking for parallel restarts, DynamoComponentReady condition updates, imagePullPolicy application, etcd cleanup logic, and consolidated discovery backend configuration (#4821, #5051, #5949, #6263, #6167).
  • Multi-Node Deployment Fixes: Fixed SSH setup for TensorRT-LLM multi-node workers, unquoted mpirun and Ray leader arguments that caused multi-node failures, and added nodeSelector support (#6225, #6248, #6711).
  • Operator GPU Discovery and Tolerations: Fixed GPU discovery preflight job, correct storage of GPU-equipped nodes, propagation of tolerations with auto-discovered GPU limits, and PVC block emission in configmap (#6640, #6714, #6979, #6755).
  • Operator CRD and API Configuration: Fixed CRD validation for nil/empty containers, AutoApply field type for proper nil handling, webhook version matching for v1alpha1 DGDR, annotation propagation, EPP config plugin weight support (#6255, #6712, #6808, #6718, #6783), and allowed x-kubernetes-preserve-unknown-fields in CRD validation (#7128).

Scheduling

Router

  • Router Startup Race Condition: Fixed race condition between worker discovery and runtime config discovery in the KV Router that caused routing failures on startup (#5924).
  • Router Stream Panics: Fixed stream handling in the Router that caused panics when polling after stream termination (#5872).
  • Router Data-Parallel Routing: Fixed Router to correctly pass the data-parallel rank into the vLLM engine and corrected KV Router discovery name derivation (#6014, #6475).
  • Router Scheduling Backpressure: Fixed scheduling by folding it into the queue so backpressure propagates correctly (#6470).
  • Router Metrics Collection: Fixed RouterRequestMetrics availability to ensure Router metrics are always collected (#6558).

Planner

  • Profiler Timeout and Crash Fixes: Fixed profiler deployment timeout handling for large MoE models and config generation to strip None arguments that caused crashes (#6086, #6887).
  • Profiler DGDR Validation: Fixed DGDR validator and DGD generation in the profiler, improved service name logging (#6876, #6112), and fixed profiling condition updates to populate results and clear phase after completion (#7195).
  • Planner CLI Configuration: Fixed disagg_planner.yaml and Planner test configs to use the updated CLI format (#6775, #7041, #7042).
  • Planner Backend Resolution: Fixed propagation of resolved backend and skipped interpolation for aggregated deployments (#7142).
  • Profiler TTFT/ITL Default Handling: Fixed Profiler validation error by using model_fields_set to distinguish TTFT/ITL default usage (#6827).

KV Block Manager

  • KV Cache Sleep/Wake Stability: Fixed KV cache block allocation signal after sleep/wake cycles and CUDA synchronization race conditions during GPU memory transitions (#5681, #5759).
  • KV Event Propagation and Block Management: Fixed KV event propagation for data-parallel multi-node deployments and KVBM to read block size from vLLM at runtime instead of using a hardcoded value (#5589, #5713, #5851).
  • KV Cache Memory Leak: Fixed memory leak where KV cache blocks were not freed on stream drop (#6246).
  • GMS Reliability: Fixed GMS CLI startup failure, removed unnecessary CUDA synchronize calls that degraded performance, and fixed GMS socket UUID resolution via the CUDA driver API (#5749, #6362, #6914).
  • KVBM CUDA Device Handling: Fixed PinnedAllocator to use the correct device_id, KVBM to respect CUDA_VISIBLE_DEVICES for NUMA binding, device_blocks double-counting in the TensorRT-LLM connector, and added authorization guards to memory occupation control endpoints (#6877, #6950, #6406, #7023).

Performance

SGLang

  • SGLang Metrics and Monitoring: Fixed metrics prefix format from sglang_ to sglang: and TokenizerMetricsCollector lazy-import to avoid collector registration errors (#5701, #6269).
  • SGLang Configuration Fixes: Fixed tool-call-parser flags to prevent configuration conflicts and DeepSeek-R1 recipe with watchdog timeout to prevent hangs (#5849, #6076).
  • SGLang Decode Handler: Fixed decode handler to ignore empty non-final stream chunks (#6304).
  • SGLang Build and API: Fixed container build conflict by removing python3-blinker and corrected multimodal item keys in the SGLang API (#5995, #5981).

TensorRT-LLM

  • TensorRT-LLM Stability: Fixed decode worker stability by temporarily disabling request cancellation and eliminated crashes caused by unsafe abort() calls (#5764, #5827).
  • TensorRT-LLM Multimodal Support: Fixed multimodal flag being silently ignored, multimodal hash support for TRT-LLM 1.3 apply_mm_hashes API, skipped encoder LLM creation for unsupported models (#6468, #6907, #6918), and fixed the multimodal preprocessor after the initial approach was reverted (#6920, #6993).
  • TensorRT-LLM Guided Decoding: Fixed handler to properly convert guided decoding dictionaries to GuidedDecodingParams (#6127).
  • TensorRT-LLM Multi-Node Deployment: Fixed multi-node worker SSH crash in non-root containers and removed deprecated beam_width parameter from health check (#6772, #6890).

vLLM

  • vLLM Worker Stability: Fixed worker graceful shutdown to prevent orphaned processes, decode worker logging format that caused CrashLoopBackOff, and worker registration for external/hybrid load balancing (#5818, #6267, #6833).
  • vLLM Disaggregated Serving: Fixed disaggregated serving by adding missing --is-decode-worker and --kv-transfer-config flags (#5843, #6554).
  • vLLM Launch Script Fixes: Fixed DeepSeek-R1 recipe checkpoint path, removed an unnecessary bash wrapper, and corrected launch scripts for disaggregated and speculative decoding (#5721, #6035, #6562).
  • vLLM Stream Handling: Fixed sampling parameter parsing in the EPD flow (#5813).
  • vLLM Performance Configuration: Fixed Docker image to use the CUDA sampler for better performance and corrected engine stats logging (#5613, #6566).
  • vLLM Multi-Worker Port Collisions: Fixed HTTP port collisions when multiple workers share a process (#7185).

Build & Container

  • Runtime Image Fixes: Fixed missing native libraries (nvlink, UCX, NIXL, CRT, Triton paths), corrected image tags across SGLang, TensorRT-LLM, and vLLM Dockerfiles (#6503, #6521, #6958, #6983, #6401), and updated UCX reference for performance (#7218).
  • TensorRT-LLM Dependency Fixes: Fixed missing msgpack dependency and pinned pydantic-settings below 2.13.0 for compatibility (#5799, #6339).
  • Build System Fixes: Fixed cross-platform NUMA module compilation, ai-dynamo-runtime wheel packaging to exclude NIXL shared libraries, CI container GIT_COMMIT_SHA population, and disabled media-ffmpeg feature by default (#6354, #6881, #7016, #6574).

Other

  • Core Infrastructure Fixes: Fixed ZMQ transport receive timeout to prevent hangs, Prometheus metric collisions via multi-registry scrape, stale NATS consumers, multi-node Slurm launch arguments, performance degradation from excessive logging in the EPD pipeline, and tool call validation (#5685, #5678, #5948, #5861, #6742, #5504).

Documentation

New Content

  • AKS Storage Guidance: Added Azure AKS storage guidance for Dynamo caches (#5581).
  • TensorRT-LLM Known Issues: Added known issues section for TensorRT-LLM backend (#5801).
  • Mocker Documentation: Added mocker component documentation (#5832).
  • GPU Memory Service: Added overview documentation for GPU Memory Service (#5920).
  • Disaggregated Serving Guide: Added disaggregated serving guide (#6024).
  • Quick Start Sections: Added quick start sections to KVBM and Router guides (#6043).
  • KVBM Disaggregated Setup: Added instructions for TensorRT-LLM KVBM disaggregated setup (#6055).
  • Architecture Docs: Added Discovery Plane documentation and refactored Event Plane with D2 diagrams (#6229).
  • Inference Gateway: Added inference gateway documentation page (#6319).
  • Agent Docs: Added agent readme and documentation (#6320).
  • Frontend Configuration: Documented Frontend requirement for model config file access (#6327).
  • Speculative Prefill Demo: Added multiturn_bench README with speculative prefill demo (#6502).
  • Dev Containers Troubleshooting: Documented Docker 29.x Dev Containers hang root cause and fix (#6505).
  • KV Indexer Docs: Added standalone KV indexer documentation (#6511).
  • Embedding Cache: Documented embedding cache support in vLLM and TensorRT-LLM (#6555).
  • SGLang Observability: Expanded SGLang observability guide with tracing and dashboards (#6556).
  • DGDR v1beta1: Documented the v1beta1 DynamoGraphDeploymentRequest API (#6713).
  • vLLM Multimodal Router: Added docs for vLLM multimodal Router (#6568).
  • Nemotron-3-Super-FP8 Recipes: Added Nemotron-3-Super-FP8 deployment recipes for SGLang aggregated, SGLang disaggregated, and TensorRT-LLM disaggregated with model download manifests (#7254).
  • FastVideo Example and Guide: Added FastVideo text-to-video example with deployment guide and sidebar reorganization (#7283).
  • Getting Started Introduction: Added introduction page to the Getting Started section with platform overview (#7292).

New Release Artifacts

  • snapshot-agent Container: New container image for the Dynamo Snapshot agent. Runs as a privileged DaemonSet that uses CRIU and cuda-checkpoint to snapshot and restore GPU worker processes, enabling fast recovery without model reload. Pairs with the snapshot Helm chart for deployment (Preview in v1.0.0).
  • snapshot Helm Chart: New Helm chart for deploying the Dynamo Snapshot DaemonSet and its supporting resources (ConfigMaps, RBAC, signal host paths). Manages the lifecycle of the snapshot-agent across cluster nodes (Preview in v1.0.0).
  • dynamo-mocker Crate: New Rust crate that simulates inference engine behavior — token generation timing, KV cache allocation, and transfer latency — without requiring GPU hardware. Used for benchmarking Router and Planner behavior, testing disaggregated pipelines, and validating scaling policies.
  • dynamo-kv-router Crate: New standalone Rust crate for KV-aware request routing. Extracts the Router's prefix-matching, load-balancing, and KV cache event processing into a reusable library for disaggregated serving deployments.

For the full list of Dynamo v1.0.0 release artifacts, see: Release Artifacts.

Version Upgrades

Major Dependencies

  • SGLang 0.5.9: Upgraded SGLang to 0.5.9 with updated documentation (#6518).
  • TensorRT-LLM 1.3.0rc5.post1: Upgraded TensorRT-LLM from 1.2.0rc6.post2 to 1.3.0rc5.post1 through intermediate release candidates (#5700, #6402, #6579), including major stability improvements and bug fixes (#6495).
  • vLLM 0.16.0: Upgraded vLLM from 0.12 to 0.16.0 through intermediate releases (0.14.1, 0.15.1), including compilation config updates for each version (#5691, #5819, #6102, #6652).
  • NIXL 0.10.1: Upgraded NIXL from 0.9.x to 0.10.1 with transfer library improvements (#6701, #6832).
  • AI Configurator 0.7.0: Upgraded AI Configurator to 0.7.0 (#6494, #6634, #6791, #6975, #7071, #7050).
  • AIPerf 0.6.0.post1: Upgraded AIPerf to 0.6.0.post1 with first-class integration as Dynamo's benchmarking framework and added guides for benchmarking and Router A/B testing (#5982, #7138, #7155, #7203).

Other Dependencies

  • Grove 0.1.0-alpha.6: Updated grove dependency to 0.1.0-alpha.6 for Snapshot integration (#6015).
  • Minor Dependency Upgrades: Bumped Rust oneshot to 0.1.13 (#5694), updated AWS SDK (#6878), and Go OTEL SDK to v1.40.0 (#6906).

For a list of dependencies for Dynamo v1.0.0 and past releases, see our Support Matrix.

Known Issues

DynamoGraphDeploymentRequest (Preview in v1.0.0)

Planner With Empty Defaults Fails on Non-AIC-Supported Model/Hardware

Applying a DGDR with features.planner: {} (empty defaults) on a model/hardware combination not supported by AIConfigurator causes the profiling job to fail with ValueError: Throughput-based planner scaling requires AIC support. The default planner config assumes throughput-based scaling with rapid in-depth sweeping, which requires AIC support. The Dynamo profiler's validation raises a hard error before AIC is called, even though AIC PR #516 added the backend-side fix.

Workaround: Set features.planner: {pre_deployment_sweeping_mode: thorough} to bypass the AIC support gate check.
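
For reference, the workaround can be sketched as a DGDR spec fragment. Only the planner setting itself comes from the workaround above; the surrounding nesting under spec.features is an assumption:

```yaml
# Sketch: only pre_deployment_sweeping_mode is confirmed by the workaround;
# the enclosing spec/features layout is assumed.
spec:
  features:
    planner:
      pre_deployment_sweeping_mode: thorough  # avoids the AIC support gate
```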

Profiler Rejects Valid SLA Combination

Specifying both optimizationType and ttft/itl SLA targets on a DynamoGraphDeploymentRequest triggers a Pydantic validation error because the schema treats them as mutually exclusive. The optimizationType field is not yet implemented in Dynamo 1.0.0, and any CRDs or manifests that reference it will fail validation. Users who upgrade from earlier versions with existing DGDR specs that include optimizationType alongside latency targets will see immediate admission errors.

Workaround: Remove the optimizationType field from SLA specifications. Use only e2eLatency or the ttft/itl pair (which must be specified together) — these two modes are mutually exclusive.
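
A valid SLA fragment, as a sketch (field names are taken from the text above; all numeric values are placeholders):

```yaml
# Option A: end-to-end latency target only (placeholder value).
sla:
  e2eLatency: 4000
# Option B: TTFT + ITL pair (must be set together; placeholder values).
# sla:
#   ttft: 300
#   itl: 25
# Do not set optimizationType in v1.0.0 -- it fails admission validation.
```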

Interpolation Does Not Propagate Tolerations

Tolerations defined in overrides.dgd on a DynamoGraphDeploymentRequest are not propagated to candidate DynamoGraphDeployments created during the interpolation phase of profiling. This causes worker pods to remain in Pending state on clusters with tainted nodes, because the generated deployments lack the required tolerations to schedule onto those nodes. PR #7226 moved override application to before the interpolation step, but the fix is incomplete for all override paths and has been reopened. A complete fix is pending for a patch release.

Workaround: Manually add the required tolerations directly to each generated DynamoGraphDeployment after interpolation completes, or remove the taints from target nodes during profiling.
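
As a sketch, a toleration block that could be patched onto each generated DynamoGraphDeployment. The toleration values are hypothetical, and where tolerations sit in the DGD spec may differ per service:

```yaml
# Hypothetical toleration matching a GPU node taint; adjust key, operator,
# and effect to the taints actually present on your cluster.
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```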

Thorough Profiler Generates Infeasible TP=1 for MoE Models

The profiler's memory estimation does not account for WideEP communication buffers used by Mixture-of-Experts models, causing it to generate TP=1 configurations that are guaranteed to OOM at runtime. When the thorough profiler enumerates candidate configurations, it underestimates peak memory for MoE architectures, and the resulting deployment crashes immediately upon loading the model.

Workaround: Manually reduce kv_cache_ratio to approximately 0.75 in the profiler configuration to reserve headroom for WideEP buffers, or exclude TP=1 from the candidate search space by setting a minimum tensor parallelism degree.
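
As a sketch, the headroom adjustment is a single profiler config knob (the key's placement within the profiler config file is an assumption):

```yaml
# Reserve ~25% of memory headroom for WideEP communication buffers
# when profiling MoE models.
kv_cache_ratio: 0.75
```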

Infeasible SLA Targets Silently Accepted

When a user specifies SLA targets (TTFT, ITL, or E2E latency) that cannot be met by any profiled configuration, the profiler logs a warning but does not surface it as a Kubernetes condition on the DynamoGraphDeploymentRequest status. Operators monitoring the DGDR via kubectl or cluster dashboards will see no indication that the requested SLAs are unachievable, leading to deployments that run but never meet their performance objectives. This issue has been moved to the backlog and will not be fixed in 1.0.0.

Workaround: After profiling completes, manually inspect profiler pod logs for warnings containing "infeasible" or "no valid configuration" to verify that the requested SLA targets are achievable.
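
The log check can be scripted. The pipeline below applies the same filter you would run on kubectl logs output from the profiler pod; the sample log lines are fabricated for illustration:

```shell
# Filter profiler output for SLA feasibility warnings. Against a live
# cluster, pipe `kubectl logs <profiler-pod>` instead of printf.
printf '%s\n' \
  'INFO  sweep complete: 12 candidate configurations evaluated' \
  'WARN  requested TTFT/ITL targets are infeasible for all candidates' \
  | grep -Ei 'infeasible|no valid configuration'
```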

Multimodal

TRT-LLM Disaggregated Multimodal Raises AttributeError

Running the disaggregated embeddings/prefill/decode pipeline (disagg_e_pd.sh) with TRT-LLM on multimodal models raises AttributeError: 'NoneType' object has no attribute 'keys' during input preprocessing. The root cause is that TRT-LLM does not support the token IDs and multimodal embeddings path in its LLM API; the preprocessor must fall back to passing a text prompt via default_multimodal_input_loader for the embeddings case. A fix was merged (#6840) and cherry-picked as #6920 in RC6, but the fix regressed and the issue persists in the v1.0.0 release.

Workaround: Use aggregated mode instead of disaggregated embeddings/prefill/decode for TRT-LLM multimodal workloads. A corrected fix is planned for a follow-up patch release.

Wan2.1 Video Diffusion Requires Manual imageio Install

Deploying Wan-AI/Wan2.1-T2V-1.3B-Diffusers for text-to-video generation fails with ModuleNotFoundError: No module named 'imageio'. The imageio package is intentionally excluded from the TRT-LLM runtime container to reduce image size, as video generation is an experimental feature. This is documented in docs/backends/trtllm/trtllm-video-diffusion.md.

Workaround: Install the package manually inside the container: pip install imageio imageio-ffmpeg.

Embeddings Cache with TensorRT-LLM and enable_block_reuse

Deploying a TensorRT-LLM multimodal workflow with Embeddings Cache and enable_block_reuse: true is not supported due to limitations in the backend. This will be supported in upcoming releases.

Workaround: Use Embeddings Cache with enable_block_reuse: false. All existing recipes, benchmarks, and guides already reflect this configuration.
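
For TensorRT-LLM, block reuse is typically controlled through the engine's KV cache configuration; a sketch (the kv_cache_config nesting is an assumption and may differ per recipe):

```yaml
# Keep block reuse disabled while the Embeddings Cache is enabled.
kv_cache_config:
  enable_block_reuse: false
```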

Dynamo Snapshot

Snapshot Restore Fails on AKS for vLLM

Snapshot restore of vLLM workers on AKS does not fully reinitialize model state. A single restored worker appears healthy and passes readiness checks but returns empty responses with no generated tokens. Restoring multiple workers simultaneously can hang, causing inference requests to time out. This issue has only been observed on AKS.

Workaround: No workaround available. Fix planned for a follow-up patch release.

KVBM

Pinned Memory Allocation Failure on Blackwell GPUs

KVBM initialization may fail on Blackwell GPUs (GB200, B100, B200) with CUDA_ERROR_INVALID_VALUE when allocating pinned host memory. The root cause is that the PinnedAllocator was hardcoded to device_id 0 instead of using the actual device ID, which causes NUMA binding to select the wrong memory node. A partial fix (#6809) corrects the device ID in the allocator, but some Blackwell configurations may still encounter initialization failures depending on the NUMA topology.

Workaround: Ensure CUDA_VISIBLE_DEVICES is set to expose only the intended GPUs, and verify that the NUMA node assignment matches the GPU topology.
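
A minimal sketch of the environment check. The GPU indices are examples, and the nvidia-smi inspection is left commented because it requires an NVIDIA driver on the host:

```shell
# Expose only the intended GPUs so the allocator binds the right device.
export CUDA_VISIBLE_DEVICES=0,1
echo "visible GPUs: $CUDA_VISIBLE_DEVICES"
# To verify GPU <-> NUMA affinity on the host (requires NVIDIA driver):
#   nvidia-smi topo -m
```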

Performance Degradation When KVBM Is Enabled

Enabling KVBM may degrade inference performance compared to running without it — observed in vLLM disaggregated mode and TensorRT-LLM aggregated mode. KVBM is now enabled by default (#5602), so users may see lower throughput out of the box. The overhead comes from KV cache block management and transfer coordination, which adds latency to each request even when KV cache reuse rates are low.

Workaround: Disable KVBM by unsetting DYN_KVBM_ENABLE if KV cache sharing is not needed for your workload.
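
A sketch of the toggle. DYN_KVBM_ENABLE is the variable named above; the worker launch command itself is omitted:

```shell
# Ensure KVBM is off in this shell before launching the worker.
unset DYN_KVBM_ENABLE
if [ -z "${DYN_KVBM_ENABLE:-}" ]; then
  echo "KVBM disabled for this session"
fi
```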

SGLang

HiCache NIXL Storage Backend Crash on Init

SGLang HiCache with --hicache-storage-backend nixl crashes during scheduler initialization with TypeError: expected str, bytes or os.PathLike object, not MHATokenToKVPoolHost. The HiCacheNixl backend passes the memory pool host object where a file path string is expected. This is an upstream SGLang bug, fixed in sgl-project/sglang#19517 but not yet included in the SGLang version pinned by Dynamo 1.0.0.

Workaround: Use a different HiCache storage backend (e.g., disk). HiCache works correctly with non-NIXL backends.

SGLang DSR1 Recipe Model Loading from PVC Failure

Deploying the SGLang DSR1 recipe or using it as a base config in the SLA profiler may fail because the model-download script downloads the model into a non-standard HuggingFace directory that ModelExpress cannot load, causing prefill and decode workers to enter CrashLoopBackOff.

Workaround: (1) Download the HF model into a standard HF directory and set HF_HOME to the PVC-mounted path, (2) update --model-path to point at the directory containing the downloaded HF cache (not supported for SLA profiler), or (3) provide HF_TOKEN so the model can be downloaded directly.
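
Workaround (1) as a sketch. The PVC mount path and the model id are hypothetical; substitute your cluster's mount point and the model you are deploying:

```shell
# Point HF_HOME at the PVC mount so the download lands in a standard
# HuggingFace cache layout that ModelExpress can load.
export HF_HOME="${PVC_MOUNT:-/tmp/model-cache}/hf"  # hypothetical mount path
mkdir -p "$HF_HOME"
echo "HF cache at: $HF_HOME"
# Then download into it with the standard CLI (model id hypothetical):
#   huggingface-cli download deepseek-ai/DeepSeek-R1
```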

TensorRT-LLM

Qwen3-235B-A22B-FP8 Fails with CuTe Experimental NotImplementedError on Blackwell

Deploying the qwen3-235b-a22b-fp8 recipes (both agg and disagg) on GB200/Blackwell fails at runtime with: NotImplementedError: CuTe Experimental module is only supported on Cuda toolkit 13.1 and above!

Root cause: a packaging mismatch in the container image. The nvidia-cutlass-dsl==4.3.4 wheel baked into the image is the CUDA < 13.1 variant, which stubs out cutlass.cute.experimental by unconditionally raising NotImplementedError, while the image itself ships CUDA 13.1 and TensorRT-LLM's Blackwell FP8 GEMM path (cute_dsl_fp8_gemm_blackwell) requires cute.experimental to be functional.


Looking Ahead

Dynamo v1.1.0 is targeted for April 29, 2026. Here's a small preview of what's already taking shape:

Longer Contexts, Lower Cost

FlexKV manages KV cache across HBM, host memory, and SSD so long-context and high-concurrency workloads don't hit GPU memory limits. Instead of dropping requests when HBM fills up, the system spills KV blocks to cheaper storage tiers and pulls them back on demand.

Resilient Routing at Scale

The KV indexer gains P2P state recovery and automatic ZMQ gap replay, keeping prefix matching correct through node failures without manual intervention. Multi-model and multi-tenant isolation ensures that shared clusters route requests to the right cache even when multiple models share the same infrastructure.

Unified Observability

Forward pass metrics on the event plane and Loki log aggregation with unified OTLP ingestion bring metrics, traces, and logs into a single pipeline. Operators debugging latency or throughput issues in disaggregated deployments no longer need to correlate data across separate tools.


Full Changelog: v0.9.1...v1.0.0