@dagil-nvidia released Dynamo v0.9.0 on 12 Feb at 03:46 (commit 76c1889 on release/0.9.0).
Dynamo v0.9.0 Release Notes

Summary

Dynamo v0.9.0 completes the infrastructure decoupling started in v0.8.0, expands multimodal and diffusion model support across all three backends, and introduces smarter scheduling with predictive load estimation and routing hints.

Infrastructure Modernization

The new Event Plane—built on high-performance ZMQ transport with MessagePack serialization—joins the Discovery Plane and Request Plane to form a fully decoupled communication architecture. Dynamo deployments no longer require NATS or etcd: Kubernetes-native service discovery replaces etcd, KV router queries run over the native Dynamo endpoint instead of NATS, and the Event Plane provides a transport-agnostic pub/sub layer for system events. These changes simplify deployment topology and reduce operational dependencies.

Multimodal & Diffusion

Dynamo expanded multimodal support across all three backends in this release. Encoder disaggregation is now available for both vLLM (via the Embedding Cache connector) and TRT-LLM (via a standalone encoder), allowing encoding to run on a separate GPU from prefill/decode. Dynamo can now serve multimodal SGLang workloads on a single GPU instead of requiring a full E/PD split. We also added first-class support for diffusion-based language models — LLaDA2.0 can now be served alongside autoregressive models in the same Dynamo deployment.

Scheduling Intelligence

The Router gained output block tracking with fractional decay for predictive load estimation, expected-output-token awareness, and support for routing hints from external orchestrators such as the Kubernetes Gateway API Inference Extension (GAIE). The Planner added a Kalman filter and Mooncake-style warmup for more accurate load prediction, along with SLA-driven autoscaling for MoE DEP/TEP configurations. The Profiler was enhanced with PVC model cache support and model name validation.

Kubernetes & Observability

The Operator added rollout restart for DynamoGraphDeployments, observability metrics, tolerations/affinity for GPU-specific scheduling, and improved restart reliability. Distributed tracing now spans the full request path, including TCP transport, and the Prometheus metrics stack was simplified with multi-registry scrape support.


First-Time Contributors

We welcome 14 new contributors to the Dynamo project:

  • @siclait contributed a PR that truncates HttpError messages to 8192 characters to prevent ValueError on long messages (#5020).
  • @smatta-star contributed a PR that adds auto-generated OpenAPI spec and helper binary for the frontend (#4802).
  • @shpgy-shpgy contributed a PR that fixes multimodal processing error when handling pure text conversations (#5088).
  • @chay1045 contributed a PR that fixes hidden stop tokens appearing in output by returning None instead (#5238).
  • @wenqiglantz contributed a PR that adds prompt embeds support for pre-computed inference inputs in vLLM (#4739).
  • @yurekami contributed a PR that preserves original model path for frontend config downloads (#5102).
  • @erezzarum contributed a PR that fixes NIXL CUDA12 + CUDA13 build compatibility (#5000).
  • @soodoshll contributed a PR that fixes usage returning None when using text mode with vLLM (#5336).
  • @ls-2018 contributed a PR that fixes tag error handling (#5236).
  • @debermudez contributed a PR that updates aiperf to v0.4.0 (#5331).
  • @wangshangsam contributed a PR that updates vLLM import paths to align with upstream main (#5447).
  • @AbhiOnGithub contributed a PR that adds __all__ exports and __repr__ methods for improved debugging (#5606).
  • @davilu-nvidia contributed a PR that resolves SGLang E/P/D multimodal routing issues (#5500).
  • @adityapuranik99 contributed a PR that adds cupy-cuda12x to SGLang extras for CUDA compatibility (#5627).

Major Features & Improvements

Infrastructure Modernization

Discovery Plane

  • K8s-Native Service Discovery: Enabled Kubernetes-based discovery in GAIE and updated Helm charts/RBAC to support etcd-less deployments, allowing Kubernetes users to deploy without running a separate etcd cluster (#5303, #5432, #5364).
  • etcd Reliability: Resolved potential deadlocks in legacy etcd usage and updated examples to run without etcd, ensuring stable startup for users still on etcd-based discovery (#5091, #5422).
  • List-and-Watch Diffing: Resolved diffing logic issue where worker metadata updates (e.g., LoRA adapter additions) were not picked up, causing stale routing decisions (#5318).

Request Plane

  • NATS Dependency Removal: Migrated KV router worker queries to the native Dynamo endpoint to reduce NATS traffic (#5451), made NATS optional for KV-aware routing in approximate mode so local development works without a NATS server (#5237), fixed NATS container startup failure caused by invalid --max_payload CLI flag by moving it to config file (#5384), and cleaned up asymmetric request plane configuration in launch scripts (#5245).

Event Plane

  • Event Plane Architecture: Introduced a transport-agnostic Event Plane with MessagePack serialization and auto-discovery, decoupling system events (KV cache transfers, notifications) from direct NATS dependency. Added high-performance ZMQ transport as a scalable alternative for latency-sensitive event channels while preserving NATS for backward compatibility (#5674, #5614, #5624).
  • Event Plane NATS Init: Corrected NATS initialization logic based on --event-plane argument across all backends, preventing silent failures when NATS is not configured (#5750).
  • ZMQ Transport Timeout: Added receive timeout for ZMQ transport to prevent indefinite hangs when a publisher is unavailable (#5804).
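
To make the transport-agnostic contract concrete, here is a minimal in-process pub/sub sketch in Python with the same receive-timeout behavior that #5804 added for the ZMQ transport. The EventPlane class, topic strings, and queue-backed transport are all hypothetical stand-ins; Dynamo's actual Event Plane is not this API and uses ZMQ or NATS with MessagePack serialization.

```python
import queue
import threading

# Illustrative stand-in for an Event Plane pub/sub layer. A stdlib queue
# plays the role of the transport so the subscribe/publish interface and
# the receive timeout (which prevents indefinite hangs when a publisher
# is unavailable) are easy to see.
class EventPlane:
    def __init__(self) -> None:
        self._topics: dict = {}          # topic -> list of subscriber queues
        self._lock = threading.Lock()

    def subscribe(self, topic: str) -> queue.Queue:
        """Register a subscriber; events for `topic` land in the returned queue."""
        q: queue.Queue = queue.Queue()
        with self._lock:
            self._topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic: str, event: dict) -> None:
        """Fan an event out to every subscriber of the topic."""
        with self._lock:
            subscribers = list(self._topics.get(topic, []))
        for q in subscribers:
            q.put(event)

    @staticmethod
    def recv(q: queue.Queue, timeout_s: float):
        """Return the next event, or None on timeout instead of hanging forever."""
        try:
            return q.get(timeout=timeout_s)
        except queue.Empty:
            return None
```

The timeout turns a missing publisher into a recoverable None rather than a stuck subscriber, which is the failure mode #5804 addresses for ZMQ.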

Networking

  • IPv6 Support: Added IPv6 support for SGLang disaggregation with proper address formatting, enabling deployments on IPv6-only networks (#5521).

Multimodal & Diffusion

SGLang

  • Aggregated Multimodal: Enabled Dynamo to serve multimodal SGLang workloads on a single GPU, removing the previous requirement for a 2-GPU E/PD split (#5450).
  • Diffusion LM Support: Enabled Dynamo to serve diffusion-based language models (LLaDA2.0) through the SGLang backend, using existing Dynamo infrastructure for pre/post processing with a new diffusion handler (#5533).
  • Multi-Image Qwen EC: Resolved multi-image bug in the Dynamo EC connector that dropped images beyond the first in multimodal requests (#5514).

TensorRT-LLM

  • Standalone Encoder: Added encoder disaggregation support to Dynamo's TRT-LLM integration, enabling encoding to run on a separate GPU from prefill/decode (#4668).
  • Multimodal Tokenizer Reuse: Optimized Dynamo's multimodal request pipeline for TRT-LLM by reusing the tokenizer across requests instead of reinitializing per request, reducing per-request latency (#5217).

vLLM

  • Embedding Cache Connector: Added the Embedding Cache (EC) connector to Dynamo's vLLM integration for encoder disaggregation, where the encoder stores embeddings by hash and PD workers consume them from cache—eliminating redundant encoding and reducing TTFT. Also enabled multiple image inputs per request and parallelized image loading (#5162, #5463, #5444).
  • Prompt Embeds Support: Added pre-computed embeddings as a secure input method to Dynamo, allowing applications to transform sensitive data into embeddings before submission for improved privacy and flexible prompt engineering (#4739).
  • EPD Refactor: Refactored Dynamo's EPD handler to orchestrate the full encode-to-PD flow (processor → encoder → processor → PD), supporting multiple multimodal data items per request instead of just one (#4994).
  • Decode Worker Qwen-VL: Resolved disaggregated decode crash for Qwen2.5-VL models caused by missing image_grid_thw data needed for mRoPE position encoding (#5281).
  • EPD Sampling Params: Corrected sampling params parsing in Dynamo's vLLM EPD flow that could silently produce incorrect generation parameters (#5833).
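
To illustrate the encoder-disaggregation pattern behind the EC connector: the encoder stores embeddings keyed by a hash of the raw media bytes, and prefill/decode workers fetch from the cache instead of re-encoding. EmbeddingCache, content_hash, and the toy encode function below are illustrative names under assumed semantics, not the actual connector API.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Key embeddings by content, so identical images share one cache entry."""
    return hashlib.sha256(data).hexdigest()

class EmbeddingCache:
    def __init__(self) -> None:
        self._store: dict = {}
        self.encode_calls = 0  # counts actual encoder invocations

    def encode(self, data: bytes) -> list:
        """Stand-in for a vision-encoder forward pass on a separate GPU."""
        self.encode_calls += 1
        return [float(b) for b in hashlib.md5(data).digest()[:4]]

    def get_or_encode(self, data: bytes) -> list:
        """PD workers call this: a cache hit skips encoding entirely."""
        key = content_hash(data)
        if key not in self._store:
            self._store[key] = self.encode(data)
        return self._store[key]
```

Repeated requests carrying the same image hit the cache, which is where the redundant-encoding elimination and TTFT reduction come from.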

Performance & Hardware

  • SGLang Stream Output: Enforced stream_output=True in SGLang ServerArgs, switching from cumulative-to-delta token conversion to direct disjoint segment passthrough—reducing per-token processing overhead in streaming responses (#5510).
  • Multimodal Payload Optimization: Removed serialization/deserialization in gather_multi_model_data, significantly reducing latency for requests with large base64-encoded payloads (#5485).
  • Zero Copy TCP Decoder: Implemented zero copy decoder with bounded worker pool for TCP ingress, eliminating memory leaks under high concurrency and reducing per-message allocations (#5376).
  • MoE Data Parallel Tuning: Reduced VLLM_MOE_DP_CHUNK_SIZE to 384, lowering HBM footprint enough to enable inference on 16xH200 MoE configurations that previously hit OOM (#5307).
  • TRT-LLM GB200 Support: Resolved memory allocation failure on GB200 hardware (#5328) and updated the Wide-EP disaggregated GB200 recipe for compatibility with latest TRT-LLM version (#5383).

Router

  • Router Scheduling Intelligence: Added output block tracking with fractional decay for predictive load estimation (#5452), plumbed expected output tokens so the router can account for generation length when distributing requests (#5181), and added a flag to disable decode KV reuse assumption so the router computes actual block hashes for more accurate cache-hit predictions (#5350).
  • Routing Hints from Headers: Added support for reading routing hints from request headers, allowing external orchestrators (e.g., GAIE) to influence routing decisions without modifying the request body (#5502).
  • PrefillComplete Hook: Implemented PrefillComplete handling in Dynamo EPP Scorer Plugin, eliminating the router-to-EPP sync overhead that added latency on every prefill completion (#5592).
  • KV State Routing Examples: Added examples for KV state approximation-based routing, demonstrating how to use approximate KV cache state for routing without full NATS dependency (#5320).
  • Client Instance Reconciliation: Added periodic refresh of available instances to recover from missed KV store updates, preventing stale routing to removed workers (#5043).
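
The fractional-decay load estimate can be sketched as follows. This is an illustrative Python model, not the router's actual implementation; the decay constant, block accounting, and least-loaded selection are all assumptions made for the sketch.

```python
# Sketch of predictive load estimation with fractional decay: each
# worker's tracked output-block count decays by a fraction per tick
# (approximating decode progress draining the backlog), while newly
# scheduled requests add their expected output blocks up front.
class DecayingLoadEstimator:
    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay            # fraction of tracked load retained per tick
        self.load: dict = {}          # worker id -> estimated outstanding blocks

    def on_schedule(self, worker: str, expected_output_blocks: float) -> None:
        """Charge the worker for the request's expected generation length."""
        self.load[worker] = self.load.get(worker, 0.0) + expected_output_blocks

    def tick(self) -> None:
        """Periodic decay: old load fades instead of being tracked exactly."""
        for worker in self.load:
            self.load[worker] *= self.decay

    def pick_worker(self, workers: list) -> str:
        """Route to the worker with the lowest predicted load."""
        return min(workers, key=lambda w: self.load.get(w, 0.0))
```

Plumbing expected output tokens into on_schedule is what lets the estimator account for generation length rather than treating all requests as equal.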

Planner

  • Load Predictor Improvements: Added Mooncake-style trace warmup so the predictor starts with realistic data instead of cold-starting (#5529), a Kalman filter as a lower-latency alternative to ARIMA (#5554), and a log1p(y) fallback when ARIMA collapses, preventing degenerate zero-traffic forecasts (#5545).
  • SLA Planner for DEP/TEP: Extended MoE planner profiler to support TEP/DEP (Tensor/Data Expert Parallelism) configurations with vLLM backend, enabling SLA-driven autoscaling for MoE models (#4783).
  • Planner RBAC Fix: Added endpointslices RBAC permission for Planner service account, which was preventing the planner from discovering workers in Kubernetes deployments (#5195).
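
For intuition on the Kalman filter option: a scalar Kalman filter smooths a noisy load signal with one multiply-add per observation, which is why it is a lower-latency alternative to ARIMA. The sketch below is the textbook random-walk form with assumed noise parameters, not the Planner's actual model.

```python
# Minimal scalar Kalman filter: state x is the estimated load (e.g.
# requests/sec), modeled as a random walk observed with noise.
class ScalarKalman:
    def __init__(self, q: float = 1e-2, r: float = 1.0) -> None:
        self.q = q    # process noise variance (how fast true load drifts)
        self.r = r    # measurement noise variance (how noisy observations are)
        self.x = 0.0  # state estimate
        self.p = 1.0  # estimate variance

    def update(self, z: float) -> float:
        # Predict: random walk keeps the estimate, uncertainty grows.
        self.p += self.q
        # Correct: blend in measurement z weighted by the Kalman gain.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= 1.0 - k
        return self.x
```

Seeding such a filter with warmup traces (the Mooncake-style warmup above) avoids the cold-start period where the estimate is still converging.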

Profiler

  • Profiler Enhancements: Added PVC model cache support to avoid re-downloading models (#5124), model path mounting for custom model locations (#5212), explicit model name validation to catch misconfiguration early (#5978), and refactored WebUI preview config to use planner DGD generation logic for consistency between UI and actual deployments (#4940).

Frontend

  • OpenAI API Enhancements: Added response_format for structured decoding via JSON schema, enabling constrained generation through vLLM's structured output backend (#5127). Added continuous_usage_stats for per-chunk token usage in streaming responses (#5139), chat_template_kwargs alias for compatibility with existing clients (#5112), and auto-generated OpenAPI specifications for API documentation (#4802).
  • MiniMax Tool Parser: Added MiniMax 2.1 tool call parser support, enabling structured tool use with MiniMax models (#5549).
  • Usage in Text Mode: Corrected usage returning None when using text mode with vLLM, so clients can track token consumption in non-chat completions (#5336).
  • Multiple Text Components: Corrected handling of multiple text components in a single request, which previously dropped all but the first text element and produced incorrect prompts (#5196).
  • Response Body Size Limit: Limited max response middleware body size with get_body_limit() instead of usize::MAX, preventing potential memory exhaustion from oversized error messages (#5268).
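
A hedged example of a request using the new options: the payload below follows the OpenAI chat-completions shape with response_format for JSON-schema-constrained decoding (#5127) and continuous_usage_stats for per-chunk usage (#5139). The model name is a placeholder, and exact field placement (e.g. stream_options) should be checked against the auto-generated OpenAPI spec (#4802) for your deployment.

```python
import json

# Illustrative request body; "Qwen/Qwen3-0.6B" and the schema are placeholders.
payload = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "Give me a city and its population."}
    ],
    # Constrained generation: the backend is asked to emit JSON matching
    # this schema rather than free-form text.
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_info",
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "population": {"type": "integer"},
                },
                "required": ["city", "population"],
            },
        },
    },
    "stream": True,
    # Per-chunk token usage in streaming responses.
    "stream_options": {"continuous_usage_stats": True},
}

body = json.dumps(payload)
```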

KV Block Manager

  • KVBM Object Storage Support: Added NIXL object storage backend (S3) for KV cache transfers, enabling remote KV cache storage via S3-compatible endpoints instead of requiring direct GPU-to-GPU transfers (#5060, #5059, #5063).
  • Positional Lineage Hash: Introduced PositionalLineageHash, a 128-bit hash that encodes parent-child block relationships directly in the hash value—enabling backward traversal without pointer chasing for faster cache lookups (#5522).
  • CUDA Memory Pools: Enabled CUDA memory pools for vectorized KV cache transfer, preventing data corruption that occurred when transfer buffers were freed before async operations completed (#5475).
  • Remove Store Recursion: Replaced recursive KV cache event storage with an iterative solution, preventing stack overflow on long context sequences (>32K tokens) that exceeded the default stack size (#5497).
  • KVBM Reliability Fixes: Resolved KvCacheConfig YAML settings (e.g., enable_block_reuse) being silently lost when publish_events_and_metrics was enabled (#5198), KV store event drops for long context sequences (#5499), and a race condition where requests arrived before KV subscriber initialization completed (#5149).
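
The lineage-hash idea can be illustrated by chaining each block's hash through its parent's, so a single 128-bit value identifies an entire prefix. The sketch below shows only the chaining concept; the real PositionalLineageHash bit layout and hash function differ, and ROOT, block_hash, and hash_sequence are names invented for this example.

```python
import hashlib

ROOT = 0  # lineage hash of the empty prefix

def block_hash(parent: int, tokens: tuple) -> int:
    """128-bit hash that mixes the parent's hash with this block's tokens,
    so the value commits to the whole ancestry, not just the local block."""
    data = parent.to_bytes(16, "little") + b",".join(
        str(t).encode() for t in tokens
    )
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "little")

def hash_sequence(token_blocks: list) -> list:
    """Chain hashes over a sequence of token blocks."""
    hashes, parent = [], ROOT
    for block in token_blocks:
        parent = block_hash(parent, block)
        hashes.append(parent)
    return hashes
```

Because each hash already encodes its lineage, matching a block hash implies matching the full prefix, which is what removes the pointer chasing from backward traversal.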

Kubernetes Deployment

  • Operator Enhancements: Added rollout restart mechanism for DynamoGraphDeployment so operators can trigger rolling updates (#5118), enabled operator observability metrics for monitoring operator health (#5543), made service topology immutable to catch invalid topology changes at validation time (#5240), and disabled Scaling Adapter creation by default to simplify initial deployments (#5180).
  • Helm Directory Refactor: Refactored deploy/helm to remove manual installation paths and legacy "cloud" references, simplifying Helm-based deployments (#5042).
  • Namespace Normalization: Normalized Dynamo namespace computation from authoritative sources instead of a deprecated user field, ensuring metrics are correctly labeled and queryable across namespaces (#5231).
  • Tolerations and Affinity: Added comprehensive tolerations and affinity support for all platform components, enabling scheduling on tainted or GPU-specific nodes (#5757).
  • Default Affinities: Corrected default affinities value in operator Helm chart from list to object, which previously caused Helm template rendering errors (#5083).
  • Service Object Labels: Applied user-defined labels to Kubernetes Service objects in operator, which were previously silently dropped (#5459).
  • Pod Hash Restriction: Restricted pod hash for worker/instance ID to <2^53 to prevent JavaScript floating-point precision loss that caused incorrect routing decisions when IDs exceeded 2^53 (#5471).
  • Operator Restart Reliability: Corrected structured logging in restart status tracking (#5802), added status file to prevent output-copier hang on failures (#5939), fixed restart state tracking for parallel restarts that caused incorrect status reporting (#5959), and replaced hardcoded status strings with constants (#5756).
  • Container Security: Changed container mounting to not default to privileged mode, reducing the attack surface for container deployments (#5644).
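
On the 2^53 restriction above: JavaScript numbers are IEEE-754 doubles, which cannot represent every integer at or above 2^53, so a JS-based JSON consumer would silently round larger worker IDs. Python floats are the same doubles, so the loss is easy to demonstrate; the is_js_safe helper is illustrative, not a Dynamo function.

```python
# 2**53 is the largest power of two below which every integer fits
# exactly in an IEEE-754 double (the JavaScript number type).
MAX_SAFE = 2**53

# Exactly representable at the boundary...
assert float(MAX_SAFE) == MAX_SAFE
# ...but 2**53 + 1 has no double representation and rounds back down,
# which is how two distinct worker IDs could collide after JSON parsing.
assert float(MAX_SAFE + 1) == float(MAX_SAFE)

def is_js_safe(worker_id: int) -> bool:
    """IDs below 2**53 survive a round-trip through a double unchanged."""
    return worker_id < MAX_SAFE
```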

Fault Tolerance & Observability

  • TRT-LLM Tracing: Added tracing context propagation for TRT-LLM backend, enabling end-to-end inference tracing across frontend, router, and TRT-LLM engine in Tempo/Grafana (#5377).
  • TRT-LLM Request Migration: Implemented request migration on worker shutdown via shutdown_event, ensuring in-flight requests are migrated to healthy workers instead of being dropped (#5599).
  • SGLang Metrics: Exposed TokenizerMetricsCollector metrics via Prometheus for SGLang runtime monitoring (#5120).
  • Distributed Tracing: Added span event logging controlled by DYN_LOGGING_SPAN_EVENTS for debugging request flows (#5400), and restored TCP transport trace propagation that was silently dropping all trace headers—causing disconnected spans in Tempo/Grafana (#5283).
  • Metrics and Export: Simplified Python metrics API to a single Prometheus Exposition Format callback, removing 91% of unused metrics code (#5594). Avoided Prometheus metric collisions via multi-registry scrape for deployments running multiple metric sources (#5741), restored cached_tokens metric for non-streaming requests (#5193), and resolved Tokio runtime panics caused by OTEL_EXPORT_ENABLED=true (previously only "1" was accepted) (#5129).

Version Upgrades

  • vLLM v0.14.1: Upgraded vLLM to v0.14.0 (#5593, #5222) and then to v0.14.1 (#5840).
  • SGLang v0.5.8: Upgraded SGLang to v0.5.8 (#5655, #5148).
  • TensorRT-LLM v1.3.0rc1: Upgraded TensorRT-LLM to v1.3.0rc1 (#5807, #5580, #5356, #5017).
  • NIXL v0.9.0: Upgraded NIXL to v0.9.0 (#5528).
  • Go v1.25: Upgraded Go toolchain and base images to v1.25.0 (#5241).
  • CUDA v12.9.1: Upgraded CUDA to v12.9.1 for the vLLM container (#5397).
  • CUDA 13 Container Builds: Added CUDA 13 container builds for vLLM and SGLang (#5218).
  • UCX v1.20.0: Upgraded UCX to v1.20.0 (#5058).
  • AIC v0.6.0: Upgraded AIC dependency to v0.6.0 (#5600).
  • GAIE v1.2.1: Updated GAIE to release version with routing hints in headers (#5503) and enabled GAIE v1.2.1 for recipes (#5955).
  • AIPerf v0.4.0: Upgraded AIPerf pinned version to v0.4.0 (#5331).

Examples & Recipes

  • Triton Worker Example: Added Triton worker example demonstrating how to expose Triton Inference Server models via Dynamo's distributed runtime with service discovery and routing (#4971).
  • Hello World etcd Removal: Updated hello_world example to run without etcd, aligning with the infrastructure modernization to remove external dependencies (#5422).

Deprecation Notices

  • Dynamo Graph Helm Chart: The deploy/helm Dynamo graph Helm chart is deprecated as of v0.9.0. The Dynamo Operator and DynamoGraphDeployment CRD are now the recommended deployment method for Kubernetes. The operator provides rollout restart, observability metrics, topology validation, and autoscaling capabilities that the Helm chart does not support. Users should migrate to operator-based deployments.

Bug Fixes

  • Truncate HttpError Message: Fixed handling of long error messages by truncating them to 8192 characters instead of raising a ValueError (#5020).
  • Pure Text Conversations: Fixed error when processing pure text conversations in multimodal contexts (#5088).
  • vLLM Multimodal Fixes: Applied minor vLLM multimodal fixes (#5792).
  • Race Condition in TP>1: Fixed race condition in TP>1 where ImmediateTransferResult arrived before CreateSlot (#5393).
  • KServe Error Propagation: Fixed KServe to propagate errors to client in stream infer instead of silently failing (#5263).
  • Frontend Model Path: Preserved original model path for frontend config downloads (#5102).
  • Audio Before Text: Fixed multimodal requests where audio content appears before text, which previously raised ValueError due to hardcoded content array indexing (#5143).
  • Generation Prompt Behavior: Fixed logic for adding generation prompt to match vLLM's native behavior, preventing unexpected tokens at the start of generated output (#5223).
  • Hidden Stop Tokens: Fixed issue where hidden stop tokens appeared in output text instead of being suppressed, causing unwanted EOS tokens in client responses (#5238).
  • vLLM Data Parallel Ports: Fixed ZMQ port conflicts when running multiple vLLM workers by assigning unique ports per dp_rank, eliminating "Address already in use" errors (#5224).
  • vLLM GPU Memory Utilization: Adjusted GPU memory utilization settings to accommodate vLLM's runtime memory requirements and prevent allocation failures (#5766).
  • HF Model Name Resolution: Fixed SGLang/vLLM to use the HF model name instead of the full local filesystem path, resolving model identification failures in deployments without shared storage (#5274).
  • HF URL Frontend Config: Fixed frontend to send HF URLs rather than local filesystem paths, so the frontend can download model configs even when nodes do not share a filesystem (#5290).
  • vLLM Graceful Shutdown: Fixed vLLM graceful shutdown to properly terminate worker processes, preventing orphaned GPU processes after service stop (#5835).
  • Served Model Name Priority: Fixed model name resolution to check --served-model-name first before falling back to --model/--model-path (#5900).
  • KVBM Cache Dissemination: Fixed how vLLM disseminates cached requests to KVBM, ensuring KV cache events are correctly tracked for cache-aware routing (#5976).
  • SGLang Disagg Decode Fallback: Fixed disaggregated decode fallback without --router-mode kv, which was accidentally removed and caused errors when KV routing was not configured (#5075).
  • SGLang OTEL Embedding Models: Fixed OTEL instrumentation causing import errors for SGLang embedding-only deployments (#5100).
  • SGLang Metrics Fixes: Fixed SGLang metrics collection and prefill router issues (#5147), and corrected the metrics prefix from sglang_ to sglang for consistency with upstream (#5709).
  • SGLang YAML Parsing: Fixed YAML config parsing for store_true arguments that silently produced incorrect SGLang configuration (#5513).
  • SGLang PYTHONPATH: Fixed ModuleNotFoundError when running SGLang containers with Slurm/enroot as root user by adding PYTHONPATH (#5071).
  • TRT-LLM Thread Synchronization: Fixed blocking synchronization that caused startup delays and health check timeouts during TRT-LLM engine initialization (#5333).
  • TRT-LLM OOM Prevention: Fixed example TRT-LLM worker parameters that caused out-of-memory errors with default settings (#5250).
  • TRT-LLM Container Dependencies: Added missing dependencies to the TRT-LLM container: NCCL symlink (#5257), numactl (#5367), and msgpack (#5823).
  • Prefill Router Round Robin: Fixed prefill router round-robin logic that was unevenly distributing prefill requests across workers (#5313).
  • Push Handler Stop Word: Fixed worker's push_handler non-blocking error upon stop word that caused premature stream termination (#5157).
  • Concurrent LoRA Loading: Fixed race conditions when multiple concurrent requests attempt to load the same LoRA adapter simultaneously (#5184).
  • Runtime Config Notification: Fixed blocking on notification of at least one runtime config, preventing requests from being processed before any backend instance is ready (#5191).
  • MPI Argument: Fixed missing mpi argument in srun commands for correct multi-node execution (#5984).

Documentation

  • Documentation Refactor: Consolidated all component documentation under docs/components/, reorganized sidebar navigation with nested dropdowns for Router, Planner, KVBM, and Frontend, surfaced Integrations as a new sidebar section (LMCache, SGLang HiCache, FlexKV, KV Events), and added a comprehensive Disaggregated Serving Guide with D2 architecture diagrams (#6019, #6024).
  • Fern Full Migration: Completed full migration of documentation to Fern format, rebuilding all pages from the restructured content for versioned documentation at docs.ai-dynamo.dev (#5445, #6050).
  • Documentation Improvements: Updated TRT-LLM install prerequisites (#5194), fixed broken links (#5258, #5330), clarified UCX configuration (#5247), and improved benchmarking docs (#5258).
  • Feature Compatibility Matrix: Added feature compatibility matrix so users can quickly determine which backends support which features (#5349, #5395, #5646).
  • Kube-based Service Discovery: Updated documentation to reflect that Kubernetes-backed discovery is now the default, replacing etcd-based instructions (#5211).
  • Release Artifacts: Added comprehensive artifact inventory in release-artifacts.md, documenting all containers, packages, and Helm charts shipped per release (#5619).
  • Metrics Docs KVStats: Updated metrics documentation after kvstats removal to prevent users from configuring removed metrics (#5710).
  • TRT-LLM NIXL Setup: Corrected TRT-LLM NIXL backend setup instructions that contained incorrect configuration steps (#5791).
  • README and Contributing Updates: Refactored Dynamo README and quick_start_local.rst for v0.9.0 (#5838), applied quick fixes (#5582, #5587), removed PRs merged badge (#5623), and improved the contributor experience of CONTRIBUTING.md (#5507).
  • Documentation Fixes: Fixed SGLang docs links (#5904), tracing doc ZMQ port conflict (#5200), Distributed_Inference README formatting (#5964), docker-compose path (#5839), and added NIXL backend configuration with typo fixes (#5564).
  • Documentation Audit: Comprehensive documentation audit and updates for 0.8 (#5380).
  • Qwen3 Recipes Update: Updated Qwen3-235B-A22B-FP8 recipes (#5930).
  • GAIE Integrations README: Fixed README for GAIE integrations (#5902).
  • Disagg Multinode Example: Added host and bootstrap port to disaggregated multinode example, which were missing and caused connection failures (#5309).
  • pip install trtllm: Updated README for pip install ai-dynamo[trtllm] to document the TRT-LLM pip install path (#5312).
  • KV Events Docs: Added KV events documentation explaining the event-driven cache management architecture (#5386).
  • KVBM pip install: Added pip install section to KVBM README for users installing outside containers (#5423).
  • Version Availability: Clarified which features are only available in v0.8.1 and later, preventing confusion when users reference v0.8.0 docs (#5492).
  • Recipe Feedback: Addressed VDR feedback for recipes—fixed bugs, improved docs, and added READMEs for each recipe directory (#5479).

Known Issues

TensorRT-LLM

Qwen2VL Multimodal Not Supported in EPD Disaggregated Mode

Qwen2VL multimodal models fail in TRT-LLM EPD disaggregated mode with AttributeError: 'Qwen2VLInputProcessorBase' object has no attribute 'support_mm_disagg'. This is an upstream TRT-LLM limitation — the Qwen multimodal input processor does not implement the disaggregated multimodal interface.

Workaround: Use non-disaggregated mode for Qwen2VL multimodal, or use a different model family (e.g., LLaMA) for EPD disaggregated multimodal inference.

Multimodal Garbage Output When image_url Precedes Text

TRT-LLM disaggregated mode produces incoherent output for multimodal requests where the image_url content block precedes the text content block in the OpenAI-format chat message. This affects Qwen multimodal models in TRT-LLM's disaggregated path and is an upstream TRT-LLM issue.

Workaround: Ensure text content appears before image_url in the request content array.

GPT-OSS-120B Worker Segfault

The gpt-oss-120b recipe using TRT-LLM workers crashes with a segmentation fault during model inference. This is an upstream TRT-LLM issue affecting large model deployments. The recipe test is skipped until an upstream fix is available.

DeepSeek-R1-FP4 Multi-Node OOM

Multi-node TRT-LLM deployment of DeepSeek-R1-FP4 fails with an out-of-memory error immediately after model weights are 100% loaded. The OOM occurs during inference engine initialization, not during weight loading. This is an upstream TRT-LLM memory management issue. The affected test is skipped until an upstream fix is available.

KVBM

TRT-LLM KVBM Disagg Benchmark Hang on H100

The TRT-LLM KVBM disaggregated benchmark hangs and times out after 300 seconds on H100 hardware. Fixes are in progress.

Inference Gateway

GAIE Basic Black Box Integration Test Failure

The inference-gateway basic black box integration test fails due to ErrImagePull or flag provided but not defined errors. The test cannot pull the required container image or encounters unrecognized CLI flags in the deployed EPP.

Kubernetes

Rollout Restart Status Stuck in "Restarting" During Concurrent Restarts

When a second rollout restart is triggered on a DynamoGraphDeployment while a previous restart is still in progress, the status.restart.phase field remains stuck in "Restarting" and never transitions to completion. Single rollout restarts work correctly. The issue only occurs when restarts overlap.

Router

Router Decision Tests Fail on vLLM and SGLang

The test_router_decisions_vllm_dp and test_router_decisions_sglang_dp tests fail with AssertionError: Timeout waiting for workers. Found 0/1 instance(s), expected 2. Data-parallel workers do not register in time for the router to discover them. Fix deferred to Dynamo v1.0.0.

Workaround: Build Dynamo from ToT (top of tree, i.e., the latest main branch).

Inference Gateway (GAIE EPP)

GAIE EPP Missing DYNAMO_MODEL Environment Variable

GAIE EPP recipes fail at startup with No ModelDeploymentCard found for model: Qwen/Qwen3-0.6B because the EPP defaults to Qwen/Qwen3-0.6B when DYNAMO_MODEL is not set. The recipe deployment manifests do not include this environment variable, causing the EPP to look for the wrong model.

Workaround: Add DYNAMO_MODEL environment variable to the EPP deployment manifest with the correct model name (e.g., RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic for the Llama 3 70B recipe).

GAIE EPP dynamo-inject-workerid Not Found

GAIE EPP deployments fail with plugin type 'dynamo-inject-workerid' is not found in registry because the EPP configmap references plugins (dynamo-inject-workerid, dynamo-cleanup) that are not registered in the v0.9.0 EPP image.

Workaround: Remove the dyn-pre (dynamo-inject-workerid) and dyn-cleanup (dynamo-cleanup) plugin entries from the EPP configmap YAML.

Planner

Planner Scaling E2E Test Fails With AIPerf Timeout

The Planner scaling E2E test (test_scaling_e2e) fails with RuntimeError: Load generation timed out during Phase 1 baseline load generation (8 req/s). The LoadGenerator invokes aiperf profile with a calculated timeout (2x duration + 120 s), but AIPerf does not complete within that window. The root cause is an AIPerf v0.4.0 profiling hang during initial load generation, the same upstream timeout issue that affects automated SLA profiling.

Profiler

SLA Profiler DEP Configuration Failures

The SLA Profiler's MoE DEP (Data Expert Parallelism) configuration test (sla_config_moe, #4783) has multiple open failures that prevent reliable automated profiling of DEP configurations. Known issues include: the profiler splitting bash block-scalar arguments into individual array elements, generating memory-infeasible DEP candidates for vLLM, and a decode OOM during EPLB all_gather where model load (105.52 GiB) plus EPLB rearrangement exceeds H200 memory. These are tracked across multiple bugs and fixes are in progress.

AIPerf

Profiling Timeout During Start Profiling

Automated SLA profiling may fail with a TimeoutError during the "Start Profiling" phase. This is a known issue in AIPerf v0.4.0 and is not a Dynamo-specific bug. Awaiting upstream fix.


Looking Ahead to Dynamo v1.0.0

Dynamo v1.0.0 is targeted for March 11, 2026. Building on the infrastructure modernization completed in v0.9.0, the focus shifts to performant production-grade serving, platform automation, and new inference paradigms.

Performance

AIConfigurator improvements across all backends, fully composed recipes combining KV-aware routing, disaggregated serving, and KV cache offloading for turnkey deployment.

Production-Grade Serving

Hierarchical Planner for heterogeneous worker pools, request rejection for overload protection, fast recovery with continuous availability, and WideEP fault tolerance.

Kubernetes Platform

Grove topology-aware orchestration with GB200 automation, ModelExpress performance optimization for model loading.

Agentic Workflows

Predictive routing with proactive load balancing, intelligent KV cache retention for high-reuse sessions, and KV cache offloading/prefetching for tool calls.

Multimodality and Diffusion

Multimodal hash router support for vLLM and SGLang, E/P/D disaggregation optimization, and support for SGLang Diffusion/Omni and vLLM Omni.