Dynamo Release v0.4.1
Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.
Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Release Highlights
This release brings substantial performance improvements for DeepSeek R1, improved fault tolerance with high-availability router testing, and new KV cache management features. It also strengthens Kubernetes deployments with Grove integration and the new Inference Gateway, and expands multimodal support across multiple backends.
Major Features and Improvements
1. Model Performance Breakthroughs
- Improved DeepSeek R1 wideEP performance with both SGLang (#2223) and TRT-LLM (#2387)
- Added TRT-LLM support for variable sliding window attention (VSWA) for Gemma3 models (#2134)
- Launched Day-0 support and a deployment guide for GPT-OSS 120B on Blackwell GPUs (#2297)
2. Fault Tolerance & Observability Improvements
- Introduced testing for multiple KV routers and frontends for high availability (#2324)
- Completed end-to-end request migration testing with vLLM (#2177), ensuring seamless failover
- Added router-level request rejection (#2465) for better resource management under load
- Unified NATS, DRT & component metrics (#2292) for comprehensive system monitoring
- Made health checks more flexible with parameterized /health and /live endpoints (#2230)
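As an illustration of how a client might probe the parameterized health endpoints, here is a minimal sketch using only the Python standard library. The base URL and port are hypothetical, and the helper names `probe_url` and `is_healthy` are ours, not part of Dynamo. The only detail taken from the notes above is that the probe paths (`/health`, `/live`) can be parameterized.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


def probe_url(base_url: str, path: str) -> str:
    """Join a base URL and a probe path without doubling or dropping slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def is_healthy(base_url: str, path: str = "/health", timeout: float = 2.0,
               fetch=urllib.request.urlopen) -> bool:
    """Return True if the probe path answers with a 2xx status.

    `fetch` is injectable so the function can be exercised without a server.
    """
    try:
        with fetch(probe_url(base_url, path), timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response raised by urlopen.
        return False


if __name__ == "__main__":
    base = "http://localhost:8000"  # hypothetical frontend address
    for path in ("/health", "/live"):
        print(path, "ok" if is_healthy(base, path) else "unreachable")
```

If your deployment configures different probe paths, pass them as the `path` argument rather than editing the helper.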
3. Enhanced Kubernetes Deployments
Grove
- Unlocked multi-node support through Grove integration (#2269, #2405)
- Provided workaround for component scaling when using Grove (#2531)
Inference Gateway
- Launched Dynamo integration with API Gateway featuring EPP customization (#2345)
4. Advanced KV Cache Management & Transfer
KV Block Manager
- First release of KV Block Manager (KVBM) with vLLM, supporting tiered storage across HBM (G1), host memory (G2), and local disk (G3) (#2258)
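To make the tiering idea concrete, the following toy model shows one way a three-level block cache can demote blocks from a fast tier to slower ones and promote them again on access. It is a conceptual sketch only, not KVBM's implementation: the `TieredCache` class, the least-recently-used eviction policy, and the promote-on-hit behavior are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy block cache with tiers ordered fastest first, e.g. G1, G2, G3."""

    def __init__(self, capacities):
        # capacities[i] is the number of blocks tier i may hold.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, block_id, data, tier):
        if tier >= len(self.tiers):
            return  # fell off the slowest tier: the block is dropped
        store = self.tiers[tier]
        store[block_id] = data
        store.move_to_end(block_id)  # mark as most recently used
        if len(store) > self.capacities[tier]:
            old_id, old_data = store.popitem(last=False)  # evict the LRU block
            self._insert(old_id, old_data, tier + 1)      # demote one tier down

    def put(self, block_id, data):
        """Store a block in the fastest tier, replacing any older copy."""
        for store in self.tiers:
            store.pop(block_id, None)
        self._insert(block_id, data, 0)

    def get(self, block_id):
        """Return (data, tier_found) and promote the block, or (None, None)."""
        for level, store in enumerate(self.tiers):
            if block_id in store:
                data = store.pop(block_id)
                self._insert(block_id, data, 0)
                return data, level
        return None, None


if __name__ == "__main__":
    cache = TieredCache([1, 1, 1])  # one block per tier, purely for demonstration
    for name in ("a", "b", "c"):
        cache.put(name, name.upper())
    print(cache.get("a"))  # "a" was demoted twice, so it is found in the last tier
```

A real block manager also has to account for transfer cost between tiers and for blocks that are in active use, which this sketch ignores.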
LMCache integration
- Integrated LMCache for improved cache efficiency (#2079)
5. Intelligent Planning & Routing
Router
- Enabled router replicas with state-sharing for improved scalability (#2264)
Planner
- Extended SLA Planner integration to support SGLang dense models (#2421)
6. Others
Multimodal model support
- Shipped multimodal examples with vLLM v1 (#2040)
- Added a comprehensive LLaVA model deployment example with vLLM v1 (#2628)
- Brought multimodal support to TRT-LLM backend (#2195)
Guided decoding
- Implemented frontend support for Structured Output and Guided Decoding (#2380)
Frontend improvements
- Added capability to serve multiple models from a single endpoint (#2418)
- Introduced LLM metrics for non-streaming requests (#2427)
Bug fixes
- Resolved metrics collection timeout issues (#2480, #2506)
- Standardized component metric names to dynamo_component_* pattern, preventing Kubernetes label collisions (#2180)
- Fixed runtime error propagation in endpoint.rs (#2156)
- Corrected processor/router unit queuing behavior with NATS (#1787)
- Added missing dependencies to SGLang runtime build (#2279)
- Improved Hugging Face token handling in preprocessor tests (#2321)
- Implemented detokenize stream functionality (#2413)
Documentation
- Created comprehensive TRT-LLM deployment examples for Kubernetes (#2133)
- Authored SGLang deployment guide (#2238)
- Developed MetricsRegistry API guides (#2159, #2160)
- Published guide for collecting and viewing Dynamo metrics in Kubernetes (#2271)
- Released Dynamo Inference Gateway documentation (#2257, #2260)
- Created SGLang HiCache example and guide (#2388)
Build, CI, and Test
- Implemented KV routing tests for SGLang (#2424)
- Completed request migration end-to-end testing with vLLM (#2177)
- Converted vLLM multimodal example to pytest framework (#2451)
- Added ZMQ library support for TRT-LLM's UCX connection establishment (#2381)
- Created unit tests for SLA planner's interpolator (#2505)
Migration Notes
Component metric names have been standardized to the dynamo_component_* pattern. Users monitoring these metrics should update their dashboards and alerting rules accordingly.
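One way to find dashboard or alert references that need updating is to compare the metric names they use against what a component currently exposes. The sketch below assumes metrics are scraped in the Prometheus text exposition format. The helper names and every metric name in the sample are hypothetical, since these notes do not list the old or new names individually.

```python
import re

# A metric name at the start of a sample line, per the Prometheus text format.
_METRIC_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text: str) -> set:
    """Collect metric names from a text-format scrape, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        match = _METRIC_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def stale_references(referenced, exposition_text: str) -> list:
    """Return referenced metric names that the scrape no longer exposes."""
    exposed = exposed_metric_names(exposition_text)
    return sorted(name for name in referenced if name not in exposed)


if __name__ == "__main__":
    # Hypothetical scrape output and dashboard references, for illustration only.
    scrape = (
        "# TYPE dynamo_component_requests_total counter\n"
        'dynamo_component_requests_total{endpoint="generate"} 3\n'
    )
    dashboard_metrics = ["requests_total", "dynamo_component_requests_total"]
    print(stale_references(dashboard_metrics, scrape))
```

Names reported as stale are candidates for renaming to their `dynamo_component_*` equivalents. Confirm each mapping against a live scrape before changing alerting rules.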
Looking Forward
This release lays the groundwork for features on our H2 roadmap. The new KV cache management capabilities and multi-node support enable larger-scale Dynamo deployments, and the enhanced observability features help you run Dynamo in production with confidence.
Release Assets
Python Wheels:
Rust Crates:
Containers:
- nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1
- nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1
- nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.1
Helm Charts:
Contributors
We welcome new contributors in this release:
@qimcis, @yinggeh, @da-x, @elyasmnvidian, @ryan-lempka, @JesseStutler, @nate-martinez, @suzusuzu
Full Changelog: v0.4.0...v0.4.1