Dynamo Release v0.4.1 · ai-dynamo/dynamo

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models across any framework, architecture, or deployment scale. It is an open-source project under the Apache 2.0 license, available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):

NVIDIA TensorRT-LLM
vLLM
SGLang

Release Highlights

This release brings substantial performance improvements for DeepSeek R1, improved fault tolerance with high-availability router testing, and new KV cache management features (KV Block Manager and LMCache integration). It also strengthens Kubernetes deployments with Grove integration and the new Inference Gateway, and expands multimodal support across multiple backends.

Major Features and Improvements

1. Model Performance Breakthroughs

Improved DeepSeek R1 wideEP performance with both SGLang (#2223) and TRT-LLM (#2387)
Added TRT-LLM support for variable sliding window attention (VSWA) for Gemma3 models (#2134)
Launched day-0 support and a deployment guide for GPT-OSS 120B on Blackwell GPUs (#2297)

2. Fault Tolerance & Observability Improvements

Introduced testing for multiple KV routers and frontends for high availability (#2324)
Completed end-to-end request migration testing with vLLM (#2177), ensuring seamless failover
Added router-level request rejection (#2465) for better resource management under load
Unified NATS, DRT & component metrics (#2292) for comprehensive system monitoring
Made health checks more flexible with parameterized /health and /live endpoints (#2230)
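As an illustration of how the /health and /live endpoints above might be consumed, here is a minimal probe sketch. The base URL and the assumption that a healthy service answers with HTTP 200 are illustrative and not taken from this release; the `paths` argument stands in for whatever paths a deployment has configured.

```python
from typing import Callable, Dict, Sequence
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # Non-2xx responses (for example 503) are raised as HTTPError.
        return error.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar errors.
        return 0


def probe(
    base_url: str,
    paths: Sequence[str] = ("/health", "/live"),
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each endpoint path to True when it answers with HTTP 200.

    Assumption: 200 means healthy. The paths are parameters because the
    release notes describe the endpoints as parameterized.
    """
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in paths}


if __name__ == "__main__":
    # Hypothetical address; replace with the frontend or worker to check.
    print(probe("http://localhost:8000"))
```

Passing a custom `fetch` callable keeps the probe testable without a running service.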

3. Enhanced Kubernetes Deployments

Grove

Unlocked multi-node support through Grove integration (#2269, #2405)
Provided workaround for component scaling when using Grove (#2531)

Inference Gateway

Launched Dynamo integration with API Gateway featuring EPP customization (#2345)

4. Advanced KV Cache Management & Transfer

KV Block Manager

First release of KV Block Manager (KVBM) with vLLM, supporting tiered storage across HBM (G1), host memory (G2), and local disk (G3) (#2258)
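The tiering idea can be illustrated with a toy model. The sketch below is not KVBM's API or implementation; it only borrows the tier names G1, G2, and G3 from the note above and assumes a simple least-recently-used policy in which blocks evicted from a faster tier spill into the next slower one and are promoted back to G1 on access. The class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Optional

TIER_ORDER = ("G1", "G2", "G3")  # fastest to slowest, names from the release notes


class TieredBlockCache:
    """Toy three-tier LRU cache; a conceptual sketch, not Dynamo's KVBM."""

    def __init__(self, capacities: Dict[str, int]) -> None:
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIER_ORDER}

    def put(self, block_id: str, data: bytes, tier: str = "G1") -> None:
        """Insert a block, spilling the least recently used block downward."""
        store = self.tiers[tier]
        store[block_id] = data
        store.move_to_end(block_id)
        if len(store) > self.capacities[tier]:
            victim_id, victim = store.popitem(last=False)
            index = TIER_ORDER.index(tier)
            if index + 1 < len(TIER_ORDER):
                self.put(victim_id, victim, TIER_ORDER[index + 1])
            # Blocks evicted from the last tier are dropped.

    def locate(self, block_id: str) -> Optional[str]:
        """Return the name of the tier holding the block, if any."""
        for name in TIER_ORDER:
            if block_id in self.tiers[name]:
                return name
        return None

    def get(self, block_id: str) -> Optional[bytes]:
        """Fetch a block and promote it to G1; return None on a miss."""
        tier = self.locate(block_id)
        if tier is None:
            return None
        data = self.tiers[tier].pop(block_id)
        self.put(block_id, data)
        return data
```

With a capacity of one block per tier, inserting blocks a, b, and c in order leaves c in G1, b in G2, and a in G3; reading a afterwards promotes it to G1 and pushes the others down one tier.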

LMCache integration

Integrated LMCache for improved cache efficiency (#2079)

5. Intelligent Planning & Routing

Router

Enabled router replicas with state-sharing for improved scalability (#2264)

Planner

Extended SLA Planner integration to support SGLang dense models (#2421)

6. Others

Multimodal model support

Shipped multimodal examples with vLLM v1 (#2040)
Added comprehensive Llava model deployment example with vLLM v1 (#2628)
Brought multimodal support to TRT-LLM backend (#2195)

Guided decoding

Implemented frontend support for Structured Output and Guided Decoding (#2380)

Frontend improvements

Added capability to serve multiple models from a single endpoint (#2418)
Introduced LLM metrics for non-streaming requests (#2427)

Bug fixes

Resolved metrics collection timeout issues (#2480, #2506)
Standardized component metric names to dynamo_component_* pattern, preventing Kubernetes label collisions (#2180)
Fixed runtime error propagation in endpoint.rs (#2156)
Corrected processor/router unit queuing behavior with NATS (#1787)
Added missing dependencies to SGLang runtime build (#2279)
Improved HuggingFace token handling in preprocessor tests (#2321)
Implemented detokenize stream functionality (#2413)

Documentation

Created comprehensive TRT-LLM deployment examples for Kubernetes (#2133)
Authored SGLang deployment guide (#2238)
Developed MetricsRegistry API guides (#2159, #2160)
Published guide for collecting and viewing Dynamo metrics in Kubernetes (#2271)
Released Dynamo Inference Gateway documentation (#2257, #2260)
Created SGLang hicache example and guide (#2388)

Build, CI, and Test

Implemented KV routing tests for SGLang (#2424)
Completed request migration end-to-end testing with vLLM (#2177)
Converted vLLM multimodal example to pytest framework (#2451)
Added ZMQ library support for TRT-LLM's UCX connection establishment (#2381)
Created unit tests for SLA planner's interpolator (#2505)

Migration Notes

Component metric names have been standardized to the dynamo_component_* pattern. Users monitoring these metrics should update their dashboards and alerting rules accordingly.
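When auditing a deployment after the rename, it can help to list which scraped metric names already follow the new prefix. The sketch below parses Prometheus text exposition output and splits names by the `dynamo_component_` prefix. The sample metric names are hypothetical placeholders, not names taken from this release.

```python
import re
from typing import List, Tuple

PREFIX = "dynamo_component_"
# A Prometheus metric name starts each sample line.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")

# Hypothetical scrape output, used only for illustration.
SAMPLE = """\
# HELP dynamo_component_requests_total Hypothetical counter.
# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{endpoint="generate"} 12
legacy_requests_total 3
"""


def metric_names(exposition: str) -> List[str]:
    """Return the sorted, de-duplicated metric names in a text exposition."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def split_by_prefix(exposition: str, prefix: str = PREFIX) -> Tuple[List[str], List[str]]:
    """Split metric names into (standardized, other) by prefix."""
    names = metric_names(exposition)
    standardized = [name for name in names if name.startswith(prefix)]
    other = [name for name in names if not name.startswith(prefix)]
    return standardized, other


if __name__ == "__main__":
    standardized, other = split_by_prefix(SAMPLE)
    print("standardized:", standardized)
    print("other:", other)
```

Names in the second list are candidates to review in dashboards and alerting rules, although metrics from unrelated exporters will also appear there.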

Looking Forward

This release sets the foundation for even more ambitious features in our H2 roadmap. The new KV cache management capabilities and multi-node support open doors for larger-scale Dynamo deployments, while our enhanced observability features ensure you can confidently run Dynamo in production.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:

Contributors

We welcome new contributors in this release:
@qimcis, @yinggeh, @da-x, @elyasmnvidian, @ryan-lempka, @JesseStutler, @nate-martinez, @suzusuzu

Full Changelog: v0.4.0...v0.4.1
