Dynamo v0.6.1 Release Notes
Summary
Dynamo 0.6.1 focuses on improving production readiness, our disaggregated inference architecture, and, as always, performance optimization. In addition, Dynamo 0.6.1 contains the second tranche of UX improvements and upgrades to provide a world-class developer experience. Dynamo seamlessly supports all major LLM frameworks:
- TensorRT-LLM
- vLLM
- SGLang
Production Readiness: Kubernetes deployment capabilities matured with comprehensive operator improvements, including multi-node vLLM data parallelism, automated DGDR profiling as a Kubernetes custom resource, and intelligent Grove resource allocation. Pre-deployment validation now prevents configuration errors before cluster deployment. The build system was streamlined with Docker refactoring and devcontainer standardization for pytest compatibility.
KV Router: The KV Router architecture evolved to support disaggregated prefill/decode serving, with the prefill router now integrated directly into the frontend. Radix tree operations gained non-blocking locks for better concurrency, and the Python bindings now release the GIL during these operations. Metrics collection expanded with TensorRT-LLM Prometheus support and a redesigned composition-based API.
Developer Experience: Documentation underwent a major reorganization to improve clarity and navigation, with content restructured into logical categories and broken links fixed. New guides cover KVBM connector APIs, KV Smart Router benchmarking, and request cancellation for all backends. Model recipes expanded with clearly stated GPU requirements as well as the first Qwen3-32B-FP8 recipe. Lastly, deployment guides added AIConfigurator examples for disaggregated inference.
Major Features & Improvements
Performance and Framework Support
- AIPerf Benchmarking: Updated benchmarking infrastructure by replacing genai-perf with aiperf across components/backends and benchmarking scripts (#3528, #3533, #3306) to standardize performance testing.
- Profiling Automation: Added support for YAML config input for pre-deployment sweep script (#3622) and automatic profiling config generation (#3787) to streamline performance optimization workflows.
- GKE Examples: Published GKE deployment examples (#2721) showcasing cloud platform compatibility.
- GB200 Support: Enhanced SGLang with experimental GB200 FP4 support and updated GB200 FP8 commands (#3745) for latest hardware optimizations.
- API Enhancements: Extended TensorRequest and TensorResponse to contain extra parameters (#3761) and added echo parameter validation for /v1/completions (#3813) for enhanced API capabilities (see the request sketch after this list).
- Python Performance: Optimized Python bindings with GIL release for radix tree operations and added dump_tree_as_events functionality (#3748) to improve concurrency.
- Model Management: Improved frontend with model config files (tokenizer.json et al.) retrieved from MX (#3659) and added Python binding for model download (#3593) to simplify model management.
- Metrics Optimization: Cached compiled regex patterns in Prometheus metrics filtering (#3825) for performance optimization.
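For the echo validation above, a minimal request sketch against the OpenAI-compatible completions endpoint looks as follows; the host, port, and model name are assumed values for a local deployment, not fixed defaults:

```python
import requests

# Assumed local Dynamo frontend address and model name (illustrative only).
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "Qwen/Qwen3-32B-FP8",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "echo": True,  # ask the server to return the prompt along with the completion
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```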
Fault Tolerance & Observability
- Exception Handling: Implemented TensorRT-LLM exception catching (#3544) for improved error handling.
- Request Cancellation: Enabled request cancellation during or before stream establishment (#3635) to prevent resource leaks; a client-side sketch follows this list.
- Metrics Infrastructure: Added TensorRT-LLM Prometheus metrics support with prefixing and filtering (#3676), completed TensorRT-LLM and SGLang metrics validation (#3842), and redesigned metrics API from Trait to composition (#3687) for cleaner observability architecture.
- Audit Logging: Implemented NATS sink for audit logging (#3732) for comprehensive system tracking.
- Test Monitoring: Added test metrics upload (#3648) for continuous quality monitoring.
- Deployment Validation: Added multiple _core*.so detection in sanity_check.py (#3803) to prevent deployment issues.
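As a client-side illustration of the cancellation path, the sketch below aborts a streaming request by disconnecting early, assuming a local OpenAI-compatible frontend; the URL and model name are placeholders, and the backend-side cleanup is what the change above provides:

```python
import asyncio
import httpx

async def cancel_midstream() -> None:
    # Assumed local Dynamo frontend address and model name (illustrative only).
    url = "http://localhost:8000/v1/chat/completions"
    payload = {
        "model": "Qwen/Qwen3-32B-FP8",
        "messages": [{"role": "user", "content": "Write a very long story."}],
        "stream": True,
    }
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as response:
            received = 0
            async for _line in response.aiter_lines():
                received += 1
                if received >= 5:
                    # Breaking out closes the stream; the frontend should then
                    # cancel the in-flight request on the backend.
                    break

asyncio.run(cancel_midstream())
```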
Kubernetes Deployment
- Pre-Deployment Validation: Added pre-deployment checks (#3573) to validate cluster readiness.
- Multi-Node vLLM: Enabled vLLM data parallelism multi-node support in operator (#3595) for distributed deployments.
- Deployment Simplification: Streamlined GAIE deployment with blackbox available via a simple flag (#3591) and enabled router sync in EPP (#3657) for simplified deployment workflows.
- E2E Testing: Added e2e Dynamo deploy tests (#3243) for comprehensive validation.
- DGDR (DynamoGraphDeploymentRequest) Improvements: Added the DGDR custom resource (#3489), refactored DGDR to use the profiler's native configuration format (#3758), turned profiling k8s jobs into sample DGDR requests (#3864), and removed deploy/utils RBAC (#3771) to improve operator functionality (see the manifest sketch after this list).
- Grove Integration: Implemented Grove detection with automatic usage when available (#3789) for intelligent resource allocation.
- MoE Testing: Added vLLM MoE Kubernetes functional tests (#3672) for backend validation.
- Docker Refactoring: Refactored Docker builds by moving EPP build dockerfile (#3555), removing redundant COPY in dev stage of framework Dockerfiles (#3690), and removing unused build args with updated comments (#3688).
- Dev Environment: Standardized development environment with devcontainer configuration using /workspace paths (#3870) and removed hardcoded /workspace paths across tests (#3888) for pytest compatibility.
- Config Organization: Moved engine configs out of components directory (#3772) for better organization.
- EPP Simplification: Removed component parameter from EPP (#3831) to simplify configuration.
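For the DGDR item above, creating the custom resource programmatically might look like the sketch below. The group, version, plural, and spec fields are illustrative placeholders only, not the operator's actual schema; consult the deployment documentation for the real CRD definition:

```python
from kubernetes import client, config

# Hypothetical DGDR manifest; field names and values are placeholders.
dgdr = {
    "apiVersion": "nvidia.com/v1alpha1",
    "kind": "DynamoGraphDeploymentRequest",
    "metadata": {"name": "profile-qwen3-32b", "namespace": "dynamo"},
    "spec": {
        "model": "Qwen/Qwen3-32B-FP8",
        "profiling": {"enabled": True},
    },
}

config.load_kube_config()
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="nvidia.com",
    version="v1alpha1",
    namespace="dynamo",
    plural="dynamographdeploymentrequests",
    body=dgdr,
)
```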
KV Block Manager
- GPU-to-Disk Offload: Enabled KVBM GPU-to-disk offload that bypasses the CPU (#3510) to support performance benchmarking efforts.
- Architecture Simplification: Eliminated ETCD from leader-worker initialization (#3202) to simplify KVBM architecture and reduce dependencies.
Scheduling
Planner
- Prefill Discovery: Added prefill workers to discovery (#3709) for disaggregated serving support.
- Profiling Jobs: Planner's pre-deployment profiling job is now implemented as a DGDR custom resource for improved operator integration.
Router
- Request Cleanup: The router now frees requests from the slot manager when they are stopped (#3623) to prevent memory leaks.
- Data Parallelism Routing: Added DP rank routing (#3597) for data parallelism support.
- Radix Tree Concurrency: Implemented a non-blocking lock for radix uploading and a read lock for radix downloading (#3655) to improve concurrency; a conceptual sketch follows this list.
- Prefill/Decode Disaggregation: Integrated the prefill router directly into the frontend, initially supporting vLLM (#3762), as a major architectural enhancement for disaggregated inference.
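The locking pattern behind the radix tree change can be sketched as a small reader-writer guard: uploads (applying events to the tree) try to take the lock without blocking, while downloads (reading or dumping the tree) share a read lock. This is a conceptual Python illustration only, not Dynamo's Rust implementation:

```python
import threading

class RadixTreeGuard:
    """Illustrative reader-writer guard: non-blocking exclusive access for
    uploads, shared access for downloads."""

    def __init__(self):
        self._readers = 0
        self._readers_lock = threading.Lock()
        self._write_lock = threading.Lock()

    def try_upload(self, apply_events):
        # Non-blocking: skip (and retry later) if another upload is in flight
        # or readers currently hold the tree.
        if not self._write_lock.acquire(blocking=False):
            return False
        try:
            apply_events()
            return True
        finally:
            self._write_lock.release()

    def download(self, read_tree):
        # Shared read: the first reader blocks uploads, the last one releases them.
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._write_lock.acquire()
        try:
            return read_tree()
        finally:
            with self._readers_lock:
                self._readers -= 1
                if self._readers == 0:
                    self._write_lock.release()
```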
Other
Python Bindings & API
- ABI Compatibility: Built the Python package with ABI compatibility across Python 3.10+ (#3571) for broader Python version support.
- KServe Support: Added Python binding for KServe gRPC frontend (#3739) to support standard inference protocols (see the client sketch below).
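Because the KServe gRPC frontend implements the standard KServe (v2) inference protocol, a standard v2 client such as tritonclient's gRPC client may be able to exercise it. In the sketch below, the port, model name, and tensor names are placeholder assumptions; the actual names depend on how the model is registered:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Assumed endpoint, model name, and tensor names (illustrative only).
client = grpcclient.InferenceServerClient(url="localhost:8787")

text = np.array([b"The capital of France is"], dtype=np.object_)
infer_input = grpcclient.InferInput("text_input", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="qwen3-32b-fp8", inputs=[infer_input])
print(result.as_numpy("text_output"))
```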
Runtime Improvements
- Mutex Optimization: Replaced std::sync::Mutex with parking_lot::Mutex in runtime (#3740) for performance optimization.
- Optional Dependencies: Made nats_client optional internally (#3705) to reduce dependencies in minimal deployments.
Documentation
- Deployment Guides: Added AIConfigurator and disaggregated inference example for Dynamo vLLM (#3183) and added Kubernetes deployment guidance to KV router documentation (#3828) to improve deployment workflows.
- KV Router Documentation: Briefly described KV router limitations (#3716), added KV Smart Router A/B benchmarking guide (#3696), and included router benchmarking results (#3856) for comprehensive router documentation.
- Backend Documentation: Added cancellation docs for vLLM, TRT-LLM and SGLang backends (#3783) and moved SGLang and vLLM Prometheus metrics documentation to docs path (#3677) for better organization.
- Model Recipes: Added GPU details for model recipes (#3594) and an initial TensorRT-LLM recipe for Qwen3-32B-FP8 (#3827) to expand model coverage.
- Documentation Structure: Reorganized documentation to make things clearer (#3658), addressed feedback and fixed broken links across repository (#3802), fixed doc links (guide → observability) (#3830), and added redirects (#3973) for improved navigation.
- Glossary Updates: Added MDC to glossary (#3616) and fixed support matrix link in readme (#3698) for better reference materials.
- Build Documentation: Added deprecation notices for unused Docker build arguments (#3568) for clarity on build options.
Bug Fixes
- KVBM Memory Optimization: Fixed vLLM multimodal Qwen CUDA OOM issue (#3598), reduced memory usage to avoid vLLM DeepSeek-R1 OOM (#3660), and fixed a NUMA sensitivity problem in KVBM for TP=1 (#3700) to resolve critical memory management issues.
- KVBM Stability: Avoided offloading redundant prefill blocks and fixed CUDA graph hanging (#3632) and fixed fallocate failure on some file systems (#2680) to ensure stable cache operations.
- Disaggregated Inference - GB200/H100: Fixed OOMs in GB200 default instructions (#3768), corrected GB200 NIXL instructions and max CUDA graph batch size on H100 (#3807), and corrected prefill/decode block defaults when no overlaps (#3811) to enable production-ready disaggregated deployments.
- Tool Calling Compatibility: Fixed DeepSeek tool parsing (#3557) and migrated to new implementation using parse_tool_calls_harmony_complete (#3685) to support multi-model tool calling.
- NATS Event Processing: Fixed NATS queue to use streaming API to prevent KV events from dropping (#3900) for reliable distributed event handling.
- Planner Metrics Accuracy: Standardized all planner TTFT/ITL units to float milliseconds (#3673) and fixed fault tolerance metrics calculation bug (#3674) to ensure accurate SLA measurements.
- Request Error Handling: Included request_id and error details in completions stream failure messages (#3860) for improved debugging and observability.
- vLLM Sampling: Ignored stop key in vLLM sampling params (#3879) to fix request processing issues.
- Kubernetes Stability: Fixed MPI flow and added resourceClaim (#3446) to improve multi-node deployment reliability.
Known Issues
SGLang DS-R1
- SGLang DS-R1 - Disaggregated Deployments: SGLang disaggregated deployments (8/16 GPU) for DeepSeek-R1 experience instability with KV transfer timeouts.
- SGLang DS-R1 - WideEP on H100: DeepSeek-R1 WideEP on H100 requires DeepGEMM kernel precompilation before deployment to prevent initialization failures.
- SGLang DS-R1 - SLA Profiling of DS-R1: SLA profiling for DeepSeek-R1 fails due to a memory leak in SGLang.
What's Next
As we look ahead to Dynamo v0.7.0, we are focused on bringing advanced features while continuing to enhance the architecture and deployment capabilities of Dynamo. Planned highlights include:
- Multi-LoRA support for serving multiple adapters with intelligent routing
- KVBM integration with KV Router for cache-aware load balancing
- Initial work to remove ETCD and NATS dependencies, simplifying deployment architecture
Beginning with v0.7.0, all containers will be rootless for improved security. Additionally, we will release two new assets:
- NEW: Standalone KVBM pip wheels for independent integration with vLLM and TensorRT-LLM
- NEW: Dynamo frontend image with embedded Endpoint Picker enabling future integration with Kubernetes Gateway API Inference Extension
Lastly, we will release examples for Dynamo that showcase the synergistic performance gains when combining disaggregated serving, KV Router, KVBM, and Planner.