Releases: ai-dynamo/aiconfigurator
AIConfigurator Release v0.6.0
Release v0.6.0
This release focuses on collector upgrades, new/updated performance datasets (H100/H200/B200/Blackwell), and more robust config generation + CI automation.
Highlights
Collector upgrades + compatibility (SGLang/VLLM)
SGLang non-wideep collector upgraded to 0.5.6 (compatible with 0.5.5) (#176)
VLLM bumped to 0.12.0 (#181)
VLLM MLA collector updated for v0.12.0 (#197)
New attention/MLA collection + fixes
Added MLA attention collectors for VLLM (#177)
Fixed 1.2.0rc5 MLA + all-reduce generation (#196)
Blackwell / B200 enablement + datasets
Non-wideep SGLang collector Blackwell support (#218)
Added B200 TRTLLM 1.2.0rc5 data (#202)
Added B200 SGLang 0.5.6.post2 (no wideep) data (#223)
Fixed head dimension handling when not collecting Blackwell data (#236)
Performance DB refresh (H100/H200) + data cleanup
Removed old 0.20.0 DB and added new data from 1.2.0rc5 (H100 & H200) (#198)
Added new performance data for VLLM 0.12.0 (H100 & H200) (#199)
Added new performance data for SGLang 0.5.6.post2 (#200, #201)
Cleaned incomplete/old datasets (VLLM 0.11.0, SGLang 0.5.1.post1, TRTLLM 1.2.0rc2) (#204)
Updated H200 SGLang DB (#235)
More reliable generation + automation
“Lowest latency under SLA” support (#182)
Config/task/perf DB made more error-proof (+ L40S custom all-reduce data) (#183)
Added hf_token support in generated configs (#230)
Auto-download DeepSeek-V3 config from HuggingFace (#227)
CI: improved daily support matrix workflow automation/comparisons (#247)
Added cherry-pick workflow (#205)
Cherry-pick: add k8s_hf_home option (#305)
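The "lowest latency under SLA" mode (#182) changes the search objective: instead of maximizing throughput, it selects the candidate with the lowest end-to-end latency among those that still meet the TTFT/TPOT constraints. A minimal sketch of that selection logic, using illustrative field names (`ttft_ms`, `tpot_ms`) that are assumptions for this example, not aiconfigurator's actual data model:

```python
# Illustrative sketch of "lowest latency under SLA" candidate selection.
# Candidate fields are hypothetical names, not aiconfigurator internals.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    ttft_ms: float   # time to first token
    tpot_ms: float   # time per output token

def lowest_latency_under_sla(candidates, ttft_sla_ms, tpot_sla_ms, osl):
    """Among candidates meeting both SLAs, return the one with the
    lowest end-to-end latency for a response of `osl` output tokens."""
    feasible = [c for c in candidates
                if c.ttft_ms <= ttft_sla_ms and c.tpot_ms <= tpot_sla_ms]
    if not feasible:
        return None
    # E2E latency = TTFT + TPOT * (OSL - 1)
    return min(feasible, key=lambda c: c.ttft_ms + c.tpot_ms * (osl - 1))
```

The key point is that the SLA acts as a hard filter and latency becomes the ranking key, rather than the other way around.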
What's Changed
🚀 Features & Improvements
Upgrade SGLang non-wideep collector to 0.5.6 (compatible with 0.5.5) (#176)
Rename and simplify power-law functions for DeepEP MoE (#174)
Add MLA attention collectors for VLLM (#177)
Bump VLLM to 0.12.0 (#181)
Support “lowest latency under SLA” (#182)
Support 1-GPU collector (#185)
Make perf DB and task config more error-proof; add L40S SGLang custom all-reduce data (#183)
Delete 0.20.0 database and add new data from 1.2.0rc5 (H100 & H200) (#198)
Add new performance data for VLLM 0.12.0 (H100 & H200) (#199)
Add new performance data for SGLang 0.5.6.post2 (#200)
Add new data for SGLang 0.5.6.post2 on H200 (#201)
Make VLLM MLA collector compatible with v0.12.0 (#197)
Add B200 TRTLLM 1.2.0rc5 data (#202)
Refactor wideep collectors for collect.py framework with multiprocess support (#188)
Create cherry-pick.yml (#205)
SGLang non-wideep collector: Blackwell support (#218)
Add B200 SGLang 0.5.6.post2 data without wideep (#223)
Refactor tests and add marks for better management (#224)
Add hf_token support in AIC generated config (#230)
Collector: auto-download DeepSeek-V3 config from HuggingFace (#227)
CI: update daily support matrix workflow to enhance automation and comparison features (#247)
Cherry-pick: add k8s_hf_home option (#305)
🐛 Bug Fixes
Fix FP8 block GEMM collector (#171)
Use TTFT to filter prefill candidates (#169)
MoE args and workload distribution fallback (#168)
Delete wideep MLP for SGLang; improve DB/op query returns; fix collector repeat handling (#170)
Update DeepEP interface for SGLang 0.5.6+ compatibility (#172)
Use model_family for checks instead of model_name (#186)
Fix broken SGLang wideep deepseek path (#195)
Fix 1.2.0rc5 MLA and all-reduce generation (#196)
Delete incomplete data for VLLM 0.11.0, SGLang 0.5.1.post1, TRTLLM 1.2.0rc2 (#204)
Fix config generator missing MoE parallel config when using huggingface_id (#193)
Fix eval FileNotFoundError for service_mode=disagg output path (#194)
Add common code owners to avoid blocking merge (#225)
Update copyright date to 2025–2026 (#220)
Remove nvfp4 shape restriction (#221)
Fix automation pipeline bug (#217)
Fix ISL=1 and smaller local heads (#222)
Support matrix: update CSV + fix daily workflow (#226)
Default cache_transceiver_config.backend to DEFAULT (#231)
AIC eval: support replica > 1 (#234)
Include --max-model-len and --max-num-batched-tokens in VLLM run.sh (#238)
Update H200 SGLang database (#235)
Fix config generator for multiple replicas (#232)
Improve generator MoE parallelism for different backends (#237)
Add generator doc (#241)
Enable hybrid TP/DP/EP mode in wideep SGLang (#229)
Add w4a16_mxfp4 MoE data and set proper moe_quant_mode default for gpt-oss (#240)
Correct v_head_dim and head_dim_total when not collecting data for Blackwell (#236)
Fix multinode disagg config generator for GB200 (#242)
Fix TRTLLM tp=moe_tp × moe_ep behavior (#248)
CI: use self-hosted runners to avoid GitHub runner OOM (#252)
Add SGLang enable-mix-chunk for generator (#257)
Fix SGLang enable mixed chunk (#258)
Support matrix update (#270)
Update generator doc + allow graceful CLI exit when lacking DB data (#286)
Align generator run script with dynamo 0.8.0 (#283)
Use nixl as default disagg transfer backend for SGLang 0.5.6.post2 + allow CLI override (#287)
Fix VLLM/SGLang k8s template missing k8s_model_cache param (#285)
Move PVC support from frontend to workers for SGLang backend (#292)
Docs/guide updates on dynamo deployment + remove dynamoNamespace field (#300, #299)
Handle SGLang L40S missing data gracefully (#306)
AIConfigurator Release v0.5.0.post0
AIConfigurator 0.5.0.post0
AIConfigurator 0.5.0.post0 is a patch release that updates container image compatibility and fixes copyright headers.
Release Highlights
This is a maintenance release for AIConfigurator 0.5.0 that ensures compatibility with Dynamo container image 0.8.0.
Changes
- Dynamo Container Compatibility: Updated AIConfigurator 0.5.0 to use the matched Dynamo container image 0.8.0 (#262)
- Copyright Update: Updated copyright date to 2025-2026 to pass CI checks (#264)
Full Changelog: v0.5.0...v0.5.0.post0
AIConfigurator Release v0.5.0
AIConfigurator 0.5.0
AIConfigurator 0.5.0 brings significant performance optimizations, expands backend support for vLLM and SGLang, and introduces new modeling capabilities including Power Estimation and Power Law workload distribution. This release also adds comprehensive support matrix testing.
Release Highlights
This version focuses on performance efficiency with optimizations to the generation engine and database lookups. New hardware data support includes L40S for SGLang, and we have expanded MoE (Mixture of Experts) support to the vLLM backend. Additionally, users can now target End-to-End (E2E) latency and estimate power consumption.
Features and Improvements
1. Performance Optimizations
- Engine Optimization: Optimized the implementation of run_generation and num_gpu lookups for faster execution (by @anish-shanbhag in #113, #114).
- Efficient Data Handling: Replaced dataframes with dictionaries for batch operations in InferenceSummary generation and added caching for repeated queries to improve speed (by @anish-shanbhag in #115, #128).
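The caching change in #128 exploits the fact that a perf-DB lookup is deterministic for a given key, so repeated queries can be memoized. A minimal sketch of the idea with `functools.lru_cache`; `query_latency_ms` is a hypothetical stand-in, not the actual query function:

```python
# Sketch of memoizing repeated perf-DB queries (the idea behind #128).
# query_latency_ms is a hypothetical stand-in for an expensive
# database/interpolation lookup.
from functools import lru_cache

@lru_cache(maxsize=None)
def query_latency_ms(op: str, batch: int, isl: int) -> float:
    # Pretend this does an expensive interpolation over collected data.
    return batch * isl * 0.001

# Repeated calls with the same arguments are served from the cache:
query_latency_ms("gemm", 8, 1024)
query_latency_ms("gemm", 8, 1024)
```

During a search that evaluates thousands of candidate configurations, the same (op, batch, ISL) keys recur constantly, which is why this kind of memoization pays off.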
2. New Modeling Capabilities
- Power Estimation: Added support for estimating power consumption of configurations (by @kaim-eng in #153).
- Workload Distribution: Introduced a 'power_law' option for workload distribution in the CLI and prefill modeling (by @xutizhou in #147, #134).
- Hybrid Modeling: Added support for hybrid modeling scenarios (by @tianhaox in #125).
- Latency Targets: Users can now set E2E latency as a target metric (by @tianhaox in #145).
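The 'power_law' workload option models heavy-tailed request lengths instead of a single fixed ISL. One common way to realize this is inverse-transform sampling from a truncated power law; the sketch below is illustrative only (parameter names are assumptions, and it assumes alpha != 1):

```python
# Sketch of sampling request lengths from a truncated power law
# p(x) ~ x^(-alpha) on [min_len, max_len]. Illustrative only; not
# aiconfigurator's actual 'power_law' implementation. Assumes alpha != 1.
import random

def sample_power_law_length(alpha: float, min_len: int, max_len: int,
                            rng: random.Random) -> int:
    """Inverse-transform sample, rounded to an integer token count."""
    u = rng.random()
    a = min_len ** (1 - alpha)
    b = max_len ** (1 - alpha)
    x = (a + u * (b - a)) ** (1 / (1 - alpha))
    return int(round(x))
```

With alpha around 2, most sampled lengths are short while a small fraction are very long, which is the traffic shape this distribution is meant to capture.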
3. Framework and Hardware Support
- vLLM Support: Added MoE support for vLLM (by @ilyasher in #139) and generator support (by @Ethan-ES in #144).
- SGLang Support: Added support for WideEP TP attention modeling (by @AichenF in #143), L40S data (non-WideEP) (by @venkywonka in #165), and generator support (by @Ethan-ES in #144).
- DeepSeek: Replaced DeepSeek MLP with GEMM for better performance (by @AichenF in #155).
4. User Interface
- Profiler UI: Introduced a new Profiler UI for better visualization and analysis (by @Harrilee in #117).
- UI Updates: Relocated GPU cost references and updated profiling components (by @Harrilee in #167).
5. Build, CI and Test
- Testing Framework: Added a comprehensive support matrix testing framework (by @Harrilee in #126).
- Maintenance: Added a CODEOWNERS file for better repository management (by @Arsene12358 in #109).
Bug Fixes
- SGLang Fixes: Addressed vulnerabilities in the collector (#108), aligned GEMM quantization methods (#122), and fixed attention collection for the regular path (#123).
- MoE & Model Fixes: Fixed MoE memory issues and NVFP4 GEMM for TRT-LLM 1.x (#131), removed generation repeat attention (#148), and updated workload distribution logic for MoE/DeepSeek models (#146).
- CLI & Compatibility: Fixed CLI for GB200 with TP > 4 (#137), improved Python compatibility by using Union instead of | (#158), and relaxed Pydantic requirements (#161, #162).
- General Fixes: Fixed team name parsing (#130), updated custom_allreduce file locations (#156, #160), and removed PII from error stack traces (#166).
Documentation
- Added design documentation for Power Law distribution (by @YijiaZhao in #119, #129).
- Updated documentation to mention vLLM and SGLang support (by @jasonqinzhou in #159).
New Contributors
- @xueh-nv made their first contribution in #133
- @Harrilee made their first contribution in #117
- @gangmuk made their first contribution in #158
- @dmitry-tokarev-nv made their first contribution in #161
- @venkywonka made their first contribution in #165
- @kaim-eng made their first contribution in #153
- @bcfre made their first contribution in #175
Full Changelog: v0.4.0...v0.5.0
AIConfigurator Release v0.4.0
AIConfigurator 0.4.0
AIConfigurator is a tool that helps users find optimal configurations for deploying LLM inference workloads in distributed, multi-GPU environments. AIConfigurator 0.4.0 adds extensive support for the SGLang backend, covering both the DeepSeek WideEP path and the regular path, with dense and MoE model support. We also added dense model support for the vLLM backend. With this release, AIConfigurator supports all three major backends: TensorRT-LLM, SGLang, and vLLM.
Release Highlights
AIConfigurator 0.4.0 significantly expands backend support, achieving coverage for all three major backends. This release introduces support for L40S GPUs, Qwen3 30B A3B MOE models, and direct HuggingFace model loading via --hf_id.
Additionally, it adds prefix cache modeling support to simulate workloads with system prompts or prefix cache hits, and unifies SGLang paths for better maintainability.
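Prefix cache modeling reduces the prefill work that has to be simulated when requests share a cached prefix (e.g. a common system prompt). A toy sketch of the accounting, with hypothetical names:

```python
# Toy sketch of prefix-cache accounting: when a fraction `hit_rate` of
# requests reuse a cached prefix of `prefix_len` tokens, those tokens
# need no prefill compute. Names are illustrative, not aiconfigurator's.
def expected_prefill_tokens(isl: int, prefix_len: int, hit_rate: float) -> float:
    """Expected number of tokens actually computed during prefill."""
    assert 0 <= prefix_len <= isl and 0.0 <= hit_rate <= 1.0
    return isl - hit_rate * prefix_len
```

Even this simple model shows why prefix caching matters for TTFT: a 4096-token prompt with a 1024-token cached system prompt and a 50% hit rate only needs ~3584 tokens of prefill compute on average.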
Features and Improvements
1. New Hardware Support
2. Framework Support
- Added SGLang attention collector (by @Atream in #73)
- Enhanced allreduce data collector to enable data collection for vLLM backend (by @Arsene12358 in #87)
- Added SGLang disagg support (by @jasonqinzhou in #84)
- Added SGLang agg support (by @jasonqinzhou in #93)
- Added vLLM disagg support (by @ilyasher in #89)
- Added vLLM agg support (by @ilyasher in #98)
- Unified SGLang WideEP and regular paths (by @tianhaox in #99)
3. Expanded Model Support
- Supported using --hf_id as an alternative to --model (by @simone-chen in #86)
- Added Qwen3 30B A3B MOE model support (by @jasonqinzhou in #58)
4. Modeling and Improvements
- Added prefix length modeling support (by @tianhaox in #77)
- Added version subcommand (by @jasonqinzhou in #72)
5. Build, CI and Test
- Added linting and formatting with Ruff, created a developer guide (by @anish-shanbhag in #65)
- Added A100 to e2e test (by @simone-chen in #64)
Bug Fixes
- Added supported systems to CLI help (by @jasonqinzhou in #63)
- Fixed MLP context state (by @AichenF in #78)
- Moved Gradio to optional dependencies (by @Arsene12358 in #90)
- Fixed LLAMA2_7B and LLAMA2_13B errors (by @ilyasher in #97)
- Fixed webapp compatibility with SGLang and vLLM (by @tianhaox in #100)
- Fixed collector minor problems (by @tianhaox in #101)
- Enhanced log file collection with Path and error handling (by @xutizhou in #92)
Documentation
- Updated README to include A100 SXM in support matrix (by @simone-chen in #62)
- Added git lfs pull step before install from source code to download full data files (by @cr7258 in #69)
- Added more A100 docs (by @jasonqinzhou in #67)
New Contributors
AIConfigurator v0.3.0
AIConfigurator 0.3.0
AIConfigurator is a tool that helps users find optimal configurations for deploying LLM inference workloads in distributed, multi-GPU environments such as those using NVIDIA H100, H200, GB200, B200, A100, or future hardware with the Dynamo backend.
Currently AIConfigurator supports NVIDIA TensorRT-LLM as the primary inference engine, with limited support for SGLang.
Release Highlights
AIConfigurator 0.3.0 introduces significant expansion in hardware support, framework compatibility, and model coverage. This release adds support for multiple new GPU architectures, introduces SGLang framework integration, and expands the model library with new Qwen3 variants and GPT-OSS models.
Features and Improvements
1. New Hardware Support
- Added GB200 GPU support (by @YijiaZhao in #32)
- Added B200 GPU support with TensorRT-LLM 1.0.0rc6 data (by @tianhaox in #36)
- Added A100 GPU support (by @simone-chen in #55)
2. New Framework Support: SGLang and Wide-EP
Note: SGLang support is currently limited and experimental.
- Added SGLang GEMM collector and performance data (by @Atream in #28)
- Added SGLang MLA-BMM collector and performance data (by @Atream in #29)
- Added SGLang MLA collector and performance data (by @Atream in #31)
- Added SGLang fused MoE Triton collector (by @Atream in #39)
- Added support for disaggregated DeepSeek in SGLang (by @AichenF in #54)
3. Expanded Model Support
- Added several Qwen3 models (by @tianhaox in #30)
- Added GPT-OSS support in AIConfigurator SDK (by @Arsene12358 in #56)
4. Configuration Generation and Evaluation
- Refactored generator as a standalone module for improved modularity (by @Ethan-ES in #40)
- Added new CLI and SDK support for presets in search space configuration (by @tianhaox in #44)
- Added AIPerf integration for performance evaluation (by @Ethan-ES in #57)
- Improved aggregated and disaggregated modeling and performance (by @tianhaox in #45)
5. Collector Improvements
- Enhanced collector to support data collection for windowed attention and additional MoE configurations (by @Arsene12358 in #33)
Bug Fixes
- Fixed LICENSE file (by @saturley-hall in #21)
- Added allowed path workspace configuration (by @tianhaox in #23)
- Updated MoE tuning logic (by @YijiaZhao in #19)
- Updated Gradio version for compatibility (by @saturley-hall in #35)
- Improved error handling for database loading failures (by @tianhaox in #37, #38)
- Enhanced Kubernetes support with corresponding documentation (by @Ethan-ES in #50)
- Changed NVIDIA SMI command from -lgc to -ac (by @LyleLuo in #49)
- Excluded FP8 from MLA generation post-processing test cases for Ampere architecture (by @simone-chen in #52)
- Fixed TensorRT-LLM 1.0.0 collector compatibility (by @tianhaox in #48)
- Improved tensor initialization to occur directly on device (by @ilyasher in #51)
- Enabled SDK tests in CI pipeline (by @ilyasher in #46)
Documentation
- Added guidance for adding new models (by @tianhaox in #26)
- Added NVIDIA SMI clock locking script to README (by @jasonqinzhou in #47)
- Added git LFS pull step to installation instructions for downloading full data files (by @saturley-hall in #71)
- Enhanced A100 documentation (by @saturley-hall in #70)
New Contributors
- @Arsene12358 made their first contribution in #33
- @ilyasher made their first contribution in #41
- @biswapanda made their first contribution in #42
- @LyleLuo made their first contribution in #49
- @AichenF made their first contribution in #54
For the complete list of changes, see the full changelog.
AIConfigurator Release v0.2.0
AIConfigurator 0.2.0
AIConfigurator is a tool that helps users find optimal configurations for deploying LLM inference workloads in distributed, multi-GPU environments such as those using NVIDIA H100, H200, or future hardware with the Dynamo backend.
Currently AIConfigurator supports NVIDIA TensorRT-LLM as the inference engine.
Release Highlights
AIConfigurator 0.2.0 brings several new features, improvements, and important fixes to enhance configuration workflows and automation.
Features and Improvements
1. Automation
2. Collector improvement
- Mixture-of-Experts (MoE) collector now supports autotuning for improved efficiency (by @YijiaZhao in #11)
3. Dynamo upgrade
Bug Fixes
- Switched to using torch flow collector and added more default memory configuration options (by @tianhaox in #7)
- Improved performance alignment logic and reliability (by @tianhaox in #10)
- Enhanced mixture-of-experts (MoE) support: added power law handling and improved solver calculation for generative attention (by @tianhaox in #15)
- Added safe directory creation to mitigate security risk and clarified error handling (by @tianhaox in #16)
Documentation
- Improved README (https://github.com/ai-dynamo/aiconfigurator/blob/main/README.md) for clarity and precision (by @nealvaidya in #9)
New Contributors
- @nealvaidya made their first contribution in #9
- @Ethan-ES made their first contribution in #13
For the complete list of changes, see the full changelog.
v0.1.1
What's Changed
🚀 Features & Improvements
- feat: power_law_moe collector and webapp by @YijiaZhao in #2
🐛 Bug Fixes
- fix: update project name, version, system data support matrix by @tianhaox in #3
- fix: Harrison/fix spdx headers by @saturley-hall in #6
New Contributors
- @YijiaZhao made their first contribution in #2
- @tianhaox made their first contribution in #3
- @saturley-hall made their first contribution in #6
Full Changelog: v0.1.0...v0.1.1
v0.1.0 Initial release of AIConfigurator
AIConfigurator is a tool designed for Dynamo to optimize disaggregated serving for generative AI models. It automatically finds optimal deployment configurations by searching thousands of candidates in tens of seconds, helping you achieve better throughput and latency in disaggregated serving.
Major Features
- Automated Configuration Search: Search across thousands of deployment configurations to find the optimal one for both disaggregated and aggregated systems, and make an intelligent choice between disaggregated and aggregated deployment
- SLA-based Optimization: Optimize under TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token) constraints to address the throughput@latency problem
- Dynamo Integration: Seamless integration with Dynamo by automatic generation of deployment configurations
- Multi-framework Support: Compatible with NVIDIA TensorRT-LLM backend with extensible architecture for other frameworks (coming soon)
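The SLA-based search above can be pictured as a filter-then-maximize step: discard candidates whose modeled TTFT or TPOT violates the constraints, then pick the best throughput per GPU. A minimal sketch, with candidate fields that are assumptions for illustration:

```python
# Sketch of throughput@latency selection: filter by TTFT/TPOT SLAs,
# then maximize per-GPU throughput. Dict keys are illustrative
# assumptions, not aiconfigurator's real candidate schema.
def best_config(candidates, ttft_sla_ms, tpot_sla_ms):
    feasible = [c for c in candidates
                if c["ttft_ms"] <= ttft_sla_ms and c["tpot_ms"] <= tpot_sla_ms]
    return max(feasible,
               key=lambda c: c["tokens_per_s"] / c["num_gpus"],
               default=None)
```

Because only feasibility and a single scalar objective are involved, thousands of candidates can be scored in seconds, which is what makes the exhaustive search practical.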
Model and System Support
- Comprehensive Model Support (model families):
  - GPT
  - LLAMA (2, 3)
  - MoE
  - QWEN
  - DEEPSEEK_V3
  - NEMOTRON
- System Support: H200 SXM and H100 SXM
User Interfaces
- Command Line Interface (Recommended): Simple CLI with 3 basic arguments for quick start and configuration generation
- Web Application: Interactive web interface for advanced configuration tuning and visualization