TensorRT-LLM Release 0.21.0
Key Features and Enhancements
- Model Support
  - Added Gemma3 VLM support
- Features
  - Added large-scale EP support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP support in ScaffoldingLLM
  - Added support for w4a8_mxfp4_fp8 quantization
  - Added support for FP8 rowwise quantization
  - Added generation logits support in the TRTLLM sampler
  - Added log probs support in the TRTLLM sampler (see the sampling sketch after this list)
  - Optimized TRTLLM sampler performance for the single-beam, single-step case
  - Enabled disaggregated serving for Qwen3
  - Added EAGLE3 support for Qwen3
  - Fused finalize and allreduce for the Qwen MoE model
  - Refactored the fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
  - Added FP8 block-scale GEMM support on SM89
  - Enabled the overlap scheduler between draft forward passes
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3 support
  - Enabled Finalize + AllReduce + Add + RMSNorm fusion
  - Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- Benchmark
  - Added an all_reduce.py benchmark script for testing
  - Added a beam width option to the trtllm-bench latency command
  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
  - Enabled trtllm-bench to run LoRA and added basic end-to-end perf testing capability for LoRA
  - Added post_proc support for trtllm-bench
  - Added a no_kv_cache_reuse option and streaming support for the trtllm serve bench script
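To illustrate the new sampler capabilities (log probs and generation logits), the minimal sketch below uses the LLM API. The `logprobs` and `return_generation_logits` parameter names, the output fields read at the end, and the model path are assumptions rather than confirmed signatures; consult the SamplingParams and output-object references for the exact schema.

```python
# A minimal, hedged sketch of requesting log probs and generation logits
# through the LLM API. Parameter names and output fields are assumptions;
# check the SamplingParams documentation for the exact schema.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint

sampling_params = SamplingParams(
    max_tokens=32,
    logprobs=1,                     # assumed: return the top-1 log prob per generated token
    return_generation_logits=True,  # assumed: also return the raw generation logits
)

for output in llm.generate(["The capital of France is"], sampling_params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)      # assumed field on the completion object
```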
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
- The dependent public PyTorch version is updated to 2.7.1.
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5 (a quick version-check sketch follows this list).
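As a sanity check after installing this release, a small sketch like the following prints the dependency versions actually present in the environment. Exact patch versions may differ by platform and install method, and the NCCL reported by PyTorch may differ from the standalone NCCL package that TensorRT-LLM depends on.

```python
# A minimal sketch that prints the dependency versions present in the current
# environment. Expected values for this release: PyTorch 2.7.1, TensorRT 10.11.x,
# NCCL 2.27.5 (as pinned by TensorRT-LLM; PyTorch's bundled NCCL may differ).
import torch
import tensorrt

print("PyTorch :", torch.__version__)
print("TensorRT:", tensorrt.__version__)
print("NCCL    :", ".".join(map(str, torch.cuda.nccl.version())))
```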
API Changes
- Set _AutoDeployLlmArgs as the primary config object
- Removed the decoder request from the decoder interface
- Enhanced the torch_compile_config in the LLM args (see the sketch after this list)
- Removed the redundant use_kv_cache field from PytorchConfig
- Moved allreduce_strategy from the committed API to the reference API
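For the enhanced torch_compile_config, the hedged sketch below shows how it might be passed through the LLM args in the PyTorch workflow. The import path and the field names (whose torch_compile_ prefix was dropped in this release) are assumptions, so the TorchCompileConfig and LlmArgs reference documentation remains the source of truth.

```python
# A hedged sketch of configuring torch.compile through the LLM args in the
# PyTorch workflow. The import path and field names below are assumptions;
# consult the TorchCompileConfig / LlmArgs reference for the exact schema.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import TorchCompileConfig  # assumed import path

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder checkpoint
    torch_compile_config=TorchCompileConfig(
        enable_fullgraph=True,   # assumed field: compile with fullgraph=True
        enable_inductor=False,   # assumed field: keep the Inductor backend off
    ),
)
```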
Fixed Issues
- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
- Fixed CUDA graph padding for spec decoding (#4853)
- Fixed Llama 4 long-context issue (#4809)
- Fixed max_num_sequences calculation with overlap scheduling (#4532)
- Fixed chunked prefill + overlap scheduling (#5761)
- Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
- Fixed index out of bounds error in spec decoding (#5954)
- Fixed MTP illegal memory access in CUDA graph warmup (#5947)
- Fixed no free slots error with spec decode + disagg (#5975)
- Fixed off-by-one attention window size for Gemma3 1B (#5564)
Known Issues
- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
- In 0.21, full chunked-attention support was added so that the Llama 4 model can run correctly with sequence lengths greater than 8K. This functional enhancement introduces a known performance regression on Hopper that affects only Llama 4. The root cause has been identified, and a fix will be included in a future release.
What's Changed
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5221
- [test] split nemotron test cases from examples_test_list by @crazydemo in #5238
- Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in #5235
- [feat] Add llm args to tune python gc threshold by @nv-yilinf in #5141
- [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in #5128
- [TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in #4558
- chore: Waive CI failure. by @SimengLiu-nv in #5252
- [infra] Make test_chunked_prefill faster by @mikeiovine in #5248
- Update internal cutlass commit. by @Tracin in #5228
- test: add more pytorch cases in perf test by @ruodil in #5237
- Fix: https://nvbugs/5345720 by @QiJune in #5259
- test: [CI] remove closed bugs by @xinhe-nv in #5218
- [TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in #5215
- fix mla test by @qsang-nv in #5240
- doc: add document of benchmarking for Qwen3 by @byshiue in #5158
- update setup.py for special cases by @qsang-nv in #5227
- move some test cases of TensorRT backend back by @QiJune in #5232
- [feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in #5206
- [TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in #5073
- CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in #5272
- refactor: Unify decoder test with e2e worklfow by @Funatiq in #5239
- [feat] Piecewise cuda graph support for MLA by @liji-nv in #4467
- chore: Mass integration of release/0.20 by @amirkl94 in #5082
- [TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5207
- None - Some clean-ups for the automation pipeline by @chzblych in #5245
- Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in #5224
- delete cubins by @qsang-nv in #5274
- infra[TRTLLM-5635] remove package stage in CI build by @niukuo in #5075
- [Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in #4885
- [chore] Remove BaseDraftTokenManager by @mikeiovine in #5251
- [infra] Report CI authorization errors to PR by @tburt-nv in #5175
- Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in #5298
- refactor: Update decoder buffer and logits management by @Funatiq in #4450
- fix: only set _mpi_session if world_size is > 1 by @achartier in #5253
- update LlmRequest.is_dummy property by @QiJune in #5283
- test: update qa test list by @crazydemo in #5305
- CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in #5275
- [fix][test] move deepseek single gpu tests to post merge by @omera-nv in #5280
- Waive L0 tests by @yiqingy0 in #5308
- feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length by @yizhang-nv in #4971
- chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in #4900
- [feat]: improve performance of XQA-MLA for sm120 by @lowsfer in #5087
- doc:update contributing md for internal developers by @nv-guomingz in #5250
- test: cherry-pick deepseek rcca cases in main branch by @ruodil in #5307
- [TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in #5139
- CI: fix TensorRT H200 tests by @QiJune in #5301
- [TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in #5159
- chore: Refine printed info of CHECK_TYPE. by @bobboli in #5295
- refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in #5246
- chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in #5309
- test: correct unittest rerun behavior by @tongyuantongyu in #5273
- Fix rerun step by @yiqingy0 in #5319
- Waive L0 by @yizhang-nv in #5311
- tests: add multi nodes tests by @xinhe-nv in #5196
- feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in #5214
- [Infra]Update 5080 and 5090 case condition since we will upgrade driver by @EmmaQiaoCh in #5317
- chore: Update README.md to expose meet-up info by @juney-nvidia in #5329
- Remove duplicated test cases by @HuiGao-NV in #5323
- Add disagg slurm scripts by @qiaoxj07 in #5243
- Unwaive disaggregated serving accuracy tests by @Tabrizian in #5095
- [feat] Multi-node CI testing support via Slurm by @yuanjingx87 in #4771
- [fix][test] remove some cpp test cases from h100 by @omera-nv in #5335
- [fix][test] remove duplicate test runs by @omera-nv in #5241
- chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in #5293
- [fix][test] clear cuda cache before unittests automatically by @omera-nv in #5121
- fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in #4727
- ci: Split long running jobs into multiple jobs by @Funatiq in #5268
- [feat] Fusion finalize and allreduce for qwenmoe model by @zongfeijing in #5223
- chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in #5261
- [test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in #5125
- Waive L0 test by @yiqingy0 in #5349
- chore: bump version to 0.21.0 by @yiqingy0 in #5325
- tests: cherry-pick from main branch, add qwen3 test cases and amend test name in perf test by @ruodil in #5357
- [Infra]cherry pick sanity check yml change for 5080 and 5090 from main by @EmmaQiaoCh in #5363
- doc: cherry pick #5334 by @MartinMarciniszyn in #5368
- fix: Fix skip by mpi size fixture by @yizhang-nv in #5355
- Fix: missing clientId when serialize and deserialize response (cherry-pick #5231) by @kaiyux in #5378
- tests: fix typos in qa test by @crazydemo in #5421
- nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 by @brb-nv in #5453
- feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 by @Wanli-Jiang in #5364
- test: set enable_attention_dp=True in default deepseek settings by @ruodil in #5461
- tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5462
- [Infra] - Waive failed tests on release/0.21 by @EmmaQiaoCh in #5477
- Fix permission for local user issues in NGC docker container. by @MartinMarciniszyn in #5373
- [nvbug 5273941] fix: broken cyclic reference detect by @Superjomn in #5417
- [nvbug/5354825] Fix nougat test image url by @amukkara in #5496
- fix: fix regression in LOCAL_USER by @ixlmar in #5517
- doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5516
- fix: constrain grepping in docker/Makefile by @ixlmar in #5493
- [Infra][release/0.21] - waive failed tests by @EmmaQiaoCh in #5537
- ci: unwaive llmapi launch test by @Superjomn in #5281
- [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions by @ixlmar in #5490
- [cherry-pick] [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5553
- [Infra][release/0.21] Update nccl to 2.27.5 by @EmmaQiaoCh in #5539
- fix [nvbug5351244]: test_mpi_session submit sync/async by @Superjomn in #5608
- fix:https://nvbugs/5362398 by @nv-guomingz in #5609
- [nvbug 5300551] test: increase block count in eviction test by @zhengd-nv in #5465
- test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests by @yizhang-nv in #5397
- doc: Fix outdated config in DeepSeek best perf practice doc by @kaiyux in #5638
- fix: [https://nvbugs/5355219] Fix bug of Qwen3 235B CI on dgx_gb200 by @byshiue in #5602
- [https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. by @FrankD412 in #5625
- fix: Investigate Gemma3 1B decoder output discrepancy by @brb-nv in #5564
- [Infra] - Waive failed cases on release/0.21 by @EmmaQiaoCh in #5674
- Doc: Update invalid hugging face URLs by @Linda-Stadter in #5683
- [NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 by @farazkh80 in #5651
- [TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access by @DomBrown in #5676
- [nvbug/5341178][fix] Fix OOM in Llama 4 accuracy test by @brb-nv in #5735
- test: Move some of the test from post merge to pre-merge, update dgx b200 test case by @yizhang-nv in #5640
- [5321981] fix: Fix the Llama3.1 405B hanging issue. by @hyukn in #5698
- [Infra][nvbugs/5370968] - Unwaive l0 test by @yiqingy0 in #5750
- [nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation by @yizhang-nv in #5463
- [nvbug/5337601][fix] Fix disagg + speculative decoding by @Tabrizian in #5558
- [Infra] - Always use x86 image for the Jenkins agent by @chzblych in #5756
- test: fix some test failure and add llama_nemotron models in perf sanity test, add more torch cases by @ruodil in #5693
- fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in #5773
- [nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens by @LinPoly in #5201
- fix _pad_attention_dp_dummy_request by @QiJune in #5583
- Fix docker cache mount by @MartinMarciniszyn in #5763
- [nvbug/5302638][nvbugs/5310314] fix _handle_cancelled_requests by @QiJune in #5532
- cherry pick #5416 by @QiJune in #5776
- [nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend by @LinPoly in #5541
- [nvbug5266240] chore: unwaive test_llm_with_dummy_weights by @Superjomn in #5744
- [https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada. by @PerkzZheng in #5779
- fix: [https://nvbugspro.nvidia.com/bug/5345215] Unwaive for bug 5345215. by @bobboli in #5606
- [nvbugs/5326453] Avoid nesting NCCL grouping in allgather OP by @QiJune in #5789
- fix: [https://nvbugs/5351130][https://nvbugs/5333654] Unwaive for bug 5351130 and 5333654. by @bobboli in #5821
- doc: Update gb200 doc by @yizhang-nv in #5840
- test: remove duplicate cases in perf sanity test by @ruodil in #5870
- [nvbug 5327706][fix] fix mgmn postprocess error by @LinPoly in #5835
- [nvbugs/5345391] fix: chunked prefill + overlap scheduling by @Funatiq in #5761
- cherry-pick: [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in #5874
- [https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error by @dc3671 in #5865
- test: Add Gemma3 unit tests to CI in release/0.21 by @brb-nv in #5899
- tests: Fix lora perf test by @amirkl94 in #5875
- fix: [nvbugs/5351130] Adjust DSV3-Lite tests free_gpu_memory_fraction to 0.75 to prevent OOM on CI. by @bobboli in #5896
- chore: Port leftover 0.20 by @amirkl94 in #5907
- fix [nvbug/5351244]: address remote mpi session submit by @Superjomn in #5664
- fix: [5328141] increase tolerance for test_fp8_block_scale_gemm by @nekorobov in #5849
- fix: timeout and broken pipe in disagg and worker tests by @zhengd-nv in #5827
- [nvbugs/5333742] fix MTP illegal memory access in cuda graph warmup by @lfr-0531 in #5947
- fix: fix index out of bounds error in spec decoding by @lfr-0531 in #5954
- [nvbugs/5368410][fix] Disable moe allreduce for multi node by @yizhang-nv in #5918
- [fix] Release slots with spec decode + disagg by @Tabrizian in #5975
- [TRTLLM-6495] doc: add disclaimer for 3rd party software installation. by @nv-guomingz in #6039
- [None] - Waive L0 tests by @yiqingy0 in #6082
- Cherry Pick: PR #6076 by @ZhanruiSunCh in #6088
- add release notes for 0.21 release by @QiJune in #6049
- fix: Fix triton backend build [nvbug 5396469] by @pcastonguay in #6098
- [None][infra] Cherry-pick #6128 and #6130 from main branch by @chzblych in #6151
- [Doc][Qwen3] update qwen3 into support-matrix by @byshiue in #6161
- [fix]: Revert commit 388b491 by @LinPoly in #6143
- doc: update known issues by @QiJune in #6247
- [fix] Cherry pick "[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue" by @mikeiovine in #6267
- doc: update release notes by @QiJune in #6324
- test: Relax Gemma3 unit test thresholds by @brb-nv in #6016
- tests: Add llama4 functional cases by @crazydemo in #6392
- doc: update release notes by @QiJune in #6438
- [https://nvbugspro.nvidia.com/bug/5415268] fix illegal smem access with chunked attention by @PerkzZheng in #6401
- [doc] Update perf_overview.md for release 0.21 by @zbpatel in #6270
- [None][infra] Pin the version for triton to 3.3.1 (#6508) by @chzblych in #6519
New Contributors
- @jellysnack made their first contribution in #5214
Full Changelog: v0.21.0rc2...v0.21.0