TensorRT-LLM Release 0.21.0
Key Features and Enhancements
- Model Support
  - Added Gemma3 VLM support
- Features
  - Added large-scale EP support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP support in ScaffoldingLLM
  - Added support for w4a8_mxfp4_fp8 quantization
  - Added support for FP8 rowwise quantization
  - Added generation logits support in the TRTLLM sampler
  - Added log probs support in the TRTLLM sampler (see the sampling sketch after this list)
  - Optimized TRTLLM sampler performance for the single-beam, single-step case
  - Enabled disaggregated serving for Qwen3
  - Added EAGLE3 support for Qwen3
  - Fused finalize and allreduce for the Qwen MoE model
  - Refactored the fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
  - Added FP8 block-scale GEMM support on SM89
  - Enabled the overlap scheduler between draft forward passes
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3 support
  - Enabled Finalize + AllReduce + Add + RMSNorm fusion
  - Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- Benchmark
  - Added an all_reduce.py benchmark script for testing
  - Added a beam width option to the trtllm-bench latency command
  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
  - Enabled trtllm-bench to run LoRA and added basic end-to-end perf testing capability for LoRA
  - Added post_proc support for trtllm-bench
  - Added a no_kv_cache_reuse option and streaming support for the trtllm serve bench script
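To illustrate the new sampler capabilities (log probs and generation logits), the minimal sketch below uses the LLM API. The `logprobs` and `return_generation_logits` parameter names, the output fields read at the end, and the model path are assumptions rather than confirmed signatures; consult the SamplingParams and output-object references for the exact schema.

```python
# A minimal, hedged sketch of requesting log probs and generation logits
# through the LLM API. Parameter names and output fields are assumptions;
# check the SamplingParams documentation for the exact schema.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")  # placeholder checkpoint

sampling_params = SamplingParams(
    max_tokens=32,
    logprobs=1,                     # assumed: return the top-1 log prob per generated token
    return_generation_logits=True,  # assumed: also return the raw generation logits
)

for output in llm.generate(["The capital of France is"], sampling_params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)      # assumed field on the completion object
```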
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
- The dependent public PyTorch version is updated to 2.7.1.
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5 (a quick version-check sketch follows this list).
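As a sanity check after installing this release, a small sketch like the following prints the dependency versions actually present in the environment. Exact patch versions may differ by platform and install method, and the NCCL reported by PyTorch may differ from the standalone NCCL package that TensorRT-LLM depends on.

```python
# A minimal sketch that prints the dependency versions present in the current
# environment. Expected values for this release: PyTorch 2.7.1, TensorRT 10.11.x,
# NCCL 2.27.5 (as pinned by TensorRT-LLM; PyTorch's bundled NCCL may differ).
import torch
import tensorrt

print("PyTorch :", torch.__version__)
print("TensorRT:", tensorrt.__version__)
print("NCCL    :", ".".join(map(str, torch.cuda.nccl.version())))
```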
API Changes
- Set _AutoDeployLlmArgs as the primary config object
- Removed the decoder request from the decoder interface
- Enhanced the torch_compile_config in the LLM args (see the sketch after this list)
- Removed the redundant use_kv_cache field from PytorchConfig
- Moved allreduce_strategy from the committed API to the reference API
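For the enhanced torch_compile_config, the hedged sketch below shows how it might be passed through the LLM args in the PyTorch workflow. The import path and the field names (whose torch_compile_ prefix was dropped in this release) are assumptions, so the TorchCompileConfig and LlmArgs reference documentation remains the source of truth.

```python
# A hedged sketch of configuring torch.compile through the LLM args in the
# PyTorch workflow. The import path and field names below are assumptions;
# consult the TorchCompileConfig / LlmArgs reference for the exact schema.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import TorchCompileConfig  # assumed import path

llm = LLM(
    model="Qwen/Qwen3-8B",  # placeholder checkpoint
    torch_compile_config=TorchCompileConfig(
        enable_fullgraph=True,   # assumed field: compile with fullgraph=True
        enable_inductor=False,   # assumed field: keep the Inductor backend off
    ),
)
```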
Fixed Issues
- Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
- Fixed CUDA graph padding for spec decoding (#4853)
- Fixed Llama 4 long-context issue (#4809)
- Fixed max_num_sequences calculation with overlap scheduling (#4532)
- Fixed chunked prefill + overlap scheduling (#5761)
- Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
- Fixed index out of bounds error in spec decoding (#5954)
- Fixed MTP illegal memory access in CUDA graph warmup (#5947)
- Fixed no free slots error with spec decode + disagg (#5975)
- Fixed off-by-one attention window size for Gemma3 1B (#5564)
Known Issues
- accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
- In 0.21, full chunked-attention support was added so that the Llama 4 model can run correctly with sequence lengths greater than 8K. This functional enhancement introduces a known performance regression on Hopper that affects only Llama 4. The root cause has been identified, and a fix will be included in a future release.
What's Changed
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5221
- [test] split nemotron test cases from examples_test_list by @crazydemo in #5238
- Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in #5235
- [feat] Add llm args to tune python gc threshold by @nv-yilinf in #5141
- [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in #5128
- [TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in #4558
- chore: Waive CI failure. by @SimengLiu-nv in #5252
- [infra] Make test_chunked_prefill faster by @mikeiovine in #5248
- Update internal cutlass commit. by @Tracin in #5228
- test: add more pytorch cases in perf test by @ruodil in #5237
- Fix: https://nvbugs/5345720 by @QiJune in #5259
- test: [CI] remove closed bugs by @xinhe-nv in #5218
- [TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in #5215
- fix mla test by @qsang-nv in #5240
- doc: add document of benchmarking for Qwen3 by @byshiue in #5158
- update setup.py for special cases by @qsang-nv in #5227
- move some test cases of TensorRT backend back by @QiJune in #5232
- [feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in #5206
- [TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in #5073
- CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in #5272
- refactor: Unify decoder test with e2e worklfow by @Funatiq in #5239
- [feat] Piecewise cuda graph support for MLA by @liji-nv in #4467
- chore: Mass integration of release/0.20 by @amirkl94 in #5082
- [TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5207
- None - Some clean-ups for the automation pipeline by @chzblych in #5245
- Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in #5224
- delete cubins by @qsang-nv in #5274
- infra[TRTLLM-5635] remove package stage in CI build by @niukuo in #5075
- [Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in #4885
- [chore] Remove BaseDraftTokenManager by @mikeiovine in #5251
- [infra] Report CI authorization errors to PR by @tburt-nv in #5175
- Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in #5298
- refactor: Update decoder buffer and logits management by @Funatiq in #4450
- fix: only set _mpi_session if world_size is > 1 by @achartier in #5253
- update LlmRequest.is_dummy property by @QiJune in #5283
- test: update qa test list by @crazydemo in #5305
- CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in #5275
- [fix][test] move deepseek single gpu tests to post merge by @omera-nv in #5280
- Waive L0 tests by @yiqingy0 in #5308
- feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length by @yizhang-nv in #4971
- chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in #4900
- [feat]: improve performance of XQA-MLA for sm120 by @lowsfer in #5087
- doc:update contributing md for internal developers by @nv-guomingz in #5250
- test: cherry-pick deepseek rcca cases in main branch by @ruodil in #5307
- [TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in #5139
- CI: fix TensorRT H200 tests by @QiJune in #5301
- [TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in #5159
- chore: Refine printed info of CHECK_TYPE. by @bobboli in #5295
- refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in #5246
- chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in #5309
- test: correct unittest rerun behavior by @tongyuantongyu in #5273
- Fix rerun step by @yiqingy0 in #5319
- Waive L0 by @yizhang-nv in #5311
- tests: add multi nodes tests by @xinhe-nv in #5196
- feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in #5214
- [Infra]Update 5080 and 5090 case condition since we will upgrade driver by @EmmaQiaoCh in #5317
- chore: Update README.md to expose meet-up info by @juney-nvidia in #5329
- Remove duplicated test cases by @HuiGao-NV in #5323
- Add disagg slurm scripts by @qiaoxj07 in #5243
- Unwaive disaggregated serving accuracy tests by @Tabrizian in #5095
- [feat] Multi-node CI testing support via Slurm by @yuanjingx87 in #4771
- [fix][test] remove some cpp test cases from h100 by @omera-nv in #5335
- [fix][test] remove duplicate test runs by @omera-nv in #5241
- chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in #5293
- [fix][test] clear cuda cache before unittests automatically by @omera-nv in #5121
- fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in #4727
- ci: Split long running jobs into multiple jobs by @Funatiq in #5268
- [feat] Fusion finalize and allreduce for qwenmoe model by @zongfeijing in #5223
- chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in #5261
- [test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in #5125
- Waive L0 test by @yiqingy0 in #5349
- chore: bump version to 0.21.0 by @yiqingy0 in #5325
- tests: cherry-pick from main branch, add qwen3 test cases and amend test name in perf test by @ruodil in #5357
- [Infra]cherry pick sanity check yml change for 5080 and 5090 from main by @EmmaQiaoCh in #5363
- doc: cherry pick #5334 by @MartinMarciniszyn in #5368
- fix: Fix skip by mpi size fixture by @yizhang-nv in #5355
- Fix: missing clientId when serialize and deserialize response (cherry-pick #5231) by @kaiyux in #5378
- tests: fix typos in qa test by @crazydemo in #5421
- nvbugs-5331031; nvbugs-5344203 - address intermittent issues with Mistral Small multimodal for BS=8 by @brb-nv in #5453
- feat: TRTLLM-5941 Upgrade xgrammar to 0.1.18 by @Wanli-Jiang in #5364
- test: set enable_attention_dp=True in default deepseek settings by @ruodil in #5461
- tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5462
- [Infra] - Waive failed tests on release/0.21 by @EmmaQiaoCh in #5477
- Fix permission for local user issues in NGC docker container. by @MartinMarciniszyn in #5373
- [nvbug 5273941] fix: broken cyclic reference detect by @Superjomn in #5417
- [nvbug/5354825] Fix nougat test image url by @amukkara in #5496
- fix: fix regression in LOCAL_USER by @ixlmar in #5517
- doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5516
- fix: constrain grepping in docker/Makefile by @ixlmar in #5493
- [Infra][release/0.21] - waive failed tests by @EmmaQiaoCh in #5537
- ci: unwaive llmapi launch test by @Superjomn in #5281
- [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions by @ixlmar in #5490
- [cherry-pick] [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5553
- [Infra][release/0.21] Update nccl to 2.27.5 by @EmmaQiaoCh in #5539
- fix [nvbug5351244]: test_mpi_session submit sync/async by @Superjomn in #5608
- fix:https://nvbugs/5362398 by @nv-guomingz in #5609
- [nvbug 5300551] test: increase block count in eviction test by @zhengd-nv in #5465
- test: add more tests for GB200 with 8 GPUs/2 nodes in L0 tests by @yizhang-nv in #5397
- doc: Fix outdated config in DeepSeek best perf practice doc by @kaiyux in #5638
- fix: [https://nvbugs/5355219] Fix bug of Qwen3 235B CI on dgx_gb200 by @byshiue in #5602
- [https://nvbugspro.nvidia.com/bug/5351333][fix] Update to chunking calculation. by @FrankD412 in #5625
- fix: Investigate Gemma3 1B decoder output discrepancy by @brb-nv in #5564
- [Infra] - Waive failed cases on release/0.21 by @EmmaQiaoCh in #5674
- Doc: Update invalid hugging face URLs by @Linda-Stadter in #5683
- [NVBUG:5355009] Modify check for fuse_fp4_quant on SM120 by @farazkh80 in #5651
- [TRTLLM-6100] fix: Nvbug 5356427: autotuned TRTLLM Gen fp8 block scale MoE illegal memory access by @DomBrown in #5676
- [nvbug/5341178][fix] Fix OOM in Llama 4 accuracy test by @brb-nv in #5735
- test: Move some of the test from post merge to pre-merge, update dgx b200 test case by @yizhang-nv in #5640
- [5321981] fix: Fix the Llama3.1 405B hanging issue. by @hyukn in #5698
- [Infra][nvbugs/5370968] - Unwaive l0 test by @yiqingy0 in #5750
- [nvbugs/5336321][fix] Enable attention dp = False test case, Fix TRTLLM Gen Moe workspace allocation by @yizhang-nv in #5463
- [nvbug/5337601][fix] Fix disagg + speculative decoding by @Tabrizian in #5558
- [Infra] - Always use x86 image for the Jenkins agent by @chzblych in #5756
- test: fix some test failure and add llama_nemotron models in perf sanity test, add more torch cases by @ruodil in #5693
- fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in #5773
- [nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens by @LinPoly in #5201
- fix _pad_attention_dp_dummy_request by @QiJune in #5583
- Fix docker cache mount by @MartinMarciniszyn in #5763
- [nvbug/5302638][nvbugs/5310314] fix _handle_cancelled_requests by @QiJune in #5532
- cherry pick #5416 by @QiJune in #5776
- [nvbug 5304752][fix] enhance _check_arguments to filter illegal requests for pytorch backend by @LinPoly in #5541
- [nvbug5266240] chore: unwaive test_llm_with_dummy_weights by @Superjomn in #5744
- [https://nvbugspro.nvidia.com/bug/5355054] fallback to cubins for fp8 fmha kernels on Ada. by @PerkzZheng in #5779
- fix: [https://nvbugspro.nvidia.com/bug/5345215] Unwaive for bug 5345215. by @bobboli in #5606
- [nvbugs/5326453] Avoid nesting NCCL grouping in allgather OP by @QiJune in #5789
- fix: [https://nvbugs/5351130][https://nvbugs/5333654] Unwaive for bug 5351130 and 5333654. by @bobboli in #5821
- doc: Update gb200 doc by @yizhang-nv in #5840
- test: remove duplicate cases in perf sanity test by @ruodil in #5870
- [nvbug 5327706][fix] fix mgmn postprocess error by @LinPoly in #5835
- [nvbugs/5345391] fix: chunked prefill + overlap scheduling by @Funatiq in #5761
- cherry-pick: [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in #5874
- [https://nvbugs/5355316] fix: update torch.compile option to fix triton store_cubin error by @dc3671 in #5865
- test: Add Gemma3 unit tests to CI in release/0.21 by @brb-nv in #5899
- tests: Fix lora perf test by @amirkl94 in #5875
- fix: [nvbugs/5351130] Adjust DSV3-Lite tests free_gpu_memory_fraction to 0.75 to prevent OOM on CI. by @bobboli in #5896
- chore: Port leftover 0.20 by @amirkl94 in #5907
- fix [nvbug/5351244]: address remote mpi session submit by @Superjomn in #5664
- fix: [5328141] increase tolerance for test_fp8_block_scale_gemm by @nekorobov in #5849
- fix: timeout and broken pipe in disagg and worker tests by @zhengd-nv in #5827
- [nvbugs/5333742] fix MTP illegal memory access in cuda graph warmup by @lfr-0531 in #5947
- fix: fix index out of bounds error in spec decoding by @lfr-0531 in #5954
- [nvbugs/5368410][fix] Disable moe allreduce for multi node by @yizhang-nv in #5918
- [fix] Release slots with spec decode + disagg by @Tabrizian in #5975
- [TRTLLM-6495] doc: add disclaimer for 3rd party software installation. by @nv-guomingz in #6039
- [None] - Waive L0 tests by @yiqingy0 in #6082
- Cherry Pick: PR #6076 by @ZhanruiSunCh in #6088
- add release notes for 0.21 release by @QiJune in #6049
- fix: Fix triton backend build [nvbug 5396469] by @pcastonguay in #6098
- [None][infra] Cherry-pick #6128 and #6130 from main branch by @chzblych in #6151
- [Doc][Qwen3] update qwen3 into support-matrix by @byshiue in #6161
- [fix]: Revert commit 388b491 by @LinPoly in #6143
- doc: update known issues by @QiJune in #6247
- [fix] Cherry pick "[TRTLLM-6262] Fix Llama4 Scout FP4 crash issue" by @mikeiovine in #6267
- doc: update release notes by @QiJune in #6324
- test: Relax Gemma3 unit test thresholds by @brb-nv in #6016
- tests: Add llama4 functional cases by @crazydemo in #6392
- doc: update release notes by @QiJune in #6438
- [https://nvbugspro.nvidia.com/bug/5415268] fix illegal smem access with chunked attention by @PerkzZheng in #6401
- [doc] Update perf_overview.md for release 0.21 by @zbpatel in #6270
- [None][infra] Pin the version for triton to 3.3.1 (#6508) by @chzblych in #6519
New Contributors
- @jellysnack made their first contribution in #5214
Full Changelog: v0.21.0rc2...v0.21.0