v0.21.0 #6606
QiJune announced in Announcements
Replies: 0 comments
TensorRT-LLM Release 0.21.0
Key Features and Enhancements
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
- The base Docker image for the TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3 (a version smoke-test sketch follows this list).
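As a quick way to confirm that an environment built on one of the updated base images is actually running this release, the sketch below prints the installed package version and runs a one-prompt generation. It is a hedged illustration only: it assumes the documented `tensorrt_llm.LLM` / `SamplingParams` quickstart API, and the TinyLlama model name is a placeholder, not something tied to this release.

```python
# Minimal smoke-test sketch for a container built on the updated base image.
# Assumes the documented LLM API quickstart; the model name below is a placeholder.
import tensorrt_llm
from tensorrt_llm import LLM, SamplingParams


def main():
    # Confirm the wheel installed on top of the base image reports the expected release.
    print(f"tensorrt_llm version: {tensorrt_llm.__version__}")

    # Run a single short generation as a functional check.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
    for output in llm.generate(["Hello, my name is"], sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```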
API Changes
Fixed Issues
Known Issues
What's Changed
- [cherry-pick] [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5553
New Contributors
Full Changelog: v0.21.0rc2...v0.21.0
This discussion was created from the release v0.21.0.