v1.2.0rc5

Pre-release

@dc3671 released this 01 Dec 06:35 · 1100 commits to main since this release · e4c7078

Announcement Highlights

  • Vulnerability

    • Two security vulnerabilities have been identified in the urllib3 package, affecting versions >= 1.24 and < 2.6.0. These issues will be addressed in the next release. For detailed information on the vulnerabilities, refer to the following advisories:
      GHSA-gm62-xv2j-4w53
      GHSA-2xpw-w6gg-jr37
      To mitigate the issues immediately, upgrade urllib3 to version 2.6.0 or later (a version-check sketch follows these highlights).
  • Model Support

    • Slimmed down implementation of Nemotron H (#9235)
    • Add Starcoder2 support to the PyTorch backend (#8923)
    • Add MLA chunked prefill support for the DeepSeek V3.2 model (#9376)
    • AutoDeploy: Add Nemotron-Flash support (#9504)
    • AutoDeploy: Add Llama4 MoE handling (#9556)
    • Add support for nano-v3 and super-v3 with the PyTorch backend (#9261)
    • AutoDeploy: Add nano v3 support to the custom implementation (#9465)
  • API

    • Add a revision option to trtllm commands (#9498)
    • Add support for overriding environment variables in the LLM API (#9104)
    • Support the Response API for general-purpose use (#9392)
  • Feature

    • Add support for KVCache reuse for DeepSeek V3.2 (#9383)
    • Support Yarn on QwQ-32B model (#9059)
    • Update DeepGEMM to include optimizations for DeepSeek-v3.2 (#9380)
    • Use a cold L2 cache when doing autotune benchmarking (#8779)
    • Improve TRTLLM MoE throughput for small hidden sizes (#9377)
    • Add parser to layer-wise benchmarks (#9440)
    • Support custom chat template for tool calling (#9297)
    • Add draft token tree runtime on CDL (#8586)
    • Top-p optimization by removing redundant softmax (#9411)
    • Use FlashInfer's top_k_sampling_from_probs (#9457)
    • Overlap context chunks in pipeline parallel mode (#9308)
    • Improve all-to-all perf for large CP size in Helix (#9494)
    • Support more accurate AR calculation (#9323)
    • Support custom config of sharding (#9143)
    • Integrate helix parallelism (#9342)
    • Optimize the RocketKV algorithm (#9333)
    • Extend cute_dsl_nvfp4_gemm to sm103 (#9543)
    • Add chat template kwargs support to longbench-v2 (#9544)
    • Add Beam Search to TorchSampler (#8509)
    • Unify nvfp4 gemm backend (#8963)
    • Use FlashInfer.sampling by default (#9545)
    • Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (#9572)
    • Add aliases to comply with LlmArgs (#9586)
    • Update trtllm-gen nvfp4 kernels with better performance (#9510)
    • Enable CuteDSL MoE with Large EP (#9592)
    • Convert cuteDSL GEMM to an opt-in feature (#9682)
    • Update the load_weights method to include a mapping parameter (#9583)
    • Support torch compile for pipeline-parallel Llama and DeepSeekV3 (#7838)
    • Check whether the executor is shut down in the /health entrypoint (#9057)
    • Add NIXL-LIBFABRIC support (#9225)
    • Decouple disagg service from FastAPI (#8714)
    • AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
    • AutoDeploy: Draft Target Speculative Decoding (#9275)
    • AutoDeploy: Support TRTLLM Sampler (#9641)
    • AutoDeploy: Perf optimization for Attention and rmsnorm (#9719)
    • AutoDeploy: Use router gemm op for Nemotron MOE (#9500)
    • AutoDeploy: Remove redundant copies in mamba layers (#9461)
    • AutoDeploy: Add A_log fusion for Mamba layers (#9422)
    • AutoDeploy: Refactor dist ops (#9301)
  • Fix

    • Modify qwen3-next sampling stop_tokens (#9331)
    • Fix mismatched nvfp4 gemm sf shape (#9336)
    • Enhance warning in cacheTransBuffer (#9390)
    • Fix top-k outIndices with vectorized_process (#9404)
    • Let KV cache manager block initialization respect dry run (#9093)
    • Avoid cudaFree overlap with cuda graph (#9438)
    • Fix TP support for DeepSeek-V3.2 on Hopper (#9484)
    • Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
    • Correct virtual memory allocation alignment (#9491)
    • Fix view operation on uncontiguous tensor (#9576)
    • Extract GPU count from single-node stage names (#9599)
    • Refine Piecewise Cuda Graph condition for DP (#9393)
    • Enhance RPC robustness (#8711)
    • Fix synchronization bugs in KvCacheTransferManager that could cause corrupted blocks (#9056)
    • Fix dist-serving performance by clearing CPU affinity (#9549)
    • Fix wide ep MoE error (#9642)
    • Fix LoRA enablement for GPT OSS Torch (#8253)
    • Recover TRTLLM MoE performance for DEP (#9562)
    • Fix error when processing batches containing both text and multimodal data (#8381)
    • Fix deepseek_fp8_block_scales in TRTLLMGEN-MoE to use 2D x_sf instead of 1D (#9658)
    • Enable HMAC in RPC (#9745)
    • Start disagg workers and servers on free ports (#9694)
    • AutoDeploy: fix nano sharding config (#9668)
    • AutoDeploy: Remove auto-tuner from nvfp4_gemm forward (#9497)
  • Documentation

    • Fix math formula rendering issues (#9481)
    • Qwen3 deployment guide (#9488)
    • KV Connector Docs (#9325)
    • Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell (#9711)
    • Add feature docs for helix parallelism (#9684)
    • Add examples showcasing OpenAI compatible APIs (#9520)
    • Update Linux installation guide (#9485)
    • Refine the slurm examples (#9548)
    • Link to modelopt checkpoints in quick start guide (#9571)
  • Test & Infra

    • Rename AlltoAll backend names (#9329)
    • Move build config from BaseLlmArgs to TrtLlmArgs (#9249)
    • Reduce nested nvtx ranges (#9347)
    • Add disagg and wideep multi-node multi-gpu test cases (#9356)
    • Upgrade CuteDSL to 4.3.0 (#9444)
    • Use flexcache for gh200 nodes (#9405)
    • Evaluate helix parallelism with DSV3 Lite (#9597)
    • AutoDeploy: Update CUDA stream manager for multi-device (#9575)
    • Add container notices and documentation (#9185)
    • Increase warmup times in multi-gpu testing (#9578)
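
As an immediate check against the urllib3 advisories listed above, here is a minimal sketch (an illustration, not code shipped with this release) that verifies the installed urllib3 falls outside the affected range (>= 1.24, < 2.6.0):

```python
# Minimal sketch (illustration only, not part of this release):
# check whether the installed urllib3 falls in the affected range
# (>= 1.24, < 2.6.0) per GHSA-gm62-xv2j-4w53 and GHSA-2xpw-w6gg-jr37.
from importlib.metadata import version

installed = version("urllib3")
major, minor = (int(part) for part in installed.split(".")[:2])
if (1, 24) <= (major, minor) < (2, 6):
    raise SystemExit(
        f"urllib3 {installed} is affected; upgrade with: pip install 'urllib3>=2.6.0'"
    )
print(f"urllib3 {installed} is outside the affected range")
```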

What's Changed (v1.2.0rc3...v1.2.0rc4)

  • [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
  • [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
  • [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
  • [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
  • [None][ci] waive test_disagg_server_restart by @QiJune in #9326
  • [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
  • [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
  • [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
  • [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
  • [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
  • [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
  • [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
  • [None][infra] Update goggles_action repository by @karljang in #9240
  • [TRTLLM-9197][infra] Move thirdparty stuff to it's own listfile by @cheshirekow in #8986
  • [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
  • [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
  • [None][infra] Add fallback when get wheel from build stage is fail by @ZhanruiSunCh in #9290
  • [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
  • [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
  • [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
  • [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
  • [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
  • [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
  • [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
  • [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
  • [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
  • [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
  • [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
  • [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
  • [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
  • [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
  • [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
  • [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
  • [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
  • [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312

Full Changelog: v1.2.0rc3...v1.2.0rc4

What's Changed (v1.2.0rc4...v1.2.0rc5)

  • [None][ci] waive two ray tests by @Superjomn in #9375
  • [#9230][feat] Slimmed down implementation of nemotron H by @2ez4bz in #9235
  • [None][fix] modify qwen3-next sampling stop_tokens by @JadoTu in #9331
  • [None][chore] AutoDeploy: Add the Nemotron MOE to CI by @nvchenghaoz in #9328
  • [TRTLLM-9389][chore] Rename AlltoAll backend names by @bobboli in #9329
  • [TRTLLM-7963][fix] Several improvements of autotuning quality by @hyukn in #9348
  • [TRTLLM-9302][chore] Move build config from BaseLlmArgs to TrtLlmArgs by @QiJune in #9249
  • [https://nvbugs/5637012][fix] Fix helix unit tests by @brb-nv in #9369
  • [https://nvbugs/5676748][fix] Fix mismatched nvfp4 gemm sf shape. by @hyukn in #9336
  • [None][chore] Remove unnecessary log in the short tuning profile by @hyukn in #9387
  • [None][infra] Waive failed cases on main branch by @EmmaQiaoCh in #9384
  • [TRTLLM-9211][infra] Minor fixes to 3rdparty/CMakelists by @cheshirekow in #9365
  • [TRTLLM-9299][infra] Add third-party docs for python by @cheshirekow in #9366
  • [None][infra] Waive failed cases for main by @EmmaQiaoCh in #9400
  • [None][fix] enhance warning in cacheTransBuffer by @chuangz0 in #9390
  • [None][fix] Fix topk outIndices when using vectorized_process by @yweng0828 in #9404
  • [TRTLLM-7967][feat] Adding Starcoder2 PyTorch Backend Support by @yibinl-nvidia in #8923
  • [None][feat] Support Yarn on QwQ-32B model by @byshiue in #9059
  • [https://nvbugs/5685428][fix] fix test_openai_chat_multimodal.py by @QiJune in #9406
  • [TRTLLM-8777][feat] Update DeepGEMM to the latest commit to include optimizations for DeepSeek-v3.2 by @lfr-0531 in #9380
  • [None][chore] Reduce nested nvtx ranges. by @yuxianq in #9347
  • [None][chore] Remove closed bugs by @xinhe-nv in #9381
  • [None][chore] unwaive ampere kernels test by @kris1025 in #9389
  • [#9271][perf] Enable multi-stream MOE optimization in AutoDeploy by @suyoggupta in #9322
  • [#9413][fix] Minor fixes to nemotron H and custom models in AD by @2ez4bz in #9416
  • [TRTLLM-7963][feat] Cold L2 cache when doing autotune benchmarking. by @hyukn in #8779
  • [None][infra] Waive failed cases for main branch on 11/25 by @EmmaQiaoCh in #9429
  • [#8391][chore] test_perf.py to lock clocks read from gpu_configs.yml instead of max freq by @MrGeva in #9409
  • [None][ci] Move more test stages to use OCI machines by @chzblych in #9395
  • [None][feat] Improve TRTLLM MoE in small hidden size throughput cases by @rosenrodt in #9377
  • [https://nvbugs/5537996][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not by @eopXD in #9093
  • [https://nvbugs/5667922][fix] Update long context evaluation config by @baize97 in #9426
  • [http://nvbugs/5608930][fix] Mitigate test timeout issues by @Shixiaowei02 in #9445
  • [None][chore] Fix trtllm-eval for PyTorchLLM by @lfr-0531 in #9427
  • [None][feat] Add a parser to layer-wise benchmarks by @yuantailing in #9440
  • [None][feat] Support custom chat template for tool calling by @LinPoly in #9297
  • [TRTLLM-8160][feat] Add draft token tree runtime on CDL by @yweng0828 in #8586
  • [None][ci] waive a test by @Superjomn in #9458
  • [https://nvbugs/5680905][fix] Relax the MMLU accuracy requirement for DS-v3.2 by @lfr-0531 in #9439
  • [TRTLLM-8376][feat] top-p optimization (removes redundant softmax) by @ixlmar in #9411
  • [TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs by @ixlmar in #9457
  • [https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. by @MrGeva in #9145
  • [TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode by @Funatiq in #9308
  • [None][chore] AutoDeploy add multi stream moe pass to default.yaml by @suyoggupta in #9430
  • [https://nvbugs/5685143][fix] avoid cudaFree overlap with cuda graph by @chuangz0 in #9438
  • [None][chore] Bump version to 1.2.0rc5 by @yiqingy0 in #9455
  • [TRTLLM-8936][test] Add disagg and wideep multi-node multi-gpu test cases by @fredricz-20070104 in #9356
  • [None][ci] move some slow test cases of DGX-B200 to post merge by @QiJune in #9467
  • [TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights by @shuyixiong in #9224
  • [TRTLLM-9264][fix] Add accuracy/unit tests/doc for phi4mm by @Wanli-Jiang in #9246
  • [https://nvbugs/5580099][fix] Cherry pick IMA issue fix from release/1.1 by @JunyiXu-nv in #9032
  • [None][chore] Upgrade CuteDSL to 4.3.0 by @syuoni in #9444
  • [None][feat] Support MLA chunked prefill for DeepSeek V3.2 model by @chang-l in #9376
  • [None][feat] Add environment variable to force spec-dec number of accepted tokens by @achartier in #9371
  • [None][infra] Update allowed list 2025.11.25 by @yuanjingx87 in #9468
  • [None][infra] Fail the pipeline when slurm ssh dropped by @yuanjingx87 in #9157
  • [None][feat] AutoDeploy: Remove redundant copies in mamba layers by @nvchenghaoz in #9461
  • [None][feat] AutoDeploy: Add A_log fusion for Mamba layers by @nvchenghaoz in #9422
  • [None][ci] Waive blackwell test on spec gate. by @zheyuf in #9502
  • [https://nvbugs/5608930][fix] Fix a typo by @Shixiaowei02 in #9487
  • [#9463][feat] Add revision option to trtllm commands by @achartier in #9498
  • [TRTLLM-9085][doc] fix math formula rendering issues by @QiJune in #9481
  • [None][chore] update comments in llm_args.py by @QiJune in #9472
  • [https://nvbugs/5680310][fix] Fix ctx only timed out test by @pcastonguay in #9410
  • [https://nvbugs/5547414][fix] enable case after using local cache model by @HuiGao-NV in #9473
  • [None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning by @jiaganc in #9294
  • [https://nvbugs/5698581][fix] Init draft tokens for CUDA graph dummy request by @ziyixiong-nv in #9505
  • [None][infra] Waive failed case in pre-merge on 11/27 by @EmmaQiaoCh in #9507
  • [TRTLLM-9513][docs] Qwen3 deployment guide by @lancelly in #9488
  • [None][chore] revert batch_size=1 to prevent timeout and lower accuracy reference by 0.12% as a WAR by @reasonsolo in #9447
  • [TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin by @EmmaQiaoCh in #9405
  • [https://nvbugs/5670793][fix] Solve trtllm-serve launch_disaggregated issue by @xxi-nv in #9346
  • [None][infra] Fix Slurm job script by @yuanjingx87 in #9508
  • [None][fix] change allreduce workspace dtype to torch.int64 to avoid overflow by @dc3671 in #9479
  • [None][feat] add qwen3-next CI test of accuracy on BF16 and NVFP4 by @JadoTu in #9330
  • [None][fix] fix TP support for DeepSeek-V3.2 on hopper by @lfr-0531 in #9484
  • [TRTLLM-9389][chore] Refactor AlltoallMethodType. by @bobboli in #9388
  • [https://nvbugs/5674665][chore] Add test coverage for https://nvbugspro.nvidia.com/bug/5674665 by @eopXD in #9518
  • [TRTLLM-7288][infra] Download merged waive list in slurm script by @yiqingy0 in #8999
  • [https://nvbugs/5687820][fix] Remove self.abort() in DetokenizedGenerationResult by @syuoni in #9449
  • [#9150][feat] AutoDeploy Nemotron-Flash support by @lucaslie in #9504
  • [None] [chore] Update to cutlass 4.3 by @kaiyux in #8637
  • [https://nvbugs/5637037][chore] Update waive lists. by @bobboli in #9386
  • [TRTLLM-8970][infra] Fix generate report when has isolation test result by @EmmaQiaoCh in #8861
  • [https://nvbugs/5685015][fix] Update invalid max_token test by @JunyiXu-nv in #9435
  • [None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner. by @hyukn in #9211
  • [https://nvbugs/5689658][test] Fix gpu lock issue running on cluster by @yufeiwu-nv in #9441
  • [None][chore] add spec_decoding configs in perf benchmark scripts and fix typos by @lancelly in #9533
  • [None][fix] Remove FP8 K/V buffer from TRTLLM sparse MLA attention kernel by @chang-l in #9529
  • [None] [chore] Enhancements and clean up to slurm scripts by @kaiyux in #9493
  • [None][chore] Revert "[None][fix] change allreduce workspace dtype to torch.int64 t… by @dc3671 in #9538
  • [None][infra] Waive failed cases for main branch on 11/28 by @EmmaQiaoCh in #9539
  • [None][fix] Pass checkpoint_format to create_input_processor by @Funatiq in #9521
  • [TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org by @ZhanruiSunCh in #9477
  • [TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option by @ixlmar in #9454
  • [None][infra] Waive failed case in pre-merge on 11/28 by @dominicshanshan in #9537
  • [None][perf] Helix: improve all-to-all perf for large CP size by @MatthiasKohl in #9494
  • [None][feat] support for more accurate AR calculation by @binghanc in #9323
  • [TRTLLM-9488][fix] llmapi references by @ixlmar in #9547
  • [#8948][feat] Support custom sharding config by @greg-kwasniewski1 in #9143
  • [None][chore] Weekly mass integration of release/1.1 -- rebase by @dominicshanshan in #9522
  • [TRTLLM-5971][feat] Integrate helix parallelism by @brb-nv in #9342
  • [None][infra] - Request idle time exemption for OCI jobs by @chzblych in #9528
  • [None][infra] Waive failed tests for main branch on 11/30 by @EmmaQiaoCh in #9555
  • [None][fix] Fix port conflict in disagg tests by @JunyiXu-nv in #9474
  • [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage by @chzblych in #9558
  • [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage by @chzblych in #9559
  • [TRTLLM-8958][feat] and [TRTLLM-8960]: create ConfigurableMoE and support TRTLLMGenFusedMoE as backend by @xxi-nv in #9486
  • [None] [feat] Optimize the algorithm part of RocketKV by @heyuhhh in #9333
  • [https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL by @syuoni in #9530
  • [TRTLLM-6222][feat] Extend cute_dsl_nvfp4_gemm to sm103. by @limin2021 in #9543
  • [None][fix] Correct virtual memory allocation alignment by @tongyuantongyu in #9491
  • [https://nvbugs/5684703][fix] Unwaive disagg guided decoding test by @syuoni in #9466
  • [https://nvbugs/5503479][fix] Temporarily lower reference accuracy to stabilize CI by @pengbowang-nv in #9398
  • [None][chore] remove qwen3-next accuracy tests by @JadoTu in #9534
  • [None][doc] fix mtp.py typo by @attack204 in #9307
  • [None][feat] add chat template kwargs support to longbench-v2 by @lfr-0531 in #9544
  • [#9496][fix] AutoDeploy: remove auto-tuner from nvfp4_gemm forward by @nzmora-nvidia in #9497
  • [None][fix] Replace hash method with unique_id for cutedsl MoE runners. by @hyukn in #9569
  • [None][chore] refactor disaggregated scripts to use named arguments by @dc3671 in #9581
  • [TRTLLM-6222][feat] Several perf opt for cuteDSL nvfp4 gemm by @liyuhannnnn in #9428
  • [None][chore] reduce the layers of the devel docker image by @MartinMarciniszyn in #9077
  • [https://nvbugs/5651854][infra] Enable perf metrics during accuracy testing by @Shixiaowei02 in #9140
  • [None][fix] Skip Allreduce init for Attention DP by @syuoni in #9542
  • [None][test] Waive main branch test failures 12/1 by @chzblych in #9566
  • [None][ci] Minor change for Slurm scripts by @chzblych in #9561
  • [TRTLLM-6768][infra] Fix params for not updating github status by @yiqingy0 in #6747
  • [None][infra] Update the pytest options after MI by @EmmaQiaoCh in #9579
  • [TRTLLM-6756][feat] Add Beam Search to TorchSampler by @stnie in #8509
  • [None][chore] Defer exposing context parallel configs by @brb-nv in #9552
  • [TRTC-1943][feat] Env vars override support in LLM API by @venkywonka in #9104
  • [None][feat] AutoDeploy: Use the router gemm op for nemotron MOE by @nvchenghaoz in #9500
  • [#9198][feat] Refactor dist ops in AutoDeploy by @MrGeva in #9301
  • [None][fix] Prevent YAML partial kv_cache_config from incorrectly overriding the complete kv_cache_config by @Yuening-wa in #9262
  • [TRTLLM-9085][doc] fix math formula rendering issues in github by @QiJune in #9605
  • [None][feat] Unify nvfp4 gemm backend by @Wong4j in #8963
  • [None][feat] Add support for KVCache reuse for DSv32 by @Tabrizian in #9383
  • [None][chore] Polish qwen3-next modeling code. by @nv-guomingz in #8902
  • [https://nvbugs/5703953][fix] Use random port for disagg tests by @JunyiXu-nv in #9582
  • [TRTLLM-8638][fix] Waive gb200 by @xinhe-nv in #9580
  • [FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend by @Wanli-Jiang in #9261
  • [https://nvbugs/5582091][test] increase warmup times in testing for multi-gpu cases by @ruodil in #9578
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9588
  • [https://nvbugs/5702793][fix] Fix uncontiguous tensor view by @shuyixiong in #9576
  • [None][infra] Waive failed cases for main branch by @EmmaQiaoCh in #9615
  • [TRTLLM-9488][feat] use FlashInfer.sampling by default by @ixlmar in #9545
  • [None][infra] Update allowlist 2025/12/01 by @yuanjingx87 in #9616
  • [None][infra] Remove an invalid test name in waives.txt by @EmmaQiaoCh in #9620
  • [#8391][chore] Lock the gpu clocks in L0 perf tests by @MrGeva in #9585
  • [TRTLLM-9466][test] Evaluate helix parallelism with DSV3 Lite by @brb-nv in #9597
  • [None][fix] Extract GPU count from single-node stage names by @chang-l in #9599
  • [https://nvbugs/5667774][fix] Refine Piecewise Cuda Graph Condition for DP by @liji-nv in #9393
  • [TRTLLM-9144][fix] enhance RPC robustness by @Superjomn in #8711
  • [https://nvbugs/5627710][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks by @thorjohnsen in #9056
  • [TRTLLM-8980][test] Clean up spec dec tests in test_llm_api_pytorch by @mikeiovine in #8889
  • [#9150][feat] Add code for nano v3 to custom implementation in AD by @2ez4bz in #9465
  • [#9150][feat] AutoDeploy: reviewer comments for #9150 by @lucaslie in #9527
  • [https://nvbugs/5651854][fix] Fix dist-serving perf by clearing CPU affinity by @Shixiaowei02 in #9549
  • [#9550][feat] Add NVFP4 Cutlass MoE kernels for AutoDeploy by @nzmora-nvidia in #9551
  • [TRTLLM-9547][https://nvbugs/5688388][fix] fix: Reducing num request in disagg test to speed up by @pcastonguay in #9598
  • [TRTLLM-8946][feat] Improved heuristics to detect shardable regions by @greg-kwasniewski1 in #9200
  • [#9632][feat] Support EXTRA_WHEEL_BUILD_ARGS during wheel build by @michael132 in #9633
  • [None][chore] Waive test failing on pre-merge by @brb-nv in #9638
  • [None][chore] Remove traceback dump for multimodal input processor by @chang-l in #9634
  • [None][chore] Fix trtllm-eval and move GroupedGemmInputsHelper by @syuoni in #9612
  • [https://nvbugs/5698434][fix] Use separate weight mapper for draft by @amukkara in #9607
  • [TRTLLM-7101][infra] Reuse passed tests by @yiqingy0 in #6894
  • [None][test] Remove duplicate test cases by @yufeiwu-nv in #9623
  • [None][feat] Add RocketKV usage doc and e2e accuracy test on LongBenchV2 by @heyuhhh in #9572
  • [TRTLLM-9242][doc] Add examples showcasing openai compatible APIs by @JunyiXu-nv in #9520
  • [None][chore] AutoDeploy update cuda stream manager for multi-device by @suyoggupta in #9575
  • [TRTLLM-9391][chore] Automatically estimate required workspace. by @bobboli in #9535
  • [https://nvbugs/5708475][fix] Fix e2e eval accuracy for helix parallelism by @brb-nv in #9647
  • [https://nvbugs/5561153][test] Fix log error for perf test by @fredricz-20070104 in #9622
  • [TRTLLM-8241][feat] Aliasing to comply to LlmArgs by @LinPoly in #9586
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9593
  • [TRTLLM-6842][feat] Support Response API for general purpose by @JunyiXu-nv in #9392
  • [None][test] Update Qwen3-next accuracy testing by setting the cuda … by @nv-guomingz in #9613
  • [None][feat] update trtllm-gen nvfp4 kernels with better performance by @PerkzZheng in #9510
  • [None][doc] Replace the tensorrt icon with torch icon on overview.md by @nv-guomingz in #9644
  • [https://nvbugs/5705197][chore] Unwaive timeout disagg tests by @pcastonguay in #9637
  • [https://nvbugs/5552132][fix] Enable LoRa for GPT OSS Torch by @moraxu in #8253
  • [None][fix] Fix wide ep MoE error by @Tabrizian in #9642
  • [https://nvbugs/5702795][fix] Remove the warning message for aten.log. by @nv-guomingz in #9665
  • [https://nvbugs/5693853][fix] Fix error handling when querying machin… by @galagam in #9483
  • [OMNIML-2932] [feat] nvfp4 awq support by @meenchen in #8698
  • [#9643][fix] AutoDeploy: fix nano sharding config by @lucaslie in #9668
  • [#9147][feat] AutoDeploy: Draft Target Speculative Decoding by @govind-ramnarayan in #9275
  • [None][feat] Update Qwen3CodeToolParser to align tool-calling parameters by @Wanli-Jiang in #9540
  • [TRTLLM-7181][infra] Generate test results when pytest timeout happens by @yiqingy0 in #9396
  • [TRTLLM-9522][fix] restore trtllm-serve mm_embedding_serve by @ixlmar in #9669
  • [TRTLLM-5093][infra] Write env variables to a file in the interactive debug session by @yiqingy0 in #6792
  • [None][fix] fix error when processing batches containing both text and mm data by @Nekofish-L in #8381
  • [TRTLLM-7073][feat] Support torch compile for PP for Llama and DeepSeekV3 by @liji-nv in #7838
  • [None][feat] Add weights initialization and context phase parser to layer-wise benchmarks by @yuantailing in #9667
  • [TRTLLM-8274][feat] Check if executor is shutdown in /health entrypoint by @JunyiXu-nv in #9057
  • [#8733][feat] Add Llama4 MoE handling to AutoDeploy by @tcherckez-nvidia in #9556
  • [None][ci] unwaive tests by @Superjomn in #9651
  • [None][feat] Add NIXL-LIBFABRIC support by @zackyoray in #9225
  • [None][test] rename wide ep and disagg metric name in perf test by @ruodil in #9704
  • [https://nvbugs/5467531][fix] Unwaive fused_moe all to all test with DeepEPLowLatency by @liji-nv in #9617
  • [None][fix] Recover TRTLLM MoE Perf for DEP by @rosenrodt in #9562
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9662
  • [None][fix] Fix TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS for MTP/EAGLE by @achartier in #9608
  • [None][infra] Add container notices and documentation by @pdrake-nv in #9185
  • [TRTLLM-5312][infra] Add triton trigger rules by @yiqingy0 in #6440
  • [None][doc] Add feature docs for helix parallelism by @brb-nv in #9684
  • [TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue by @yiqingy0 in #9692
  • [None][doc] Added line about partial reuse by @thorjohnsen in #7846
  • [TRTLLM-8920][feat] decouple disagg service from fastapi by @reasonsolo in #8714
  • [https://nvbugs/5633340][fix] start disagg workers and servers on free ports by @reasonsolo in #9694
  • [TRTLLM-9562] [doc] Add Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell by @kaiyux in #9711
  • [#9602][feat] AutoDeploy: Support TRTLLM Sampler by @govind-ramnarayan in #9641
  • [None] [tests] Unwaive EPLB tests by @kaiyux in #9625
  • [https://nvbugs/5518713][test] Refactor core test lists by merging with llm_perf_cluster.yml by @yufeiwu-nv in #9714
  • [TRTLLM-7136][feat] Update load_weights method to include mapping parameter in checkpoint loaders by @Funatiq in #9583
  • [None][refactor] Improve request processing function in sampler by @Funatiq in #9671
  • [https://nvbugs/5670672][fix] Fix flaky KV connector tests by @jthomson04 in #9676
  • [None][infra] Update allowed list 20251204 by @yuanjingx87 in #9718
  • [None][feat] AutoDeploy: Perf optimization for Attention and rmsnorm by @nvchenghaoz in #9719
  • [None][chore] Waive flakey disagg tests by @mikeiovine in #9749
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #9594
  • [None][fix] Fix triton moe load_weight by @shuyixiong in #9649
  • [None][fix] fix a bug: deepseek_fp8_block_scales in TRTLLMGEN-MoE use 2D x_sf instead of 1D by @xxi-nv in #9658
  • [TRTLLM-9372][feat] Enable CuteDSL MoE with Large EP by @syuoni in #9592
  • [TRTLLM-9522][chore] implement default attach_multimodal_embeddings by @ixlmar in #9664
  • [TRTLLM-9660][feat] Convert cuteDSL GEMM to opt-in feature by @longlee0622 in #9682
  • [None][fix] enable hmac in RPC by @Superjomn in #9745

Full Changelog: v1.2.0rc4...v1.2.0rc5