v1.2.0rc5

Pre-release

@dc3671 released this 01 Dec 06:35 · 1100 commits to main since this release · e4c7078

Announcement Highlights

  • Vulnerability

    • Two security vulnerabilities have been identified in the urllib3 package, affecting versions >= 1.24 and < 2.6.0. These issues will be addressed in the next release. For detailed information on the vulnerabilities, refer to the following advisories:
      GHSA-gm62-xv2j-4w53
      GHSA-2xpw-w6gg-jr37
      To mitigate the issues immediately, upgrade urllib3 to version 2.6.0 or later (a version-check sketch follows these highlights).
  • Model Support

    • Slimmed down implementation of Nemotron H (#9235)
    • Add Starcoder2 support to the PyTorch backend (#8923)
    • Add MLA chunked prefill support for the DeepSeek V3.2 model (#9376)
    • AutoDeploy: Add Nemotron-Flash support (#9504)
    • AutoDeploy: Add Llama4 MoE handling (#9556)
    • Add support for nano-v3 and super-v3 with the PyTorch backend (#9261)
    • AutoDeploy: Add nano v3 support to the custom implementation (#9465)
  • API

    • Add a revision option to trtllm commands (#9498)
    • Add support for overriding environment variables in the LLM API (#9104)
    • Support the Response API for general-purpose use (#9392)
  • Feature

    • Add support for KVCache reuse for DeepSeek V3.2 (#9383)
    • Support Yarn on QwQ-32B model (#9059)
    • Update DeepGEMM to include optimizations for DeepSeek-v3.2 (#9380)
    • Use a cold L2 cache when doing autotune benchmarking (#8779)
    • Improve TRTLLM MoE throughput for small hidden sizes (#9377)
    • Add parser to layer-wise benchmarks (#9440)
    • Support custom chat template for tool calling (#9297)
    • Add draft token tree runtime on CDL (#8586)
    • Top-p optimization by removing redundant softmax (#9411)
    • Use FlashInfer's top_k_sampling_from_probs (#9457)
    • Overlap context chunks in pipeline parallel mode (#9308)
    • Improve all-to-all perf for large CP size in Helix (#9494)
    • Support more accurate AR calculation (#9323)
    • Support custom config of sharding (#9143)
    • Integrate helix parallelism (#9342)
    • Optimize the RocketKV algorithm (#9333)
    • Extend cute_dsl_nvfp4_gemm to sm103 (#9543)
    • Add chat template kwargs support to longbench-v2 (#9544)
    • Add Beam Search to TorchSampler (#8509)
    • Unify nvfp4 gemm backend (#8963)
    • Use FlashInfer.sampling by default (#9545)
    • Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (#9572)
    • Add aliases to comply with LlmArgs (#9586)
    • Update trtllm-gen nvfp4 kernels with better performance (#9510)
    • Enable CuteDSL MoE with Large EP (#9592)
    • Convert cuteDSL GEMM to an opt-in feature (#9682)
    • Update the load_weights method to include a mapping parameter (#9583)
    • Support torch compile for pipeline-parallel Llama and DeepSeekV3 (#7838)
    • Check whether the executor is shut down in the /health entrypoint (#9057)
    • Add NIXL-LIBFABRIC support (#9225)
    • Decouple disagg service from FastAPI (#8714)
    • AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
    • AutoDeploy: Draft Target Speculative Decoding (#9275)
    • AutoDeploy: Support TRTLLM Sampler (#9641)
    • AutoDeploy: Perf optimization for Attention and rmsnorm (#9719)
    • AutoDeploy: Use router gemm op for Nemotron MOE (#9500)
    • AutoDeploy: Remove redundant copies in mamba layers (#9461)
    • AutoDeploy: Add A_log fusion for Mamba layers (#9422)
    • AutoDeploy: Refactor dist ops (#9301)
  • Fix

    • Modify qwen3-next sampling stop_tokens (#9331)
    • Fix mismatched nvfp4 gemm sf shape (#9336)
    • Enhance warning in cacheTransBuffer (#9390)
    • Fix top-k outIndices with vectorized_process (#9404)
    • Let KV cache manager block initialization respect dry run (#9093)
    • Avoid cudaFree overlap with cuda graph (#9438)
    • Fix TP support for DeepSeek-V3.2 on Hopper (#9484)
    • Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
    • Correct virtual memory allocation alignment (#9491)
    • Fix view operation on uncontiguous tensor (#9576)
    • Extract GPU count from single-node stage names (#9599)
    • Refine Piecewise Cuda Graph condition for DP (#9393)
    • Enhance RPC robustness (#8711)
    • Fix synchronization bugs in KvCacheTransferManager that could cause corrupted blocks (#9056)
    • Fix dist-serving performance by clearing CPU affinity (#9549)
    • Fix wide ep MoE error (#9642)
    • Fix LoRA enablement for GPT OSS Torch (#8253)
    • Recover TRTLLM MoE performance for DEP (#9562)
    • Fix error when processing batches containing both text and multimodal data (#8381)
    • Fix deepseek_fp8_block_scales in TRTLLMGEN-MoE to use 2D x_sf instead of 1D (#9658)
    • Enable HMAC in RPC (#9745)
    • Start disagg workers and servers on free ports (#9694)
    • AutoDeploy: fix nano sharding config (#9668)
    • AutoDeploy: Remove auto-tuner from nvfp4_gemm forward (#9497)
  • Documentation

    • Fix math formula rendering issues (#9481)
    • Qwen3 deployment guide (#9488)
    • KV Connector Docs (#9325)
    • Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell (#9711)
    • Add feature docs for helix parallelism (#9684)
    • Add examples showcasing OpenAI compatible APIs (#9520)
    • Update Linux installation guide (#9485)
    • Refine the slurm examples (#9548)
    • Link to modelopt checkpoints in quick start guide (#9571)
  • Test & Infra

    • Rename AlltoAll backend names (#9329)
    • Move build config from BaseLlmArgs to TrtLlmArgs (#9249)
    • Reduce nested nvtx ranges (#9347)
    • Add disagg and wideep multi-node multi-gpu test cases (#9356)
    • Upgrade CuteDSL to 4.3.0 (#9444)
    • Use flexcache for gh200 nodes (#9405)
    • Evaluate helix parallelism with DSV3 Lite (#9597)
    • AutoDeploy: Update CUDA stream manager for multi-device (#9575)
    • Add container notices and documentation (#9185)
    • Increase warmup times in multi-gpu testing (#9578)
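
As an immediate check against the urllib3 advisories listed above, here is a minimal sketch (an illustration, not code shipped with this release) that verifies the installed urllib3 falls outside the affected range (>= 1.24, < 2.6.0):

```python
# Minimal sketch (illustration only, not part of this release):
# check whether the installed urllib3 falls in the affected range
# (>= 1.24, < 2.6.0) per GHSA-gm62-xv2j-4w53 and GHSA-2xpw-w6gg-jr37.
from importlib.metadata import version

installed = version("urllib3")
major, minor = (int(part) for part in installed.split(".")[:2])
if (1, 24) <= (major, minor) < (2, 6):
    raise SystemExit(
        f"urllib3 {installed} is affected; upgrade with: pip install 'urllib3>=2.6.0'"
    )
print(f"urllib3 {installed} is outside the affected range")
```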

What's Changed (v1.2.0rc3...v1.2.0rc4)

  • [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
  • [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
  • [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
  • [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
  • [None][ci] waive test_disagg_server_restart by @QiJune in #9326
  • [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
  • [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
  • [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
  • [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
  • [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
  • [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
  • [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
  • [None][infra] Update goggles_action repository by @karljang in #9240
  • [TRTLLM-9197][infra] Move thirdparty stuff to it's own listfile by @cheshirekow in #8986
  • [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
  • [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
  • [None][infra] Add fallback when get wheel from build stage is fail by @ZhanruiSunCh in #9290
  • [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
  • [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
  • [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
  • [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
  • [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
  • [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
  • [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
  • [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
  • [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
  • [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
  • [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
  • [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
  • [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
  • [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
  • [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
  • [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
  • [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
  • [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312

Full Changelog: v1.2.0rc3...v1.2.0rc4

What's Changed (v1.2.0rc4...v1.2.0rc5)

  • [None][ci] waive two ray tests by @Superjomn in #9375
  • [#9230][feat] Slimmed down implementation of nemotron H by @2ez4bz in #9235
  • [None][fix] modify qwen3-next sampling stop_tokens by @JadoTu in #9331
  • [None][chore] AutoDeploy: Add the Nemotron MOE to CI by @nvchenghaoz in #9328
  • [TRTLLM-9389][chore] Rename AlltoAll backend names by @bobboli in #9329
  • [TRTLLM-7963][fix] Several improvements of autotuning quality by @hyukn in #9348
  • [TRTLLM-9302][chore] Move build config from BaseLlmArgs to TrtLlmArgs by @QiJune in #9249
  • [https://nvbugs/5637012][fix] Fix helix unit tests by @brb-nv in #9369
  • [https://nvbugs/5676748][fix] Fix mismatched nvfp4 gemm sf shape. by @hyukn in #9336
  • [None][chore] Remove unnecessary log in the short tuning profile by @hyukn in #9387
  • [None][infra] Waive failed cases on main branch by @EmmaQiaoCh in #9384
  • [TRTLLM-9211][infra] Minor fixes to 3rdparty/CMakelists by @cheshirekow in #9365
  • [TRTLLM-9299][infra] Add third-party docs for python by @cheshirekow in #9366
  • [None][infra] Waive failed cases for main by @EmmaQiaoCh in #9400
  • [None][fix] enhance warning in cacheTransBuffer by @chuangz0 in #9390
  • [None][fix] Fix topk outIndices when using vectorized_process by @yweng0828 in #9404
  • [TRTLLM-7967][feat] Adding Starcoder2 PyTorch Backend Support by @yibinl-nvidia in #8923
  • [None][feat] Support Yarn on QwQ-32B model by @byshiue in #9059
  • [https://nvbugs/5685428][fix] fix test_openai_chat_multimodal.py by @QiJune in #9406
  • [TRTLLM-8777][feat] Update DeepGEMM to the latest commit to include optimizations for DeepSeek-v3.2 by @lfr-0531 in #9380
  • [None][chore] Reduce nested nvtx ranges. by @yuxianq in #9347
  • [None][chore] Remove closed bugs by @xinhe-nv in #9381
  • [None][chore] unwaive ampere kernels test by @kris1025 in #9389
  • [#9271][perf] Enable multi-stream MOE optimization in AutoDeploy by @suyoggupta in #9322
  • [#9413][fix] Minor fixes to nemotron H and custom models in AD by @2ez4bz in #9416
  • [TRTLLM-7963][feat] Cold L2 cache when doing autotune benchmarking. by @hyukn in #8779
  • [None][infra] Waive failed cases for main branch on 11/25 by @EmmaQiaoCh in #9429
  • [#8391][chore] test_perf.py to lock clocks read from gpu_configs.yml instead of max freq by @MrGeva in #9409
  • [None][ci] Move more test stages to use OCI machines by @chzblych in #9395
  • [None][feat] Improve TRTLLM MoE in small hidden size throughput cases by @rosenrodt in #9377
  • [https://nvbugs/5537996][fix] Let KV cache manager block initialization be aware whether it is doing a dry run or not by @eopXD in #9093
  • [https://nvbugs/5667922][fix] Update long context evaluation config by @baize97 in #9426
  • [http://nvbugs/5608930][fix] Mitigate test timeout issues by @Shixiaowei02 in #9445
  • [None][chore] Fix trtllm-eval for PyTorchLLM by @lfr-0531 in #9427
  • [None][feat] Add a parser to layer-wise benchmarks by @yuantailing in #9440
  • [None][feat] Support custom chat template for tool calling by @LinPoly in #9297
  • [TRTLLM-8160][feat] Add draft token tree runtime on CDL by @yweng0828 in #8586
  • [None][ci] waive a test by @Superjomn in #9458
  • [https://nvbugs/5680905][fix] Relax the MMLU accuracy requirement for DS-v3.2 by @lfr-0531 in #9439
  • [TRTLLM-8376][feat] top-p optimization (removes redundant softmax) by @ixlmar in #9411
  • [TRTLLM-9490][feat] use FlashInfer's top_k_sampling_from_probs by @ixlmar in #9457
  • [https://nvbugs/5647400] [fix] Enlarged the AllReduce workspace size to 64MB. Added AllReduce strategy to AD config. by @MrGeva in #9145
  • [TRTLLM-909][feat] Overlap context chunks in pipeline parallel mode by @Funatiq in #9308
  • [None][chore] AutoDeploy add multi stream moe pass to default.yaml by @suyoggupta in #9430
  • [https://nvbugs/5685143][fix] avoid cudaFree overlap with cuda graph by @chuangz0 in #9438
  • [None][chore] Bump version to 1.2.0rc5 by @yiqingy0 in #9455
  • [TRTLLM-8936][test] Add disagg and wideep multi-node multi-gpu test cases by @fredricz-20070104 in #9356
  • [None][ci] move some slow test cases of DGX-B200 to post merge by @QiJune in #9467
  • [TRTLLM-9293][feat] Enable partial weight loading to support streaming update weights by @shuyixiong in #9224
  • [TRTLLM-9264][fix] Add accuracy/unit tests/doc for phi4mm by @Wanli-Jiang in #9246
  • [https://nvbugs/5580099][fix] Cherry pick IMA issue fix from release/1.1 by @JunyiXu-nv in #9032
  • [None][chore] Upgrade CuteDSL to 4.3.0 by @syuoni in #9444
  • [None][feat] Support MLA chunked prefill for DeepSeek V3.2 model by @chang-l in #9376
  • [None][feat] Add environment variable to force spec-dec number of accepted tokens by @achartier in #9371
  • [None][infra] Update allowed list 2025.11.25 by @yuanjingx87 in #9468
  • [None][infra] Fail the pipeline when slurm ssh dropped by @yuanjingx87 in #9157
  • [None][feat] AutoDeploy: Remove redundant copies in mamba layers by @nvchenghaoz in #9461
  • [None][feat] AutoDeploy: Add A_log fusion for Mamba layers by @nvchenghaoz in #9422
  • [None][ci] Waive blackwell test on spec gate. by @zheyuf in #9502
  • [https://nvbugs/5608930][fix] Fix a typo by @Shixiaowei02 in #9487
  • [#9463][feat] Add revision option to trtllm commands by @achartier in #9498
  • [TRTLLM-9085][doc] fix math formula rendering issues by @QiJune in #9481
  • [None][chore] update comments in llm_args.py by @QiJune in #9472
  • [https://nvbugs/5680310][fix] Fix ctx only timed out test by @pcastonguay in #9410
  • [https://nvbugs/5547414][fix] enable case after using local cache model by @HuiGao-NV in #9473
  • [None][fix] Replace PYTORCH_CUDA_ALLOC_CONF with PYTORCH_ALLOC_CONF to fix deprecation warning by @jiaganc in #9294
  • [https://nvbugs/5698581][fix] Init draft tokens for CUDA graph dummy request by @ziyixiong-nv in #9505
  • [None][infra] Waive failed case in pre-merge on 11/27 by @EmmaQiaoCh in #9507
  • [TRTLLM-9513][docs] Qwen3 deployment guide by @lancelly in #9488
  • [None][chore] revert batch_size=1 to prevent timeout and lower accuracy reference by 0.12% as a WAR by @reasonsolo in #9447
  • [TRTLLM-9279][infra] Use flexcache for gh200 nodes since they locate in Austin by @EmmaQiaoCh in #9405
  • [https://nvbugs/5670793][fix] Solve trtllm-serve launch_disaggregated issue by @xxi-nv in #9346
  • [None][infra] Fix Slurm job script by @yuanjingx87 in #9508
  • [None][fix] change allreduce workspace dtype to torch.int64 to avoid overflow by @dc3671 in #9479
  • [None][feat] add qwen3-next CI test of accuracy on BF16 and NVFP4 by @JadoTu in #9330
  • [None][fix] fix TP support for DeepSeek-V3.2 on hopper by @lfr-0531 in #9484
  • [TRTLLM-9389][chore] Refactor AlltoallMethodType. by @bobboli in #9388
  • [https://nvbugs/5674665][chore] Add test coverage for https://nvbugspro.nvidia.com/bug/5674665 by @eopXD in #9518
  • [TRTLLM-7288][infra] Download merged waive list in slurm script by @yiqingy0 in #8999
  • [https://nvbugs/5687820][fix] Remove self.abort() in DetokenizedGenerationResult by @syuoni in #9449
  • [#9150][feat] AutoDeploy Nemotron-Flash support by @lucaslie in #9504
  • [None] [chore] Update to cutlass 4.3 by @kaiyux in #8637
  • [https://nvbugs/5637037][chore] Update waive lists. by @bobboli in #9386
  • [TRTLLM-8970][infra] Fix generate report when has isolation test result by @EmmaQiaoCh in #8861
  • [https://nvbugs/5685015][fix] Update invalid max_token test by @JunyiXu-nv in #9435
  • [None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner. by @hyukn in #9211
  • [https://nvbugs/5689658][test] Fix gpu lock issue running on cluster by @yufeiwu-nv in #9441
  • [None][chore] add spec_decoding configs in perf benchmark scripts and fix typos by @lancelly in #9533
  • [None][fix] Remove FP8 K/V buffer from TRTLLM sparse MLA attention kernel by @chang-l in #9529
  • [None] [chore] Enhancements and clean up to slurm scripts by @kaiyux in #9493
  • [None][chore] Revert "[None][fix] change allreduce workspace dtype to torch.int64 t… by @dc3671 in #9538
  • [None][infra] Waive failed cases for main branch on 11/28 by @EmmaQiaoCh in #9539
  • [None][fix] Pass checkpoint_format to create_input_processor by @Funatiq in #9521
  • [TRTLLM-9541][infra] Use artifactory mirror for download.pytorch.org by @ZhanruiSunCh in #9477
  • [TRTLLM-9488][feat] add 'disable_flashinfer_sampling' config option by @ixlmar in #9454
  • [None][infra] Waive failed case in pre-merge on 11/28 by @dominicshanshan in #9537
  • [None][perf] Helix: improve all-to-all perf for large CP size by @MatthiasKohl in #9494
  • [None][feat] support for more accurate AR calculation by @binghanc in #9323
  • [TRTLLM-9488][fix] llmapi references by @ixlmar in #9547
  • [#8948][feat] Support custom sharding config by @greg-kwasniewski1 in #9143
  • [None][chore] Weekly mass integration of release/1.1 -- rebase by @dominicshanshan in #9522
  • [TRTLLM-5971][feat] Integrate helix parallelism by @brb-nv in #9342
  • [None][infra] - Request idle time exemption for OCI jobs by @chzblych in #9528
  • [None][infra] Waive failed tests for main branch on 11/30 by @EmmaQiaoCh in #9555
  • [None][fix] Fix port conflict in disagg tests by @JunyiXu-nv in #9474
  • [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage by @chzblych in #9558
  • [None][ci] Split H100_PCIe-PyTorch-Post-Merge test stage by @chzblych in #9559
  • [TRTLLM-8958][feat] and [TRTLLM-8960]: create ConfigurableMoE and support TRTLLMGenFusedMoE as backend by @xxi-nv in #9486
  • [None] [feat] Optimize the algorithm part of RocketKV by @heyuhhh in #9333
  • [https://nvbugs/5690172][fix] Fix Qwen3-235B ATP accuracy issue with PDL by @syuoni in #9530
  • [TRTLLM-6222][feat] Extend cute_dsl_nvfp4_gemm to sm103. by @limin2021 in #9543
  • [None][fix] Correct virtual memory allocation alignment by @tongyuantongyu in #9491
  • [https://nvbugs/5684703][fix] Unwaive disagg guided decoding test by @syuoni in #9466
  • [https://nvbugs/5503479][fix] Temporarily lower reference accuracy to stabilize CI by @pengbowang-nv in #9398
  • [None][chore] remove qwen3-next accuracy tests by @JadoTu in #9534
  • [None][doc] fix mtp.py typo by @attack204 in #9307
  • [None][feat] add chat template kwargs support to longbench-v2 by @lfr-0531 in #9544
  • [#9496][fix] AutoDeploy: remove auto-tuner from nvfp4_gemm forward by @nzmora-nvidia in #9497
  • [None][fix] Replace hash method with unique_id for cutedsl MoE runners. by @hyukn in #9569
  • [None][chore] refactor disaggregated scripts to use named arguments by @dc3671 in #9581
  • [TRTLLM-6222][feat] Several perf opt for cuteDSL nvfp4 gemm by @liyuhannnnn in #9428
  • [None][chore] reduce the layers of the devel docker image by @MartinMarciniszyn in #9077
  • [https://nvbugs/5651854][infra] Enable perf metrics during accuracy testing by @Shixiaowei02 in #9140
  • [None][fix] Skip Allreduce init for Attention DP by @syuoni in #9542
  • [None][test] Waive main branch test failures 12/1 by @chzblych in #9566
  • [None][ci] Minor change for Slurm scripts by @chzblych in #9561
  • [TRTLLM-6768][infra] Fix params for not updating github status by @yiqingy0 in #6747
  • [None][infra] Update the pytest options after MI by @EmmaQiaoCh in #9579
  • [TRTLLM-6756][feat] Add Beam Search to TorchSampler by @stnie in #8509
  • [None][chore] Defer exposing context parallel configs by @brb-nv in #9552
  • [TRTC-1943][feat] Env vars override support in LLM API by @venkywonka in #9104
  • [None][feat] AutoDeploy: Use the router gemm op for nemotron MOE by @nvchenghaoz in #9500
  • [#9198][feat] Refactor dist ops in AutoDeploy by @MrGeva in #9301
  • [None][fix] Prevent YAML partial kv_cache_config from incorrectly overriding the complete kv_cache_config by @Yuening-wa in #9262
  • [TRTLLM-9085][doc] fix math formula rendering issues in github by @QiJune in #9605
  • [None][feat] Unify nvfp4 gemm backend by @Wong4j in #8963
  • [None][feat] Add support for KVCache reuse for DSv32 by @Tabrizian in #9383
  • [None][chore] Polish qwen3-next modeling code. by @nv-guomingz in #8902
  • [https://nvbugs/5703953][fix] Use random port for disagg tests by @JunyiXu-nv in #9582
  • [TRTLLM-8638][fix] Waive gb200 by @xinhe-nv in #9580
  • [FMDL-1328][feat] Add support for nano-v3 and super-v3 with pytorch backend by @Wanli-Jiang in #9261
  • [https://nvbugs/5582091][test] increase warmup times in testing for multi-gpu cases by @ruodil in #9578
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9588
  • [https://nvbugs/5702793][fix] Fix uncontiguous tensor view by @shuyixiong in #9576
  • [None][infra] Waive failed cases for main branch by @EmmaQiaoCh in #9615
  • [TRTLLM-9488][feat] use FlashInfer.sampling by default by @ixlmar in #9545
  • [None][infra] Update allowlist 2025/12/01 by @yuanjingx87 in #9616
  • [None][infra] Remove an invalid test name in waives.txt by @EmmaQiaoCh in #9620
  • [#8391][chore] Lock the gpu clocks in L0 perf tests by @MrGeva in #9585
  • [TRTLLM-9466][test] Evaluate helix parallelism with DSV3 Lite by @brb-nv in #9597
  • [None][fix] Extract GPU count from single-node stage names by @chang-l in #9599
  • [https://nvbugs/5667774][fix] Refine Piecewise Cuda Graph Condition for DP by @liji-nv in #9393
  • [TRTLLM-9144][fix] enhance RPC robustness by @Superjomn in #8711
  • [https://nvbugs/5627710][fix] Fix synchronization bugs in KvCacheTransferManager that can cause corrupted blocks by @thorjohnsen in #9056
  • [TRTLLM-8980][test] Clean up spec dec tests in test_llm_api_pytorch by @mikeiovine in #8889
  • [#9150][feat] Add code for nano v3 to custom implementation in AD by @2ez4bz in #9465
  • [#9150][feat] AutoDeploy: reviewer comments for #9150 by @lucaslie in #9527
  • [https://nvbugs/5651854][fix] Fix dist-serving perf by clearing CPU affinity by @Shixiaowei02 in #9549
  • [#9550][feat] Add NVFP4 Cutlass MoE kernels for AutoDeploy by @nzmora-nvidia in #9551
  • [TRTLLM-9547][https://nvbugs/5688388][fix] fix: Reducing num request in disagg test to speed up by @pcastonguay in #9598
  • [TRTLLM-8946][feat] Improved heuristics to detect shardable regions by @greg-kwasniewski1 in #9200
  • [#9632][feat] Support EXTRA_WHEEL_BUILD_ARGS during wheel build by @michael132 in #9633
  • [None][chore] Waive test failing on pre-merge by @brb-nv in #9638
  • [None][chore] Remove traceback dump for multimodal input processor by @chang-l in #9634
  • [None][chore] Fix trtllm-eval and move GroupedGemmInputsHelper by @syuoni in #9612
  • [https://nvbugs/5698434][fix] Use separate weight mapper for draft by @amukkara in #9607
  • [TRTLLM-7101][infra] Reuse passed tests by @yiqingy0 in #6894
  • [None][test] Remove duplicate test cases by @yufeiwu-nv in #9623
  • [None][feat] Add RocketKV usage doc and e2e accuracy test on LongBenchV2 by @heyuhhh in #9572
  • [TRTLLM-9242][doc] Add examples showcasing openai compatible APIs by @JunyiXu-nv in #9520
  • [None][chore] AutoDeploy update cuda stream manager for multi-device by @suyoggupta in #9575
  • [TRTLLM-9391][chore] Automatically estimate required workspace. by @bobboli in #9535
  • [https://nvbugs/5708475][fix] Fix e2e eval accuracy for helix parallelism by @brb-nv in #9647
  • [https://nvbugs/5561153][test] Fix log error for perf test by @fredricz-20070104 in #9622
  • [TRTLLM-8241][feat] Aliasing to comply to LlmArgs by @LinPoly in #9586
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9593
  • [TRTLLM-6842][feat] Support Response API for general purpose by @JunyiXu-nv in #9392
  • [None][test] Update Qwen3-next accuracy testing by setting the cuda … by @nv-guomingz in #9613
  • [None][feat] update trtllm-gen nvfp4 kernels with better performance by @PerkzZheng in #9510
  • [None][doc] Replace the tensorrt icon with torch icon on overview.md by @nv-guomingz in #9644
  • [https://nvbugs/5705197][chore] Unwaive timeout disagg tests by @pcastonguay in #9637
  • [https://nvbugs/5552132][fix] Enable LoRa for GPT OSS Torch by @moraxu in #8253
  • [None][fix] Fix wide ep MoE error by @Tabrizian in #9642
  • [https://nvbugs/5702795][fix] Remove the warning message for aten.log. by @nv-guomingz in #9665
  • [https://nvbugs/5693853][fix] Fix error handling when querying machin… by @galagam in #9483
  • [OMNIML-2932] [feat] nvfp4 awq support by @meenchen in #8698
  • [#9643][fix] AutoDeploy: fix nano sharding config by @lucaslie in #9668
  • [#9147][feat] AutoDeploy: Draft Target Speculative Decoding by @govind-ramnarayan in #9275
  • [None][feat] Update Qwen3CodeToolParser to align tool-calling parameters by @Wanli-Jiang in #9540
  • [TRTLLM-7181][infra] Generate test results when pytest timeout happens by @yiqingy0 in #9396
  • [TRTLLM-9522][fix] restore trtllm-serve mm_embedding_serve by @ixlmar in #9669
  • [TRTLLM-5093][infra] Write env variables to a file in the interactive debug session by @yiqingy0 in #6792
  • [None][fix] fix error when processing batches containing both text and mm data by @Nekofish-L in #8381
  • [TRTLLM-7073][feat] Support torch compile for PP for Llama and DeepSeekV3 by @liji-nv in #7838
  • [None][feat] Add weights initialization and context phase parser to layer-wise benchmarks by @yuantailing in #9667
  • [TRTLLM-8274][feat] Check if executor is shutdown in /health entrypoint by @JunyiXu-nv in #9057
  • [#8733][feat] Add Llama4 MoE handling to AutoDeploy by @tcherckez-nvidia in #9556
  • [None][ci] unwaive tests by @Superjomn in #9651
  • [None][feat] Add NIXL-LIBFABRIC support by @zackyoray in #9225
  • [None][test] rename wide ep and disagg metric name in perf test by @ruodil in #9704
  • [https://nvbugs/5467531][fix] Unwaive fused_moe all to all test with DeepEPLowLatency by @liji-nv in #9617
  • [None][fix] Recover TRTLLM MoE Perf for DEP by @rosenrodt in #9562
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9662
  • [None][fix] Fix TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS for MTP/EAGLE by @achartier in #9608
  • [None][infra] Add container notices and documentation by @pdrake-nv in #9185
  • [TRTLLM-5312][infra] Add triton trigger rules by @yiqingy0 in #6440
  • [None][doc] Add feature docs for helix parallelism by @brb-nv in #9684
  • [TRTLLM-9579][infra] Set mergeWaiveList stage UNSTABLE when there is any issue by @yiqingy0 in #9692
  • [None][doc] Added line about partial reuse by @thorjohnsen in #7846
  • [TRTLLM-8920][feat] decouple disagg service from fastapi by @reasonsolo in #8714
  • [https://nvbugs/5633340][fix] start disagg workers and servers on free ports by @reasonsolo in #9694
  • [TRTLLM-9562] [doc] Add Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell by @kaiyux in #9711
  • [#9602][feat] AutoDeploy: Support TRTLLM Sampler by @govind-ramnarayan in #9641
  • [None] [tests] Unwaive EPLB tests by @kaiyux in #9625
  • [https://nvbugs/5518713][test] Refactor core test lists by merging with llm_perf_cluster.yml by @yufeiwu-nv in #9714
  • [TRTLLM-7136][feat] Update load_weights method to include mapping parameter in checkpoint loaders by @Funatiq in #9583
  • [None][refactor] Improve request processing function in sampler by @Funatiq in #9671
  • [https://nvbugs/5670672][fix] Fix flaky KV connector tests by @jthomson04 in #9676
  • [None][infra] Update allowed list 20251204 by @yuanjingx87 in #9718
  • [None][feat] AutoDeploy: Perf optimization for Attention and rmsnorm by @nvchenghaoz in #9719
  • [None][chore] Waive flakey disagg tests by @mikeiovine in #9749
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #9594
  • [None][fix] Fix triton moe load_weight by @shuyixiong in #9649
  • [None][fix] fix a bug: deepseek_fp8_block_scales in TRTLLMGEN-MoE use 2D x_sf instead of 1D by @xxi-nv in #9658
  • [TRTLLM-9372][feat] Enable CuteDSL MoE with Large EP by @syuoni in #9592
  • [TRTLLM-9522][chore] implement default attach_multimodal_embeddings by @ixlmar in #9664
  • [TRTLLM-9660][feat] Convert cuteDSL GEMM to opt-in feature by @longlee0622 in #9682
  • [None][fix] enable hmac in RPC by @Superjomn in #9745

Full Changelog: v1.2.0rc4...v1.2.0rc5