TensorRT-LLM Release 0.21.0

Key Features and Enhancements

  • Model Support
    • Added Gemma3 VLM support
  • Features
    • Added large-scale EP (expert parallelism) support
    • Integrated NIXL into the communication layer of the disaggregated service
    • Added fabric memory support for KV cache transfer
    • Added MCP support in ScaffoldingLLM
    • Added support for w4a8_mxfp4_fp8 quantization
    • Added support for fp8 rowwise quantization
    • Added generation logits support in the TRTLLM Sampler
    • Added log probs support in the TRTLLM Sampler (see the sketch after this list)
    • Optimized TRTLLM Sampler performance for the single-beam, single-step case
    • Enabled disaggregated serving for Qwen-3
    • Added EAGLE3 support for Qwen-3
    • Fused finalize and allreduce for the Qwen-MoE model
    • Refactored Fused MoE module
    • Added support for chunked attention on Blackwell and Hopper
    • Introduced sliding-window attention kernels for the generation phase on Blackwell
    • Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
    • Added FP8 block-scale GEMM support on SM89
    • Enabled the overlap scheduler between draft-model forward passes
    • Added piecewise CUDA graph support for MLA
    • Added model-agnostic one-engine EAGLE3
    • Enabled Finalize + AllReduce + Add + RMSNorm fusion
    • Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
    • Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
    • Validated Llama 3.1 models on H200 NVL
  • Benchmark
    • Added an all_reduce.py benchmark script
    • Added beam width to the trtllm-bench latency command (see the example after this list)
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
    • Enabled trtllm-bench to run LoRA and added basic end-to-end performance testing capability for LoRA
    • Added post_proc support for bench
    • Added a no_kv_cache_reuse option and streaming support for trtllm serve bench
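
The new sampler outputs can be requested through the LLM API. Below is a minimal sketch, assuming the tensorrt_llm LLM API with a `logprobs` sampling parameter and a `return_generation_logits` flag; these names follow the public docs but may differ across versions, so treat them as assumptions.

```python
# Minimal sketch: request log probs and generation logits from the
# TRTLLM Sampler via the LLM API. The parameter names `logprobs` and
# `return_generation_logits` are assumptions based on the public docs.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(
    max_tokens=32,
    temperature=0.8,
    logprobs=1,                     # top-1 log prob per generated token
    return_generation_logits=True,  # raw generation-phase logits
)

for output in llm.generate(["Hello, my name is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)           # per-token log probabilities
    print(completion.generation_logits)  # logits tensor, if requested
```

For the benchmark additions, a hypothetical trtllm-bench latency invocation with the new beam-width knob could look like the following; the `--beam_width` flag name is an assumption, so check `trtllm-bench latency --help` for the exact spelling.

```shell
# Hypothetical: beam width with the latency subcommand; the --beam_width
# flag name is an assumption based on the note above.
trtllm-bench --model meta-llama/Llama-3.1-8B latency \
    --dataset /path/to/dataset.jsonl \
    --beam_width 4
```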

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
  • The dependent public PyTorch version is updated to 2.7.1.
  • The dependent TensorRT version is updated to 10.11.
  • The dependent NVIDIA ModelOpt version is updated to 0.31.
  • The dependent NCCL version is updated to 2.27.5.
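
To move to the new base images, pull the tags listed above before rebuilding; for example:

```shell
# Updated base images for this release (exact tags from the notes above)
docker pull nvcr.io/nvidia/pytorch:25.05-py3        # TensorRT-LLM base
docker pull nvcr.io/nvidia/tritonserver:25.05-py3   # TensorRT-LLM Backend base
```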

API Changes

  • Set _AutoDeployLlmArgs as the primary config object
  • Removed the decoder request from the decoder interface
  • Enhanced the torch_compile_config in the LLM args (see the sketch after this list)
  • Removed the redundant use_kv_cache field from PytorchConfig
  • Moved allreduce_strategy from the committed API to the reference API
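
The enhanced torch_compile_config is part of the LLM args. Below is a minimal sketch, assuming it can be passed as a keyword argument to the LLM constructor; the nested field names (`enable_fullgraph`, `enable_inductor`) are hypothetical and should be checked against the LlmArgs reference for your version.

```python
# Hypothetical sketch of the enhanced torch_compile_config in LLM args.
# The nested field names below are assumptions, not the confirmed schema.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B",
    torch_compile_config={
        "enable_fullgraph": True,   # assumption: compile as a single graph
        "enable_inductor": True,    # assumption: use the TorchInductor backend
    },
)
```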

Fixed Issues

  • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
  • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
  • Fixed CUDA graph padding for speculative decoding (#4853)
  • Fixed a Llama 4 long-context issue (#4809)
  • Fixed the max_num_sequences calculation with overlap scheduling (#4532)
  • Fixed chunked prefill + overlap scheduling (#5761)
  • Fixed a trtllm-bench hang caused by LLM API IPC (#4798)
  • Fixed an index-out-of-bounds error in speculative decoding (#5954)
  • Fixed an MTP illegal memory access in CUDA graph warmup (#5947)
  • Fixed a "no free slots" error with speculative decoding + disaggregated serving (#5975)
  • Fixed an off-by-one attention window size for Gemma3 1B (#5564)

Known Issues

  • accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
  • Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
  • In 0.21, full chunked-attention support was added so that the LLaMA4 model can run functionally with sequence lengths greater than 8K. This enhancement introduces a known performance regression on Hopper that affects only the LLaMA4 model. The root cause has been identified, and the fix will be included in a future release.

Full Changelog: v0.21.0rc2...v0.21.0