Update on the development branch #2756

DanBlanaru · 2025-02-06T13:27:27Z

DanBlanaru
Feb 6, 2025
Collaborator

Hello,

The TensorRT-LLM team is pleased to announce that we have restarted the updates to the development branch (and the Triton backend) starting today.

Today's update includes the changes made with release 0.17:

Model Support
- Added InternLM-XComposer2 support. Refer to “InternLM-XComposer2” section in examples/multimodal/README.md.
Features
- Blackwell support
  - NOTE: pip installation is not supported for TRT-LLM 0.17 on Blackwell platforms only. Instead, it is recommended that the user build from source using NVIDIA NGC 25.01 PyTorch container.
  - Added support for B200.
  - Added support for GeForce RTX 50 series using Windows Subsystem for Linux (WSL) for limited models.
  - Added NVFP4 Gemm support for Llama and Mixtral models.
  - Added NVFP4 support for the LLM API and trtllm-bench command.
  - GB200 NVL is not fully supported.
  - Added benchmark script to measure perf benefits of KV cache host offload with expected runtime improvements from GH200.
- PyTorch workflow
  - The PyTorch workflow is an experimental feature in tensorrt_llm._torch. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow.
  - Added support for H100/H200/B200.
  - Added support for Llama models, Mixtral, QWen, Vila.
  - Added support for FP16/BF16/FP8/NVFP4 Gemm and fused Mixture-Of-Experts (MOE), FP16/BF16/FP8 KVCache.
  - Added custom context and decoding attention kernels support via PyTorch custom op.
  - Added support for chunked context (default off).
  - Added CudaGraph support for decoding only.
  - Added overlap scheduler support to overlap prepare inputs and model forward by decoding 1 extra token.
- Added FP8 context FMHA support for the W4A8 quantization workflow.
- Added ModelOpt quantized checkpoint support for the LLM API.
- Added support for min_p. Refer to https://arxiv.org/pdf/2407.01082.
- Thanks for the contribution from @pathorn in Use first bad_words as extra parameters, and implement min-p #1536.
- This also addresses Feature Request: Add Min-P sampling layer #1154 and Any chance to adopt min_p sampling? #1683.
- Added FP8 support for encoder-decoder models. Refer to the “FP8 Post-Training Quantization” section in examples/enc_dec/README.md.
- Added up and gate projection fusion support for LoRA modules.
- Support for DoRA. See examples/dora/README.md.
- Fall back to normal generation, if numDraftTokens == 0 in Target-Draft model speculative decoding.
API
- [BREAKING CHANGE] paged_context_fmha and fp8_context_fmha are enabled by default.
- [BREAKING CHANGE] KV cache reuse is enabled automatically when paged_context_fmha is enabled.
- [BREAKING CHANGE] tokens_per_block is set to 32 by default.
- Added --concurrency support for the throughput subcommand of trtllm-bench.
- Responses with errors now show the error text.
- Extended the details reported in the Speculative Decoding metrics. See SpeculativeDecodingMetrics.
Bug fixes
- Fixed incorrect LoRA output dimension. Thanks for the contribution from @akhoroshev in bugfix/incorrect lora out dims #2484.
- Added NVIDIA H200 GPU into the cluster_key for auto parallelism feature. ([feature request] Can we add H200 in infer_cluster_key() method? #2552)
- Fixed a typo in the __post_init__ function of LLmArgs Class. Thanks for the contribution from @topenkoff in Fix kwarg name #2691.
- Fixed workspace size issue in the GPT attention plugin. Thanks for the contribution from @AIDC-AI.
- Fixed Deepseek-V2 model accuracy.
Infrastructure changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.01-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.01-py3.
- The dependent TensorRT version is updated to 10.8.0.
- The dependent CUDA version is updated to 12.8.0.
- The dependent ModelOpt version is updated to 0.23 for Linux platform, while 0.17 is still used on Windows platform.
- Consolidated the tensorrt_llm version from various examples’ requirements.txt into a single constraints.txt file.
Known Issues
- Need --extra-index-url https://pypi.nvidia.com when running pip install tensorrt-llm due to new third-party dependencies.
- The PYPI SBSA wheel is incompatible with PyTorch 2.5.1 due to a break in the PyTorch ABI/API, as detailed in the related GitHub issue.

akhoroshev · 2025-02-06T23:34:02Z

akhoroshev
Feb 6, 2025

Fixed workspace size issue in the GPT attention plugin. Thanks for the contribution from https://github.com/AIDC-AI.

Hi! Could you please show the difference or explain the problem?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update on the development branch #2756

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Update on the development branch #2756

Uh oh!

DanBlanaru Feb 6, 2025 Collaborator

Replies: 1 comment

Uh oh!

akhoroshev Feb 6, 2025

DanBlanaru
Feb 6, 2025
Collaborator

akhoroshev
Feb 6, 2025