Update on the development branch #2009

kaiyux · 2024-07-23T15:09:48Z

Hi,

The TensorRT-LLM team is pleased to announce that we pushed an update to the development branch (and the Triton backend) on July 23, 2024.

This update includes:

Model Support
- Added support for the LLaMA 3.1 model.
- Added support for the Qwen 2 model.
Features
- Added support for the gelu_pytorch_tanh activation function, thanks to the contribution from @ttim in "Support gelu_pytorch_tanh activation function" (#1897).
- Added the chunk_length parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in "add chunk_length parameter to Whisper" (#1909).
API
- [BREAKING CHANGE] The use_custom_all_reduce argument has been removed from trtllm-build.
- [BREAKING CHANGE] The multi_block_mode argument has been moved from the build stage (trtllm-build and the builder API) to the runtime.
Bug fixes
- Fixed the wrong pad token for the CodeQwen models. (See "[Feature] quantize_by_modelopt.py get_tokenizer is not suitable for CodeQwen1.5 7B Chat", #1953.)
- Fixed a typo in cluster_infos defined in tensorrt_llm/auto_parallel/cluster_info.py, thanks to the contribution from @saeyoonoh in "fix auto parallel cluster info typo" (#1987).
Infra
- Updated the TensorRT dependency to version 10.2.
Documentation
- Removed duplicated flags in the command at docs/source/reference/troubleshooting.md, thanks to the contribution from @hattizai in "chore: remove duplicate flag" (#1937).
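
For readers unfamiliar with the gelu_pytorch_tanh activation listed under Features: the name conventionally refers to the tanh approximation of GELU (the variant PyTorch exposes through `approximate="tanh"`). The following is a minimal pure-Python sketch of that formula next to the exact erf-based GELU, for illustration only. It is not the TensorRT-LLM implementation, and the function names `gelu_exact` and `gelu_tanh` are hypothetical.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (illustrative, not the library's code)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Print both variants side by side to show how closely they agree.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two variants agree closely over typical activation ranges, but they are not bit-identical, so a converted checkpoint should use whichever variant the original model was trained with.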

We are also working on an update to the Llama FP8 code, which we expect to push today or tomorrow. The current code works, but the checkpoint converter needs to be updated.

Thanks,
The TensorRT-LLM Engineering Team