Update on the development branch #1690
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on May 28, 2024.
This update includes:
- Supported Jais, see `examples/jais/README.md`.
- Supported DiT, see `examples/dit/README.md`.
- Supported Video NeVA, see the Video NeVA section in `examples/multimodal/README.md`.
- Supported `distil-whisper/distil-large-v3`, thanks to the contribution from @IbrahimAmin1 in [feat]: Add Option to convert and run distil-whisper large-v3 #1337.
- Migrated the Whisper example to the unified workflow (`trtllm-build` command), see documents: `examples/whisper/README.md`.
- Renamed `free_gpu_memory_fraction` in `ModelRunnerCpp` to `kv_cache_free_gpu_memory_fraction`.
- Added more options to `ModelRunnerCpp`, including `max_tokens_in_paged_kv_cache`, `kv_cache_enable_block_reuse` and `enable_chunked_context`; see the first sketch after this list.
- Removed `enable_executor` from the `tensorrt_llm.LLM` API, as it is using the C++ `Executor` API now.
- Added `OutputConfig` to the `generate` API.
- Added `BuildConfig` to the `tensorrt_llm.LLM` API; see the second sketch after this list.
- Streamlined the `LLM` construction phase and removed most of the trivial logs.
- Added `SpeculativeDecodingMode.h` to choose between different speculative decoding techniques.
- Added `SpeculativeDecodingModule.h`, a base class for speculative decoding techniques.
- Removed `decodingMode.h`.
- Updated the base Docker image for TensorRT-LLM to `nvcr.io/nvidia/pytorch:24.04-py3`.
- Updated the base Docker image for the TensorRT-LLM backend to `nvcr.io/nvidia/tritonserver:24.04-py3`.
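Below is a minimal sketch of the `ModelRunnerCpp` options mentioned above. Only the four keyword arguments named in the list come from this update; the engine path, input IDs, and `generate` arguments are placeholders for illustration.

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# Hypothetical engine directory produced by trtllm-build.
ENGINE_DIR = "/path/to/engine_dir"

runner = ModelRunnerCpp.from_dir(
    engine_dir=ENGINE_DIR,
    # Renamed in this update (was free_gpu_memory_fraction):
    kv_cache_free_gpu_memory_fraction=0.9,
    # Options newly exposed through ModelRunnerCpp:
    max_tokens_in_paged_kv_cache=4096,
    kv_cache_enable_block_reuse=True,
    enable_chunked_context=True,
)

# Illustrative generation call with pre-tokenized input IDs.
batch_input_ids = [torch.tensor([1, 2, 3], dtype=torch.int32)]
outputs = runner.generate(batch_input_ids, max_new_tokens=32, end_id=2, pad_id=2)
```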
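And a sketch of the reshaped `tensorrt_llm.LLM` API: `enable_executor` is gone (the C++ `Executor` is always used), and `BuildConfig` now travels with the `LLM` constructor. The model path and the exact `BuildConfig` fields shown are assumptions for illustration; see the LLM API examples in the repository for the authoritative signatures.

```python
from tensorrt_llm import LLM, BuildConfig

# BuildConfig is now part of the tensorrt_llm.LLM API.
build_config = BuildConfig(max_batch_size=8, max_input_len=1024)

# No enable_executor flag anymore; the LLM API uses the C++ Executor.
llm = LLM(model="/path/to/hf_model_or_engine_dir", build_config=build_config)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```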
Thanks,
The TensorRT-LLM Engineering Team