TensorRT-LLM 0.9.0 Release
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
- Model Support
- Support distil-whisper, thanks to the contribution from @Bhuvanesh09 in PR #1061
- Support HuggingFace StarCoder2
- Support VILA
- Support Smaug-72B-v0.1
- Migrate BLIP-2 examples to examples/multimodal
- Features
- [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
- [BREAKING CHANGE] Support embedding sharing for Gemma
- Add support to context chunking to work with KV cache reuse
- Enable different rewind tokens per sequence for Medusa
- BART LoRA support (limited to the Python runtime)
- Enable multi-LoRA for BART LoRA
- Support early_stopping=False in beam search for C++ Runtime
- Add logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional)
- Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147
- Support loading Gemma from HuggingFace
- Support auto parallelism planner for high-level API and unified builder workflow
- Support running GptSession without OpenMPI #1220
- Medusa IFB support
- [Experimental] Support FP8 FMHA, note that the performance is not optimal, and we will keep optimizing it
- More head sizes support for LLaMA-like models
- Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) now all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256].
- OOTB functionality support
- T5
- Mixtral 8x7B
- API
- C++ executor API
  - Add Python bindings, see documentation and examples in examples/bindings
  - Add advanced and multi-GPU examples for Python binding of executor C++ API, see examples/bindings/README.md
  - Add documents for C++ executor API, see docs/source/executor.md
- High-level API (refer to examples/high-level-api/README.md for guidance)
  - [BREAKING CHANGE] Reuse the QuantConfig used in trtllm-build tool, support broader quantization features
  - Support in LLM() API to accept engines built by trtllm-build command
  - Add support for TensorRT-LLM checkpoint as model input
  - Refine SamplingConfig used in LLM.generate or LLM.generate_async APIs, with the support of beam search, a variety of penalties, and more features
  - Add support for the StreamingLLM feature, enable it by setting LLM(streaming_llm=...)
  - Migrate Mixtral to high level API and unified builder workflow
- [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands
- [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
- [BREAKING CHANGE] Refactor GPT with unified building workflow, see examples/gpt/README.md for the latest commands
- [BREAKING CHANGE] Moved all the LoRA related flags from the convert_checkpoint.py script and the checkpoint content to the trtllm-build command, to generalize the feature better to more models
- [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from convert_checkpoint.py script and the checkpoint content, to generalize the feature better to more models. Use trtllm-build --max_prompt_embedding_table_size instead.
- [BREAKING CHANGE] Changed the trtllm-build --world_size flag to --auto_parallel flag, the option is used for auto parallel planner only.
- [BREAKING CHANGE] AsyncLLMEngine is removed. The tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py
- [BREAKING CHANGE] examples/server is removed, see examples/app instead.
- [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
- [BREAKING CHANGE] Simplify Qwen convert checkpoint script
- [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
- Bug fixes
- Fix a weight-only quant bug for Whisper to make sure that the encoder_input_len_range is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
- Fix the issue that log probabilities in Python runtime are not returned #983
- Multi-GPU fixes for multimodal examples #1003
- Fix wrong end_id issue for Qwen #987
- Fix a non-stopping generation issue #1118 #1123
- Fix wrong link in examples/mixtral/README.md #1181
- Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled #967
- Fix wrong head_size when importing Gemma model from HuggingFace Hub, thanks for the contribution from @mfuntowicz in #1148
- Fix ChatGLM2-6B building failure on INT8 #1239
- Fix wrong relative path in Baichuan documentation #1242
- Fix wrong SamplingConfig tensors in ModelRunnerCpp #1183
- Fix error when converting SmoothQuant LLaMA #1267
- Fix the issue that examples/run.py only loads one line from --input_file
- Fix the issue that ModelRunnerCpp does not transfer SamplingConfig tensor fields correctly #1183
- Benchmark
- Add emulated static batching in gptManagerBenchmark
- Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
- Add percentile latency report to gptManagerBenchmark
- Performance
- Infra
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
- The dependent TensorRT version is updated to 9.3
- The dependent PyTorch version is updated to 2.2
- The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
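For anyone wondering whether a LLaMA-like model falls within the head sizes listed under Features, here is a minimal sketch of a pre-build check. It is not part of TensorRT-LLM: the helper names are hypothetical, and it assumes the common convention that head size equals hidden size divided by the number of attention heads unless the model config sets the head dimension explicitly (the optional head_dim argument covers that case).

```python
# Head sizes listed in these release notes for LLaMA-like models on
# Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90).
SUPPORTED_HEAD_SIZES = (32, 40, 64, 80, 96, 104, 128, 160, 256)


def head_size(hidden_size, num_attention_heads, head_dim=None):
    """Return the per-head dimension.

    Hypothetical helper, not a TensorRT-LLM API. If the model config
    provides an explicit head dimension, that value wins; otherwise the
    head size is assumed to be hidden_size / num_attention_heads.
    """
    if head_dim is not None:
        return head_dim
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden_size // num_attention_heads


def is_supported_head_size(hidden_size, num_attention_heads, head_dim=None):
    """Check the computed head size against the list from the release notes."""
    size = head_size(hidden_size, num_attention_heads, head_dim)
    return size in SUPPORTED_HEAD_SIZES


if __name__ == "__main__":
    # hidden_size=4096 with 32 heads gives a head size of 128.
    print(is_supported_head_size(4096, 32))
    # hidden_size=3072 with 16 heads gives 192, which is not in the list.
    print(is_supported_head_size(3072, 16))
    # The same shape with an explicit head dimension of 256 is in the list.
    print(is_supported_head_size(3072, 16, head_dim=256))
```

The explicit head_dim path matters for models whose head size is not derived from the hidden size, so it is worth reading that value from the model config when it is present.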
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team