Update on the development branch #847

kaiyux · 2024-01-09T13:10:02Z

kaiyux
Jan 9, 2024
Maintainer

Hi,

The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) this January 9th, 2024.

This update includes:

Model Support
- Add example for multimodal models (BLIP with OPT or T5, LlaVA)
Features
- Smooth Quantization support for ChatGLM2-6B / ChatGLM3-6B / ChatGLM2-6B-32K
- Out-of-the-box support for the QWEN model
- Support for returning context and/or generation logits in the Triton backend
API
- Add a set of High-level APIs for end-to-end generation tasks, the features are as below
  - ModelConfig() as a clean configuration interface for LLM tasks
  - LLM() for LLM pipelines, it will trigger the necessary engine building or model quantization silently in the background
  - generate() API for batched offline inference, both single-GPU and multi-GPU supported
  - generate_async() API for asynchronous offline inference on a single GPU, streaming mode is supported
Bug fixes
- Add pickle support for InferenceRequest GptManager pybind 2/4TP run demo #701
- Fix Mixtral-8x7b build failure with custom_all_reduce Mixtral-8x7b build fails with custom_all_reduce #825
Performance
- Performance optimization of beam search kernel
- Increase default freeGpuMemoryFraction parameter from 0.85 to 0.9 for higher throughput
Documentation
- Add documentation for best practices for tuning the performance of TensorRT-LLM (See docs/source/perf_best_practices.md)
- Add documentation for Falcon AWQ support (See examples/falcon/README.md)

Thanks,
The TensorRT-LLM Engineering Team