
Commit dc1a9d1

kevalmorabia97 authored and Chenjie Luo committed
nvidia-modelopt 0.17.0 examples release
1 parent 8a999e2 commit dc1a9d1

File tree

115 files changed, +7040 -7026 lines changed


README.md

Lines changed: 36 additions & 6 deletions
@@ -16,6 +16,9 @@

 ## Latest News

+- \[2024/8/28\] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
+- \[2024/8/28\] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
+- \[2024/08/15\] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
 - \[2024/06/03\] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
 - \[2024/05/08\] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
 - \[2024/03/27\] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
@@ -30,14 +33,21 @@
 - [Techniques](#techniques)
 - [Quantization](#quantization)
 - [Sparsity](#sparsity)
+- [Distillation](#distillation)
+- [Pruning](#pruning)
 - [Examples](#examples)
 - [Support Matrix](#support-matrix)
 - [Benchmark](#benchmark)
 - [Release Notes](#release-notes)

 ## Model Optimizer Overview

-Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization) and [sparsity](#sparsity) to compress models. It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as inputs and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT). Further integrations are planned for [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/).
+Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size.
+The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [sparsity](#sparsity), [distillation](#distillation), and [pruning](#pruning) to compress models.
+It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint.
+Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT).
+Further integrations are planned for [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
+For enterprise users, the 8-bit quantization with Stable Diffusion is also available on [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/).
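
As an editor's aside (not part of the diff), the workflow this paragraph describes reduces to calling a ModelOpt API on a torch model. Below is a minimal post-training quantization sketch, assuming the `mtq.quantize` entry point and the `INT8_DEFAULT_CFG` preset behave as in the ModelOpt documentation; check the docs for the exact names:

```python
# Editor's sketch -- assumes `mtq.quantize` and `INT8_DEFAULT_CFG` as documented.
import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq

# Toy model and calibration data, just to make the call runnable.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
calib_data = [torch.randn(8, 128) for _ in range(16)]

def forward_loop(m: nn.Module) -> None:
    # Run a few batches so the inserted quantizers can collect calibration statistics.
    with torch.no_grad():
        for batch in calib_data:
            m(batch)

# Returns the model with simulated-quantization modules inserted; it can then be
# exported as a quantized checkpoint for TensorRT-LLM / TensorRT deployment.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```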

 Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.

@@ -46,15 +56,18 @@ Model Optimizer is available for free for all developers on [NVIDIA PyPI](https:

 ### [PIP](https://pypi.org/project/nvidia-modelopt/)

 ```bash
-pip install "nvidia-modelopt[all]~=0.15.0" --extra-index-url https://pypi.nvidia.com
+pip install "nvidia-modelopt[all]~=0.17.0" --extra-index-url https://pypi.nvidia.com
 ```

 See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.

+Make sure to also install example-specific dependencies from their respective `requirements.txt` files if any.
+
 ### Docker

 After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit),
-please run the following commands to build the Model Optimizer example docker container.
+please run the following commands to build the Model Optimizer example docker container, which has all the necessary
+dependencies pre-installed for running the examples.

 ```bash
 # Build the docker
@@ -68,6 +81,8 @@ docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_example
 python -c "import modelopt"
 ```

+NOTE: Unless specified otherwise, all example READMEs assume the ModelOpt docker image is used for running the examples.
+
 Alternatively, for PyTorch you can use the [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed, starting from the 24.06 container. Make sure to update the Model Optimizer version to the latest one if not already.

 ## Techniques
@@ -78,7 +93,16 @@ Quantization is an effective model optimization technique for large models. Quan

 ### Sparsity

-Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate the inference. Model Optimizer provides Python API `mts.sparsity()` to apply weight sparsity to a given model. `mts.sparsity()` supports [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
+Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides Python APIs to apply weight sparsity to a given model. It supports the [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
+
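
As an editor's sketch of what applying 2:4 sparsity might look like, assuming an `mts.sparsify` entry point with a `"sparsegpt"` mode and a config that carries the calibration data (the module path, mode name, and config keys are assumptions; check the API reference for the exact names):

```python
# Editor's sketch -- module path, mode name, and config keys are assumptions.
import torch
import torch.nn as nn

import modelopt.torch.sparsity as mts

# Toy model and calibration data for illustration only.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
calib_data = [torch.randn(4, 64) for _ in range(8)]

# Apply a 2:4 weight-sparsity pattern using SparseGPT-style calibration.
sparse_model = mts.sparsify(
    model,
    mode="sparsegpt",
    config={"data_loader": calib_data, "collect_func": lambda batch: batch},
)
```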
+### Pruning
+
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP layers, and depth.
+
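
An editor's sketch of FLOPs-constrained pruning, assuming an `mtp.prune` entry point roughly along the lines of the documentation; the mode name, config keys, and return value are assumptions and may differ from the current API:

```python
# Editor's sketch -- mode name, config keys, and return value are assumptions.
import torch
import torch.nn as nn

import modelopt.torch.prune as mtp

# Toy conv model and dummy input for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 16, 3, padding=1)
)
dummy_input = torch.randn(1, 3, 32, 32)

def score_func(m: nn.Module) -> float:
    # Stand-in for a real validation metric used to rank candidate subnets.
    with torch.no_grad():
        return float(-m(dummy_input).abs().mean())

# Search for a subnet at roughly half the FLOPs of the original model.
pruned_model, _ = mtp.prune(
    model,
    mode="fastnas",
    constraints={"flops": "50%"},
    dummy_input=dummy_input,
    config={"data_loader": [dummy_input], "score_func": score_func},
)
```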
+### Distillation
+
+Knowledge Distillation allows for increasing the accuracy and/or convergence speed of a desired model architecture
+by using a more powerful model's learned features to guide a student model's objective function into imitating it.
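
The mechanism behind this, shown here as plain PyTorch rather than the `modelopt.torch.distill` API (so this is the general recipe, not ModelOpt's interface): a temperature-scaled KL term pulls the student's logits toward the teacher's while the usual task loss is retained:

```python
# Generic knowledge-distillation loss in plain PyTorch (not the ModelOpt API).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Hard-label task loss on the student's own predictions.
    task_loss = F.cross_entropy(student_logits, labels)
    # Soft-label loss: match the teacher's temperature-softened distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * task_loss + (1.0 - alpha) * kd_loss

# Toy usage: 4 samples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```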

 ## Examples

@@ -90,11 +114,17 @@ Sparsity is a technique to further reduce the memory footprint of deep learning
 - [PTQ for Diffusers](./diffusers/quantization/README.md) walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with [TensorRT](https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion). The Diffusers example in this repo is complementary to the [demoDiffusion example in TensorRT repo](https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion#introduction) and includes FP8 plugins as well as the latest updates on INT8 quantization.
 - [QAT for LLMs](./llm_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or 4-bit in [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)).
 - [Sparsity for LLMs](./llm_sparsity/README.md) shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
+- [Pruning](./pruning/README.md) demonstrates how to optimally prune Linear and Conv layers, and Transformer attention heads, MLP, and depth using the Model Optimizer for the following frameworks:
+  - [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) / [NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM) GPT-style models (e.g. Llama 3, Mistral NeMo, etc.)
+  - Hugging Face language models like BERT and GPT-J
+  - Computer Vision models like [NVIDIA Tao](https://developer.nvidia.com/tao-toolkit) framework detection models
 - [ONNX PTQ](./onnx_ptq/README.md) shows how to quantize the ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
+- [Distillation for LLMs](./llm_distill/README.md) demonstrates how to use Knowledge Distillation, which can increase the accuracy and/or convergence speed for fine-tuning / QAT.
+- [Chained Optimizations](./chained_optimizations/README.md) shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).

 ## Support Matrix

-- For LLMs, please refer to this [support matrix](./llm_ptq/README.md#model-support-list).
+- For LLM quantization, please refer to this [support matrix](./llm_ptq/README.md#model-support-list).
 - For Diffusion, the Model Optimizer supports [Stable Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), and [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo).

 ## Benchmark
@@ -103,4 +133,4 @@ Please find the benchmarks [here](./benchmark.md).

 ## Release Notes

-Please see Model Optimizer Changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_versions.html).
+Please see Model Optimizer Changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html).

chained_optimizations/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+results**/

chained_optimizations/README.md

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+# Chaining Multiple Optimization Techniques
+
+This example demonstrates how to chain multiple optimization techniques like Pruning, Distillation, and Quantization together to
+achieve the best performance on a given model.
+
+## HuggingFace BERT Pruning + Distillation + Quantization
+
+This example shows how to compress a [Hugging Face Bert large model for Question Answering](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad)
+using the combination of `modelopt.torch.prune`, `modelopt.torch.distill` and `modelopt.torch.quantize`. More specifically, we will:
+
+1. Prune the Bert large model to 50% FLOPs with the GradNAS algorithm and fine-tune with distillation
+1. Quantize the fine-tuned model to INT8 precision with Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) with distillation
+1. Export the quantized model to ONNX format for deployment with TensorRT
+
+The main Python file is [bert_prune_distill_quantize.py](./bert_prune_distill_quantize.py) and scripts for running it
+for all 3 steps are available in the [scripts](./scripts/) directory.
+More details on this example (including highlighted code snippets) can be found in the Model Optimizer documentation
+[here](https://nvidia.github.io/TensorRT-Model-Optimizer/examples/2_bert_prune_distill_quantize.html).
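
An editor's sketch of how the three steps above might be chained with the ModelOpt torch APIs; the entry points, mode names, and config keys are assumptions (the example script linked above is the authoritative version), and the distillation fine-tuning step is abbreviated to a comment:

```python
# Editor's sketch -- module paths, mode names, and signatures are assumptions;
# see bert_prune_distill_quantize.py for the real implementation.
import torch

import modelopt.torch.prune as mtp
import modelopt.torch.quantization as mtq

def chain_optimizations(model, dummy_input, calib_loader, prune_config):
    # 1. Prune to roughly 50% FLOPs (the example uses GradNAS; mode name assumed).
    model, _ = mtp.prune(
        model,
        mode="gradnas",
        constraints={"flops": "50%"},
        dummy_input=dummy_input,
        config=prune_config,  # e.g. calibration data loader and loss/score function
    )

    # ...fine-tune `model` here, distilling from the original unpruned teacher...

    # 2. Quantize to INT8 with PTQ; QAT would simply keep training the returned model.
    def forward_loop(m):
        with torch.no_grad():
            for batch in calib_loader:
                m(**batch)  # BERT-style models take keyword inputs

    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

    # 3. Export to ONNX for TensorRT deployment (output filename is illustrative).
    torch.onnx.export(model, (dummy_input,), "bert_pruned_int8.onnx")
    return model
```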
