Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size.
The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [distillation](#distillation), [pruning](#pruning), and [sparsity](#sparsity) to compress models.
It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized, quantized checkpoint.
Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT).
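As a rough illustration of that hand-off, here is a minimal sketch only: the export helper's arguments, the `decoder_type` value, and the output directory are assumptions, and `quantized_model` stands for a model already quantized with the API shown under Quantization below.

```python
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

# `quantized_model` is assumed to be a torch LLM already calibrated and quantized
# with modelopt.torch.quantization (see the Quantization section below).
export_tensorrt_llm_checkpoint(
    quantized_model,
    decoder_type="llama",         # illustrative: the model's architecture family
    dtype=torch.float16,          # precision used for the non-quantized weights
    export_dir="exported_ckpt",   # hypothetical output dir, consumed by TensorRT-LLM
    inference_tensor_parallel=1,  # illustrative parallelism setting
)
```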
ModelOpt is integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
```bash
cd TensorRT-Model-Optimizer

# Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
# You may customize `docker/Dockerfile` to include or exclude dependencies as needed.
./docker/build.sh

# Run the docker image
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
```
### Quantization
Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
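For illustration, a minimal PTQ sketch with the `modelopt.torch.quantization` API; the toy model, random calibration data, and choice of config are placeholders:

```python
import torch
import modelopt.torch.quantization as mtq

# Toy model and calibration data, just to keep the sketch self-contained.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
)
calib_data = [torch.randn(8, 128) for _ in range(16)]

# Pick one of the built-in quantization configs (INT8 shown here; FP8, INT4 AWQ,
# etc. are selected the same way).
config = mtq.INT8_DEFAULT_CFG

def forward_loop(model):
    # Run calibration batches through the model so the inserted quantizers can
    # collect activation statistics.
    for batch in calib_data:
        model(batch)

# PTQ: insert quantizers, calibrate, and return the quantized model. For QAT,
# continue fine-tuning the returned model with your usual training loop.
model = mtq.quantize(model, config, forward_loop)
```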
### Distillation
Knowledge Distillation allows for increasing the accuracy and/or convergence speed of a desired model architecture by using a more powerful model's learned features to guide a student model's objective function toward imitating them.
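To make the idea concrete, here is a minimal, generic PyTorch sketch of a distillation loss (illustrative only; it is not the Model Optimizer distillation API):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: push the student toward the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy teacher/student pair to keep the sketch self-contained.
teacher = torch.nn.Linear(32, 10).eval()
student = torch.nn.Linear(32, 10)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

x = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))
with torch.no_grad():
    teacher_logits = teacher(x)

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
optimizer.step()
```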
### Pruning
Pruning is a technique to reduce model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
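As a concrete illustration of structured pruning (using PyTorch's built-in `torch.nn.utils.prune` utilities rather than the Model Optimizer pruning API):

```python
import torch
import torch.nn.utils.prune as prune

# Toy network: zero out 50% of the Conv output channels and 30% of the Linear
# rows (neurons) by L1 norm, i.e. whole structures rather than individual weights.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)

prune.ln_structured(model[0], name="weight", amount=0.5, n=1, dim=0)  # output channels
prune.ln_structured(model[4], name="weight", amount=0.3, n=1, dim=0)  # output neurons

# Fold the pruning masks into the weights to make the sparsity permanent.
prune.remove(model[0], "weight")
prune.remove(model[4], "weight")

pruned_channels = (model[0].weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{pruned_channels} of 16 conv channels zeroed out")
```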
### Sparsity
Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides Python APIs to apply weight sparsity to a given model. It supports the [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
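For intuition, a minimal sketch of magnitude-based 2:4 sparsity (keep the two largest-magnitude weights in every group of four); this illustrates the pattern only and is not the Model Optimizer sparsity API:

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4 along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the inner dimension to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 largest-magnitude entries per group of 4.
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

linear = torch.nn.Linear(64, 32)
with torch.no_grad():
    linear.weight.copy_(apply_2_4_sparsity(linear.weight))

# Every weight row is now 50% zero in the hardware-friendly 2:4 pattern.
print(f"zero fraction: {(linear.weight == 0).float().mean().item():.2f}")
```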
## Examples
- [ONNX PTQ](./onnx_ptq/README.md) shows how to quantize ONNX models in INT4 or INT8 mode. The examples also include the deployment of quantized ONNX models using TensorRT.
- [Distillation for LLMs](./llm_distill/README.md) demonstrates how to use Knowledge Distillation, which can increase accuracy and/or convergence speed for fine-tuning / QAT.
- [Chained Optimizations](./chained_optimizations/README.md) shows how to chain multiple optimizations together (e.g., Pruning + Distillation + Quantization).
- [Model Hub](./model_hub/) provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.
## Model Support Matrix
- For LLM quantization, please refer to this [support matrix](./llm_ptq/README.md#model-support-list).
- For VLM quantization, please refer to this [support matrix](./vlm_ptq/README.md#model-support-list).
- For diffusion models, Model Optimizer supports [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo), and [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1).
- For speculative decoding, please refer to this [support matrix](./speculative_decoding/README.md#model-support-list).
## Benchmark
Please find the benchmarks [here](./benchmark.md).
## Quantized Checkpoints
## Release Notes
Please see the Model Optimizer changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html).
## Contributing
At the moment, we are not accepting external contributions. However, this will soon change after we open source our library in early 2025 with a focus on extensibility. We welcome any feedback and feature requests. Please open an issue if you have any suggestions or questions.