README.md (+18 -23)
@@ -16,6 +16,7 @@
 ## Latest News

+- \[2024/9/10\] [Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/)
 - \[2024/8/28\] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
 - \[2024/8/28\] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
 - \[2024/08/15\] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
@@ -51,39 +52,33 @@ For enterprise users, the 8-bit quantization with Stable Diffusion is also avail
 Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
 See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.
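As a minimal sketch of the PyPI route (the `[all]` extra and the `modelopt.__version__` check are assumptions based on the linked installation guide; prefer the guide for the exact command for your setup):

```bash
# Install Model Optimizer from PyPI with optional extras
# (the [all] extra is an assumption -- see the installation guide)
pip install -U "nvidia-modelopt[all]"

# Quick sanity check that the package imports
python -c "import modelopt; print(modelopt.__version__)"
```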
-Make sure to also install example-specific dependencies from their respective `requirements.txt` files if any.
-
-### Docker
-
-After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit),
-please run the following commands to build the Model Optimizer example docker container which has all the necessary
+After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
+please run the following commands to build the Model Optimizer docker container which has all the necessary
 dependencies pre-installed for running the examples.
-NOTE: Unless specified otherwise, all example READMEs assume they are using the ModelOpt docker image for running the examples.
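As a rough, hypothetical sketch of what building and entering such an example container generally looks like (the Dockerfile location, image tag, and mount paths below are illustrative assumptions; follow the repository's documented docker commands):

```bash
# Sanity-check that Docker can see the GPUs through the NVIDIA Container Toolkit
docker run --rm --gpus all ubuntu nvidia-smi

# Hypothetical build-and-run sketch: the Dockerfile location and image tag are
# illustrative, not the repository's documented commands.
docker build -t modelopt-examples .
docker run --gpus all -it --rm -v "$(pwd)":/workspace -w /workspace modelopt-examples
```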
+See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more details on alternate pre-built docker images or installation in a local environment.

-Alternatively for PyTorch, you can also use [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed starting from 24.06 container. Make sure to update the Model Optimizer version to the latest one if not already.
+NOTE: Unless specified otherwise, all example READMEs assume they are using the above ModelOpt docker image for running the examples. Example-specific dependencies must be installed separately from their respective `requirements.txt` files if you are not using the ModelOpt docker image.
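For instance, when running outside the ModelOpt docker image, a given example's dependencies are installed with a plain pip command (the directory name below is illustrative; substitute the example you are actually running):

```bash
# Install example-specific dependencies; the path shown is illustrative
pip install -r llm_ptq/requirements.txt
```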
## Techniques
@@ -97,7 +92,7 @@ Sparsity is a technique to further reduce the memory footprint of deep learning
 ### Pruning

-Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, and depth.
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
diffusers/quantization/README.md (+22 -17)
@@ -120,32 +120,37 @@ Note, the engines must be built on the same GPU, and ensure that the INT8 engine
 - Run the above txt2img example command again. You can compare the generated images and latency for fp16 vs int8.
 Similarly, you could run the end-to-end pipeline with a Model Optimizer quantized backbone and the corresponding examples in demoDiffusion with other diffusion models.

-### ModelOPT Python-native TRT Pipeline
+### Running the inference pipeline with DeviceModel

-For our testing pipeline, all you need to do is generate the engine file using `trtexec`. The pipeline will then automatically load it for TensorRT inference. For more details, you can check the available options by running:
+DeviceModel is an interface designed to run TensorRT engines like torch models. It takes torch inputs and returns torch outputs. Under the hood, DeviceModel exports a torch checkpoint to ONNX and then generates a TensorRT engine from it. This allows you to swap the backbone of the diffusion pipeline with DeviceModel and execute the pipeline for your desired prompt.<br><br>

-```bash
-python trt_infer.py --help
-```
-
-To run the pipeline, execute the following command:
+Generate a quantized torch checkpoint using the command shown below:
     --prompt "A cat holding a sign that says hello world" \
+    [--restore-from ./{MODEL}_fp8.pt] \
+    [--onnx-load-path {ONNX_DIR}] \
+    [--trt_engine-path {ENGINE_DIR}]
 ```

-After that, you can use the pipe as you normally would with the Diffusers pipeline on your local machine, and it will automatically run in TensorRT without any additional changes, which will run faster than the PyTorch runtime.
+This script will save the output image as `./{MODEL}.png` and report the latency of the TensorRT backbone.
+To generate the image with FP16|BF16 precision, run the command shown above without the `--restore-from` argument.<br><br>