
Commit 1574baa

nvidia-modelopt 0.19.0 examples release

1 parent: dc1a9d1


59 files changed (+2693, -3759 lines)

README.md

Lines changed: 18 additions & 23 deletions
````diff
@@ -16,6 +16,7 @@
 
 ## Latest News
 
+- \[2024/9/10\] [Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/)
 - \[2024/8/28\] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
 - \[2024/8/28\] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
 - \[2024/08/15\] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
@@ -51,39 +52,33 @@ For enterprise users, the 8-bit quantization with Stable Diffusion is also avail
 
 Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
 
-## Installation
+## Installation / Docker
 
-### [PIP](https://pypi.org/project/nvidia-modelopt/)
+The easiest way to get started with Model Optimizer and its additional dependencies (e.g. TensorRT-LLM deployment) is to start from our docker image.
 
-```bash
-pip install "nvidia-modelopt[all]~=0.17.0" --extra-index-url https://pypi.nvidia.com
-```
-
-See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.
-
-Make sure to also install example-specific dependencies from their respective `requirements.txt` files if any.
-
-### Docker
-
-After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit),
-please run the following commands to build the Model Optimizer example docker container which has all the necessary
+After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
+please run the following commands to build the Model Optimizer docker container, which has all the necessary
 dependencies pre-installed for running the examples.
 
 ```bash
-# Build the docker
-docker/build.sh
+# Clone the ModelOpt repository
+git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
+cd TensorRT-Model-Optimizer
+
+# Build the docker image (it will be tagged `docker.io/library/modelopt_examples:latest`)
+# You may customize `docker/Dockerfile` to include or exclude the dependencies you need.
+bash docker/build.sh
 
-# Obtain and start the basic docker image environment.
-# The default built docker image is docker.io/library/modelopt_examples:latest
+# Run the docker image
 docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
 
-# Check installation
-python -c "import modelopt"
+# Check the installation (inside the docker container)
+python -c "import modelopt; print(modelopt.__version__)"
 ```
 
-NOTE: Unless specified otherwise, all example READMEs assume they are using the ModelOpt docker image for running the examples.
+See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more details on alternate pre-built docker images or installation in a local environment.
 
-Alternatively for PyTorch, you can also use [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed starting from 24.06 container. Make sure to update the Model Optimizer version to the latest one if not already.
+NOTE: Unless specified otherwise, all example READMEs assume they are using the above ModelOpt docker image for running the examples. If you are not using the ModelOpt docker image, example-specific dependencies must be installed separately from their respective `requirements.txt` files.
 
 ## Techniques
 
@@ -97,7 +92,7 @@ Sparsity is a technique to further reduce the memory footprint of deep learning
 
 ### Pruning
 
-Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, and depth.
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
 
 ### Distillation
 
````
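For a local (non-docker) setup, the PyPI route still applies; a minimal sketch is below, where the `~=0.19.0` pin is an assumption based on this release (see the installation guide for the exact requirement):

```bash
# Local install without the docker image (sketch; the ~=0.19.0 pin is assumed from
# this commit's release version).
pip install "nvidia-modelopt[all]~=0.19.0" --extra-index-url https://pypi.nvidia.com

# Verify the installation, mirroring the in-container check above.
python -c "import modelopt; print(modelopt.__version__)"
```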

chained_optimizations/bert_prune_distill_quantize.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -39,6 +39,7 @@
 Example showcasing how to do end-to-end optimization of a BERT model on SQuAD using Model Optimizer.
 This includes GradNAS pruning, INT8 quantization, fine-tuning / QAT with distillation, and ONNX export.
 """
+
 import argparse
 import collections
 import json
@@ -875,7 +876,6 @@ def teacher_factory(model_name_or_path):
 
 # Model Optimizer: Define a custom distillation loss function that uses start and end logits
 class StartEndLogitsDistillationLoss(mtd.LogitsDistillationLoss):
-
     def forward(self, outputs_s, outputs_t):
         loss_start = super().forward(outputs_s.start_logits, outputs_t.start_logits)
         loss_end = super().forward(outputs_s.end_logits, outputs_t.end_logits)
```
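The second hunk touches a custom distillation loss built on Model Optimizer's distillation API. A self-contained sketch of the same pattern is below, assuming `mtd` aliases `modelopt.torch.distill` as in the script; the final averaging of the two partial losses is illustrative rather than copied from the script:

```python
# Sketch of the custom distillation loss pattern from bert_prune_distill_quantize.py.
# Assumption: `mtd` is modelopt.torch.distill; the reduction at the end is illustrative.
import modelopt.torch.distill as mtd


class StartEndLogitsDistillationLoss(mtd.LogitsDistillationLoss):
    """Distills SQuAD start/end logits from a teacher model to a student model."""

    def forward(self, outputs_s, outputs_t):
        # Distill the start-position and end-position logits separately.
        loss_start = super().forward(outputs_s.start_logits, outputs_t.start_logits)
        loss_end = super().forward(outputs_s.end_logits, outputs_t.end_logits)
        # Combine the two partial losses (illustrative reduction).
        return (loss_start + loss_end) / 2.0
```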

diffusers/cache_diffusion/example.ipynb

Lines changed: 27 additions & 9 deletions
```diff
@@ -7,10 +7,20 @@
 "outputs": [],
 "source": [
 "import torch\n",
-"from diffusers import PixArtAlphaPipeline, DiffusionPipeline, StableVideoDiffusionPipeline, StableDiffusion3Pipeline\n",
-"from diffusers.utils import load_image, export_to_video, make_image_grid\n",
 "from cache_diffusion import cachify\n",
-"from cache_diffusion.utils import SVD_DEFAULT_CONFIG, SDXL_DEFAULT_CONFIG, PIXART_DEFAULT_CONFIG, SD3_DEFAULT_CONFIG"
+"from cache_diffusion.utils import (\n",
+"    PIXART_DEFAULT_CONFIG,\n",
+"    SD3_DEFAULT_CONFIG,\n",
+"    SDXL_DEFAULT_CONFIG,\n",
+"    SVD_DEFAULT_CONFIG,\n",
+")\n",
+"from diffusers import (\n",
+"    DiffusionPipeline,\n",
+"    PixArtAlphaPipeline,\n",
+"    StableDiffusion3Pipeline,\n",
+"    StableVideoDiffusionPipeline,\n",
+")\n",
+"from diffusers.utils import export_to_video, load_image, make_image_grid"
 ]
 },
 {
@@ -78,7 +88,9 @@
 "outputs": [],
 "source": [
 "generator = torch.Generator(device=\"cuda\").manual_seed(2946901)\n",
-"baseline_img_20_steps = pipe(prompt=prompt, num_inference_steps=num_inference_steps, generator=generator).images[0]"
+"baseline_img_20_steps = pipe(\n",
+"    prompt=prompt, num_inference_steps=num_inference_steps, generator=generator\n",
+").images[0]"
 ]
 },
 {
@@ -123,7 +135,9 @@
 "generator = torch.Generator(device=\"cuda\").manual_seed(2946901)\n",
 "\n",
 "with cachify.infer(pipe) as cached_pipe:\n",
-"    cache_img = cached_pipe(prompt=prompt, num_inference_steps=num_inference_steps, generator=generator).images[0]"
+"    cache_img = cached_pipe(\n",
+"        prompt=prompt, num_inference_steps=num_inference_steps, generator=generator\n",
+"    ).images[0]"
 ]
 },
 {
@@ -174,7 +188,9 @@
 "generator = torch.Generator(device=\"cuda\").manual_seed(2946901)\n",
 "\n",
 "with cachify.infer(pipe) as cached_pipe:\n",
-"    img = cached_pipe(prompt=prompt, generator=generator, num_inference_steps=num_inference_steps).images[0]"
+"    img = cached_pipe(\n",
+"        prompt=prompt, generator=generator, num_inference_steps=num_inference_steps\n",
+"    ).images[0]"
 ]
 },
 {
@@ -262,9 +278,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"pipe = StableDiffusion3Pipeline.from_pretrained(\"stabilityai/stable-diffusion-3-medium-diffusers\", torch_dtype=torch.float16)\n",
+"pipe = StableDiffusion3Pipeline.from_pretrained(\n",
+"    \"stabilityai/stable-diffusion-3-medium-diffusers\", torch_dtype=torch.float16\n",
+")\n",
 "pipe = pipe.to(\"cuda\")\n",
-"num_inference_steps=28"
+"num_inference_steps = 28"
 ]
 },
 {
@@ -290,7 +308,7 @@
 "        negative_prompt=\"\",\n",
 "        num_inference_steps=28,\n",
 "        guidance_scale=7.0,\n",
-"        generator=generator\n",
+"        generator=generator,\n",
 "    ).images[0]\n",
 "cached_img"
 ]
```
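The reformatted cells all follow one pattern: run the pipeline once as a baseline, then run the identical call inside `cachify.infer(...)`. A minimal consolidated sketch is below; the SDXL model id and prompt are placeholders, and the cache-configuration step that the notebook performs beforehand with `cachify` and `SDXL_DEFAULT_CONFIG` is assumed and not repeated here:

```python
# Minimal sketch of the notebook's baseline-vs-cached comparison (assumptions noted
# in the lead-in; the cachify configuration call from the notebook is not repeated).
import torch
from cache_diffusion import cachify
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "A photo of an astronaut riding a horse"  # placeholder prompt
num_inference_steps = 20

# Baseline run without caching.
generator = torch.Generator(device="cuda").manual_seed(2946901)
baseline_img = pipe(
    prompt=prompt, num_inference_steps=num_inference_steps, generator=generator
).images[0]

# Cached run: the same call, wrapped in the cachify inference context.
generator = torch.Generator(device="cuda").manual_seed(2946901)
with cachify.infer(pipe) as cached_pipe:
    cache_img = cached_pipe(
        prompt=prompt, num_inference_steps=num_inference_steps, generator=generator
    ).images[0]
```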

diffusers/cache_diffusion/pipeline/models/sdxl.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -116,7 +116,6 @@ def cacheunet_forward(
     encoder_attention_mask: Optional[torch.Tensor] = None,
     return_dict: bool = True,
 ) -> Union[UNet2DConditionOutput, Tuple]:
-
     # 1. time
     t_emb = self.get_time_embed(sample=sample, timestep=timestep)
     emb = self.time_embedding(t_emb, timestep_cond)
```

diffusers/cache_diffusion/pipeline/utils.py

Lines changed: 1 addition & 4 deletions
```diff
@@ -26,9 +26,7 @@
 import torch
 from cuda import cudart
 from polygraphy.backend.common import bytes_from_path
-from polygraphy.backend.trt import (
-    engine_from_bytes,
-)
+from polygraphy.backend.trt import engine_from_bytes
 
 numpy_to_torch_dtype_dict = {
     np.uint8: torch.uint8,
@@ -88,7 +86,6 @@ def allocate_buffers(self, shape_dict=None, device="cuda", batch_size=1):
             self.tensors[name] = tensor
 
     def __call__(self, feed_dict, stream, use_cuda_graph=False):
-
         for name, buf in feed_dict.items():
             self.tensors[name].copy_(buf)
```

diffusers/quantization/README.md

Lines changed: 22 additions & 17 deletions
````diff
@@ -120,32 +120,37 @@ Note, the engines must be built on the same GPU, and ensure that the INT8 engine
 - Run the above txt2img example command again. You can compare the generated images and latency for fp16 vs int8.
 Similarly, you could run end-to-end pipeline with Model Optimizer quantized backbone and corresponding examples in demoDiffusion with other diffusion models.
 
-### ModelOPT Python-native TRT Pipeline
+### Running the inference pipeline with DeviceModel
 
-For our testing pipeline, all you need to do is generate the engine file using `trtexec`. The pipeline will then automatically load it for TensorRT inference. For more details, you can check the available options by running:
+DeviceModel is an interface designed to run TensorRT engines like torch models. It takes torch inputs and returns torch outputs. Under the hood, DeviceModel exports a torch checkpoint to ONNX and then generates a TensorRT engine from it. This allows you to swap the backbone of the diffusion pipeline with DeviceModel and execute the pipeline for your desired prompt.<br><br>
 
-```bash
-python trt_infer.py --help
-```
-
-To run the pipeline, execute the following command:
+Generate a quantized torch checkpoint using the command shown below:
 
 ```bash
-python trt_infer.py --model {sdxl-1.0|sd3-medium|flux-dev} --inf-img-size 1
+python quantize.py \
+    --model {sdxl-1.0|sdxl-turbo|sd2.1|sd2.1-base|sd3-medium|flux-dev|flux-schnell} \
+    --format fp8 \
+    --batch-size {1|2} \
+    --calib-size 128 \
+    --quant-level 3.0 \
+    --n-steps 20 \
+    --quantized-torch-ckpt-save-path ./{MODEL}_fp8.pt \
+    --collect-method default
 ```
 
-If you prefer to use the Python-native TRT Pipeline in your scripts, you can use the following code:
+Generate images for the quantized checkpoint with the following command:
 
-```
-deploy.load(
-    pipe,
-    {sdxl-1.0|sd3-medium},
-    Path({YOUR_ENGINE_FILE_PATH}),
-    {1|2|8|16},
-)
+```bash
+python diffusion_trt.py \
+    --model {sdxl-1.0|sdxl-turbo|sd2.1|sd2.1-base|sd3-medium|flux-dev|flux-schnell} \
+    --prompt "A cat holding a sign that says hello world" \
+    [--restore-from ./{MODEL}_fp8.pt] \
+    [--onnx-load-path {ONNX_DIR}] \
+    [--trt_engine-path {ENGINE_DIR}]
 ```
 
-After that, you can use the pipe as you normally would with the Diffusers pipeline on your local machine, and it will automatically run in TensorRT without any additional changes, which will run faster than the PyTorch runtime.
+This script will save the output image as `./{MODEL}.png` and report the latency of the TensorRT backbone.
+To generate the image with FP16|BF16 precision, run the command shown above without the `--restore-from` argument.<br><br>
 
 ## Demo Images
````
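As a concrete worked example (illustrative only; the model choice, batch size, and checkpoint path are placeholders picked from the option lists above):

```bash
# Quantize the SDXL backbone to FP8 and save a quantized torch checkpoint.
python quantize.py \
    --model sdxl-1.0 \
    --format fp8 \
    --batch-size 2 \
    --calib-size 128 \
    --quant-level 3.0 \
    --n-steps 20 \
    --quantized-torch-ckpt-save-path ./sdxl-1.0_fp8.pt \
    --collect-method default

# Restore the quantized checkpoint and run the TensorRT pipeline via DeviceModel;
# per the README above, the output image is saved as ./sdxl-1.0.png.
python diffusion_trt.py \
    --model sdxl-1.0 \
    --prompt "A cat holding a sign that says hello world" \
    --restore-from ./sdxl-1.0_fp8.pt
```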

diffusers/quantization/config.py

Lines changed: 106 additions & 0 deletions
```diff
@@ -132,3 +132,109 @@ def set_stronglytyped_precision(quant_config, precision: str = "Half"):
     for key in quant_config["quant_cfg"].keys():
         if "trt_high_precision_dtype" in quant_config["quant_cfg"][key].keys():
             quant_config["quant_cfg"][key]["trt_high_precision_dtype"] = precision
+
+
+def update_dynamic_axes(model, dynamic_axes):
+    if model in ["flux-dev", "flux-schnell"]:
+        dynamic_axes["out.0"] = dynamic_axes.pop("output")
+    elif model in ["sdxl-1.0", "sdxl-turbo"]:
+        dynamic_axes["added_cond_kwargs.text_embeds"] = dynamic_axes.pop("text_embeds")
+        dynamic_axes["added_cond_kwargs.time_ids"] = dynamic_axes.pop("time_ids")
+        dynamic_axes["out.0"] = dynamic_axes.pop("latent")
+    elif model in ["sd2.1", "sd2.1-base"]:
+        dynamic_axes["out.0"] = dynamic_axes.pop("latent")
+    elif model == "sd3-medium":
+        dynamic_axes["out.0"] = dynamic_axes.pop("sample")
+
+
+SDXL_DYNAMIC_SHAPES = {
+    "sample": {"min": [2, 4, 128, 128], "opt": [16, 4, 128, 128]},
+    "timestep": {"min": [1], "opt": [1]},
+    "encoder_hidden_states": {"min": [2, 77, 2048], "opt": [16, 77, 2048]},
+    "added_cond_kwargs.text_embeds": {"min": [2, 1280], "opt": [16, 1280]},
+    "added_cond_kwargs.time_ids": {"min": [2, 6], "opt": [16, 6]},
+}
+
+SD2_DYNAMIC_SHAPES = {
+    "sample": {"min": [2, 4, 96, 96], "opt": [16, 4, 96, 96]},
+    "timestep": {"min": [1], "opt": [1]},
+    "encoder_hidden_states": {"min": [2, 77, 1024], "opt": [16, 77, 1024]},
+}
+
+SD2_BASE_DYNAMIC_SHAPES = {
+    "sample": {"min": [2, 4, 64, 64], "opt": [16, 4, 64, 64]},
+    "timestep": {"min": [1], "opt": [1]},
+    "encoder_hidden_states": {"min": [2, 77, 1024], "opt": [16, 77, 1024]},
+}
+
+SD3_DYNAMIC_SHAPES = {
+    "hidden_states": {"min": [2, 16, 128, 128], "opt": [16, 16, 128, 128]},
+    "timestep": {"min": [2], "opt": [16]},
+    "encoder_hidden_states": {"min": [2, 333, 4096], "opt": [16, 333, 4096]},
+    "pooled_projections": {"min": [2, 2048], "opt": [16, 2048]},
+}
+
+FLUX_DEV_DYNAMIC_SHAPES = {
+    "hidden_states": {"min": [1, 4096, 64], "opt": [1, 4096, 64]},
+    "timestep": {"min": [1], "opt": [1]},
+    "guidance": {"min": [1], "opt": [1]},
+    "pooled_projections": {"min": [1, 768], "opt": [1, 768]},
+    "encoder_hidden_states": {"min": [1, 512, 4096], "opt": [1, 512, 4096]},
+    "txt_ids": {"min": [1, 512, 3], "opt": [1, 512, 3]},
+    "img_ids": {"min": [1, 4096, 3], "opt": [1, 4096, 3]},
+}
+
+FLUX_SCHNELL_DYNAMIC_SHAPES = FLUX_DEV_DYNAMIC_SHAPES.copy()
+FLUX_SCHNELL_DYNAMIC_SHAPES.pop("guidance")
+
+
+def create_dynamic_shapes(dynamic_shapes):
+    min_shapes = {}
+    opt_shapes = {}
+    for key, value in dynamic_shapes.items():
+        min_shapes[key] = value["min"]
+        opt_shapes[key] = value["opt"]
+    return {
+        "dynamic_shapes": {
+            "minShapes": min_shapes,
+            "optShapes": opt_shapes,
+            "maxShapes": opt_shapes,
+        }
+    }
+
+
+DYNAMIC_SHAPES = {
+    "sdxl-1.0": create_dynamic_shapes(SDXL_DYNAMIC_SHAPES),
+    "sdxl-turbo": create_dynamic_shapes(SDXL_DYNAMIC_SHAPES),
+    "sd2.1": create_dynamic_shapes(SD2_DYNAMIC_SHAPES),
+    "sd2.1-base": create_dynamic_shapes(SD2_BASE_DYNAMIC_SHAPES),
+    "sd3-medium": create_dynamic_shapes(SD3_DYNAMIC_SHAPES),
+    "flux-dev": create_dynamic_shapes(FLUX_DEV_DYNAMIC_SHAPES),
+    "flux-schnell": create_dynamic_shapes(FLUX_SCHNELL_DYNAMIC_SHAPES),
+}
+
+IO_SHAPES = {
+    "sdxl-1.0": {"out.0": [2, 4, 128, 128]},
+    "sdxl-turbo": {"out.0": [2, 4, 64, 64]},
+    "sd2.1": {"out.0": [2, 4, 96, 96]},
+    "sd2.1-base": {"out.0": [2, 4, 64, 64]},
+    "sd3-medium": {"out.0": [2, 16, 128, 128]},
+    "flux-dev": {},
+    "flux-schnell": {},
+}
+
+
+def get_io_shapes(model, onnx_load_path):
+    output_name = ""
+    if onnx_load_path != "":
+        if model in ["sdxl-1.0", "sdxl-turbo", "sd2.1", "sd2.1-base"]:
+            output_name = "latent"
+        elif model in ["flux-dev", "flux-schnell"]:
+            output_name = "output"
+        elif model in ["sd3-medium"]:
+            output_name = "sample"
+    else:
+        output_name = "out.0"
+    io_shapes = IO_SHAPES[model]
+    io_shapes[output_name] = io_shapes.pop("out.0")
+    return io_shapes
```
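A short usage sketch of the new helpers is below, assuming the module is importable as `config` from the `diffusers/quantization` directory; the printed values follow directly from the dictionaries defined above, and the ONNX path is a placeholder:

```python
# Sketch: inspecting the shape metadata added in config.py (assumptions in the lead-in).
from config import DYNAMIC_SHAPES, get_io_shapes

# Per-model min/opt/max shape dictionaries intended for dynamic-shape engine builds.
sdxl = DYNAMIC_SHAPES["sdxl-1.0"]["dynamic_shapes"]
print(sdxl["minShapes"]["sample"])  # [2, 4, 128, 128]
print(sdxl["optShapes"]["sample"])  # [16, 4, 128, 128] (also reused as maxShapes)

# Output-binding shapes: the output tensor name depends on whether a pre-exported
# ONNX model is supplied (an empty path means the checkpoint is exported by the script).
print(get_io_shapes("sd3-medium", onnx_load_path=""))      # {'out.0': [2, 16, 128, 128]}
print(get_io_shapes("sd2.1", onnx_load_path="unet.onnx"))  # {'latent': [2, 4, 96, 96]}
```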
