README.md (+18 -23)
@@ -16,6 +16,7 @@
 ## Latest News

+- \[2024/9/10\] [Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/)
 - \[2024/8/28\] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
 - \[2024/8/28\] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
 - \[2024/08/15\] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
@@ -51,39 +52,33 @@ For enterprise users, the 8-bit quantization with Stable Diffusion is also avail
 Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
 See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.
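As a minimal sketch of the PyPI route (the `[all]` extra and the `modelopt.__version__` check are assumptions based on the linked installation guide; prefer the guide for the exact command for your setup):

```bash
# Install Model Optimizer from PyPI with optional extras
# (the [all] extra is an assumption -- see the installation guide)
pip install -U "nvidia-modelopt[all]"

# Quick sanity check that the package imports
python -c "import modelopt; print(modelopt.__version__)"
```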
-Make sure to also install example-specific dependencies from their respective `requirements.txt` files if any.
-
-### Docker
-
-After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit),
-please run the following commands to build the Model Optimizer example docker container which has all the necessary
+After installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html),
+please run the following commands to build the Model Optimizer docker container which has all the necessary
 dependencies pre-installed for running the examples.
-NOTE: Unless specified otherwise, all example READMEs assume they are using the ModelOpt docker image for running the examples.
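As a rough, hypothetical sketch of what building and entering such an example container generally looks like (the Dockerfile location, image tag, and mount paths below are illustrative assumptions; follow the repository's documented docker commands):

```bash
# Sanity-check that Docker can see the GPUs through the NVIDIA Container Toolkit
docker run --rm --gpus all ubuntu nvidia-smi

# Hypothetical build-and-run sketch: the Dockerfile location and image tag are
# illustrative, not the repository's documented commands.
docker build -t modelopt-examples .
docker run --gpus all -it --rm -v "$(pwd)":/workspace -w /workspace modelopt-examples
```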
+See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more details on alternate pre-built docker images or installation in a local environment.

-Alternatively for PyTorch, you can also use [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed starting from 24.06 container. Make sure to update the Model Optimizer version to the latest one if not already.
+NOTE: Unless specified otherwise, all example READMEs assume they are using the above ModelOpt docker image for running the examples. Example-specific dependencies must be installed separately from their respective `requirements.txt` files if you are not using the ModelOpt docker image.
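For instance, when running outside the ModelOpt docker image, a given example's dependencies are installed with a plain pip command (the directory name below is illustrative; substitute the example you are actually running):

```bash
# Install example-specific dependencies; the path shown is illustrative
pip install -r llm_ptq/requirements.txt
```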
## Techniques
@@ -97,7 +92,7 @@ Sparsity is a technique to further reduce the memory footprint of deep learning
 ### Pruning

-Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, and depth.
+Pruning is a technique to reduce the model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
diffusers/quantization/README.md (+22 -17)
@@ -120,32 +120,37 @@ Note, the engines must be built on the same GPU, and ensure that the INT8 engine
 - Run the above txt2img example command again. You can compare the generated images and latency for fp16 vs int8.
 Similarly, you could run the end-to-end pipeline with a Model Optimizer quantized backbone and the corresponding examples in demoDiffusion with other diffusion models.

-### ModelOPT Python-native TRT Pipeline
+### Running the inference pipeline with DeviceModel

-For our testing pipeline, all you need to do is generate the engine file using `trtexec`. The pipeline will then automatically load it for TensorRT inference. For more details, you can check the available options by running:
+DeviceModel is an interface designed to run TensorRT engines like torch models. It takes torch inputs and returns torch outputs. Under the hood, DeviceModel exports a torch checkpoint to ONNX and then generates a TensorRT engine from it. This allows you to swap the backbone of the diffusion pipeline with DeviceModel and execute the pipeline for your desired prompt.<br><br>

-```bash
-python trt_infer.py --help
-```
-
-To run the pipeline, execute the following command:
+Generate a quantized torch checkpoint using the command shown below:
     --prompt "A cat holding a sign that says hello world" \
+    [--restore-from ./{MODEL}_fp8.pt] \
+    [--onnx-load-path {ONNX_DIR}] \
+    [--trt_engine-path {ENGINE_DIR}]
 ```

-After that, you can use the pipe as you normally would with the Diffusers pipeline on your local machine, and it will automatically run in TensorRT without any additional changes, which will run faster than the PyTorch runtime.
+This script will save the output image as `./{MODEL}.png` and report the latency of the TensorRT backbone.
+To generate the image with FP16|BF16 precision, run the command shown above without the `--restore-from` argument.<br><br>