
Commit 2a3f7cf

nvidia-modelopt 0.15.0 examples release
1 parent 54cd4a3 commit 2a3f7cf


41 files changed, +2185 -772 lines

README.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@

## Model Optimizer Overview

-Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization) and [sparsity](#sparsity) to compress model. It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as inputs and provides Python APIs for users to easily stack different model optimization techniques to produce quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT). Further integrations are planned for [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/).
+Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization) and [sparsity](#sparsity) to compress models. It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as inputs and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT). Further integrations are planned for [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/).

Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.

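The overview paragraph above describes stacking optimization techniques through Python APIs. As a rough illustration (a minimal sketch, assuming the `modelopt.torch.quantization` preset configs; the model and calibration dataloader are placeholders), post-training quantization typically looks like this:

```python
import modelopt.torch.quantization as mtq

# model: the torch.nn.Module to optimize; calib_loader: a small calibration dataloader (both placeholders)
def forward_loop(model):
    # Run a few calibration batches so activation statistics can be collected
    for batch in calib_loader:
        model(batch)

# Stack a quantization technique onto the model (FP8 preset shown; INT4 AWQ and other presets also exist)
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The resulting quantized checkpoint can then be exported for deployment in downstream frameworks such as TensorRT-LLM or TensorRT, as described above.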
@@ -46,7 +46,7 @@ Model Optimizer is available for free for all developers on [NVIDIA PyPI](https:
### [PIP](https://pypi.org/project/nvidia-modelopt/)

```bash
-pip install "nvidia-modelopt[all]~=0.13.0" --extra-index-url https://pypi.nvidia.com
+pip install "nvidia-modelopt[all]~=0.15.0" --extra-index-url https://pypi.nvidia.com
```

See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.
@@ -68,7 +68,7 @@ docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_example
python -c "import modelopt"
```

-Alternatively for PyTorch, you can also use [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed starting from 24.06 PyTorch container. Make sure to update the Model Optimizer version to the latest one if not already.
+Alternatively for PyTorch, you can also use [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) with Model Optimizer pre-installed starting from 24.06 container. Make sure to update the Model Optimizer version to the latest one if not already.

## Techniques

benchmark.md

Lines changed: 7 additions & 7 deletions
@@ -8,18 +8,18 @@ performance** that can be delivered by Model Optimizer. All performance numbers

#### 1.1 Performanace

-Config: H100, nvidia-modelopt v0.11.0, TensorRT-LLM v0.9, latency measured with full batch inference (no inflight batching).
+Config: H100, nvidia-modelopt v0.15.0, TensorRT-LLM v0.11, latency measured with full batch inference (no inflight batching).
Memory saving and inference speedup are compared to the FP16 baseline. Speedup is normalized to the GPU count.

| | | | FP8 | | | | INT4 AWQ | |
|:----------:|:----------:|:----------:|:----------:|:-------:|:-:|:----------:|:----------:|:-------:|
| Model | Batch Size | Mem Saving | Tokens/sec | Speedup | | Mem Saving | Tokens/sec | Speedup |
-| Llama3-8B | 2 | 1.66x | 337.67 | 1.39x | | 2.37x | 392.99 | 1.61x |
-| | 32 | 1.56x | 2368.69 | 1.66x | | 1.86x | 2037.54 | 1.43x |
-| | 64 | 1.54x | 2404.86 | 1.43x | | 1.76x | 2308.57 | 1.37x |
-| Llama3-70B | 2 | 1.98x | 64.35 | 2.11x | | 3.49x | 77.36 | 2.54x |
-| | 32 | 1.95x | 391.73 | 3.03x | | 2.94x | 479.11 | 3.71x |
-| | 64 | 1.91x | 383.42 | 2.41x | | 2.46x | 348.65 | 2.19x |
+| Llama3-8B | 1 | 1.63x | 175.42 | 1.26x | | 2.34x | 213.45 | 1.53x |
+| | 32 | 1.62x | 3399.84 | 1.49x | | 1.89x | 2546.12 | 1.11x |
+| | 64 | 1.58x | 3311.03 | 1.34x | | 1.97x | 3438.08 | 1.39x |
+| Llama3-70B | 1 | 1.96x | 32.85 | 1.87x | | 3.47x | 47.49 | 2.70x |
+| | 32 | 1.93x | 462.69 | 1.82x | | 2.62x | 365.06 | 1.44x |
+| | 64 | 1.99x | 449.09 | 1.91x | | 2.90x | 483.51 | 2.05x |

### 1.2 Accuracy

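As a reading aid for the updated table: the FP16 baseline throughput is not listed, but it is implied by the ratios. For the Llama3-8B, batch-size-1 row, 175.42 FP8 tokens/sec at a 1.26x speedup corresponds to an FP16 baseline of roughly 175.42 / 1.26 ≈ 139 tokens/sec on the same GPU count.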
diffusers/cache_diffusion/README.md

Lines changed: 31 additions & 7 deletions
@@ -1,15 +1,11 @@
# Cache Diffusion

-## News
-
-- [Utilizing DeepCache to Accelerate Stable Diffusion-XL Benchmarks in MLPerf Yields Leading Results](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
-
## Introduction

| Supported Framework | Supported Models |
|----------|----------|
| **PyTorch** | [**PixArt-α**](https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS), [**Stable Diffusion - XL**](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [**SVD**](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) |
-| **TensorRT** | **WIP** |
+| **TensorRT** | [**Stable Diffusion - XL**](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) |

Cache Diffusion methods, such as [DeepCache](https://arxiv.org/abs/2312.00858), [Block Caching](https://arxiv.org/abs/2312.03209) and [T-Gate](https://arxiv.org/abs/2404.02747), optimize performance by reusing cached outputs from previous steps instead of recalculating them. This **training-free** caching approach is compatible with a variety of models, like **DiT** and **UNet**, enabling considerable acceleration without compromising quality.

@@ -52,9 +48,37 @@ Two parameters are essential: `wildcard_or_filter_func` and `select_cache_step_f

Multiple configurations can be set up, but ensure that the `wildcard_or_filter_func` works correctly. If you input more than one pair of parameters with the same `wildcard_or_filter_func`, the later one in the list will overwrite the previous ones.

-## Demo
+### TensorRT support
+
+#### Quick Start
+
+Install [TensorRT](https://developer.nvidia.com/tensorrt) then run:
+
+```bash
+python run_cache_diffusion.py
+```
+
+You can find the latest TensorRT at [here](https://developer.nvidia.com/tensorrt/download).
+
+To execute cache diffusion in TensorRT, follow these steps:

-The following demo images are generated using `PyTorch==2.3.0 with 1xAda 6000 GPU backend`. TensorRT support will be available in the next ModelOPT release.
+```python
+# Load the model
+
+compile(
+    pipe.unet,
+    onnx_path=Path("./onnx"),
+    engine_path=Path("./engine"),
+)
+
+cachify.prepare(pipe, num_inference_steps, SDXL_DEFAULT_CONFIG)
+```
+
+Afterward, use it as a standard cache diffusion pipeline to generate the image.
+
+Please note that only the UNET component is running in TensorRT, while the other parts remain in PyTorch.
+
+## Demo

Comparing with naively reducing the generation steps, cache diffusion can achieve the same speedup and also much better image quality, even close to the reference image. If the image quality does not meet your needs or product requirements, you can replace our default configuration with your customized settings.

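To make the `wildcard_or_filter_func` / `select_cache_step_func` pairing described in the hunk above concrete, here is a hypothetical single-entry config (the pattern, step rule, and the `SDXL_EXAMPLE_CONFIG` name are illustrative assumptions, not values from this commit):

```python
# Hypothetical config: mirrors the shape of the config_list described above, with made-up values.
SDXL_EXAMPLE_CONFIG = [
    {
        # fnmatch-style wildcard (or a callable filter) selecting which modules to cache
        "wildcard_or_filter_func": "*up_blocks*",
        # Reuse the cached output on odd denoising steps, recompute on even ones
        # (assumes the callable receives the current step index)
        "select_cache_step_func": lambda step: step % 2 == 1,
    },
]
```

Because a later entry with the same `wildcard_or_filter_func` overwrites earlier ones, keep one entry per pattern unless the overwrite is intended.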
diffusers/cache_diffusion/cache_diffusion/cachify.py

Lines changed: 46 additions & 12 deletions
@@ -21,9 +21,21 @@

import fnmatch

-from diffusers.models.attention import FeedForward
-from diffusers.models.attention_processor import Attention
-from diffusers.models.resnet import ResnetBlock2D, TemporalResnetBlock
+from diffusers.models.attention import BasicTransformerBlock
+from diffusers.models.unets.unet_2d_blocks import (
+    CrossAttnDownBlock2D,
+    CrossAttnUpBlock2D,
+    DownBlock2D,
+    UNetMidBlock2DCrossAttn,
+    UpBlock2D,
+)
+from diffusers.models.unets.unet_3d_blocks import (
+    CrossAttnDownBlockSpatioTemporal,
+    CrossAttnUpBlockSpatioTemporal,
+    DownBlockSpatioTemporal,
+    UNetMidBlockSpatioTemporal,
+    UpBlockSpatioTemporal,
+)
from diffusers.pipelines.pixart_alpha.pipeline_pixart_alpha import PixArtAlphaPipeline
from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl import (
    StableDiffusionXLPipeline,
@@ -35,15 +47,37 @@
from .module import CachedModule
from .utils import replace_module

-SUPPORTED_METHODS = {PixArtAlphaPipeline, StableDiffusionXLPipeline, StableVideoDiffusionPipeline}
-
-
-def cachify(model, num_inference_steps, config_list):
+CACHED_PIPE = {
+    StableDiffusionXLPipeline: (
+        DownBlock2D,
+        CrossAttnDownBlock2D,
+        UNetMidBlock2DCrossAttn,
+        CrossAttnUpBlock2D,
+        UpBlock2D,
+    ),
+    PixArtAlphaPipeline: (BasicTransformerBlock),
+    StableVideoDiffusionPipeline: (
+        CrossAttnDownBlockSpatioTemporal,
+        DownBlockSpatioTemporal,
+        UpBlockSpatioTemporal,
+        CrossAttnUpBlockSpatioTemporal,
+        UNetMidBlockSpatioTemporal,
+    ),
+}
+
+
+def cachify(model, num_inference_steps, config_list, modules):
+    if hasattr(model, "use_trt_infer") and model.use_trt_infer:
+        for key, _ in model.engines.items():
+            for config in config_list:
+                if _pass(key, config["wildcard_or_filter_func"]):
+                    model.engines[key] = CachedModule(
+                        model.engines[key], num_inference_steps, config["select_cache_step_func"]
+                    )
+        return
    for name, module in model.named_modules():
        for config in config_list:
-            if _pass(name, config["wildcard_or_filter_func"]) and isinstance(
-                module, (Attention, ResnetBlock2D, TemporalResnetBlock, FeedForward)
-            ):
+            if _pass(name, config["wildcard_or_filter_func"]) and isinstance(module, modules):
                replace_module(
                    model,
                    name,
@@ -86,8 +120,8 @@ def get_model(pipe):


def prepare(pipe, num_inference_steps, config_list):
-    assert pipe.__class__ in SUPPORTED_METHODS, f"{pipe.__class__} is not supported!"
+    assert pipe.__class__ in CACHED_PIPE.keys(), f"{pipe.__class__} is not supported!"

    model = get_model(pipe)

-    cachify(model, num_inference_steps, config_list)
+    cachify(model, num_inference_steps, config_list, CACHED_PIPE[pipe.__class__])
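The reworked `prepare` above looks up the per-pipeline block types in `CACHED_PIPE` and forwards them to `cachify`. A minimal usage sketch, assuming the diffusers SDXL pipeline and the illustrative config from the earlier sketch (the import path and step count are assumptions):

```python
import torch
from diffusers import StableDiffusionXLPipeline

from cache_diffusion import cachify  # assumed import path for this sketch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

num_inference_steps = 30  # placeholder
# prepare() asserts the pipeline class is in CACHED_PIPE and wraps the matching block types
cachify.prepare(pipe, num_inference_steps, SDXL_EXAMPLE_CONFIG)

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=num_inference_steps).images[0]
```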
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: MIT
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+# DEALINGS IN THE SOFTWARE.
+
+SDXL_ONNX_CONFIG = {
+    "down_blocks.0": {
+        "dummy_input": {
+            "hidden_states": (2, 320, 128, 128),
+            "temb": (2, 1280),
+        },
+        "output_names": ["sample", "res_samples_0", "res_samples_1", "res_samples_2"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+        },
+    },
+    "down_blocks.1": {
+        "dummy_input": {
+            "hidden_states": (2, 320, 64, 64),
+            "temb": (2, 1280),
+            "encoder_hidden_states": (2, 77, 2048),
+        },
+        "output_names": ["sample", "res_samples_0", "res_samples_1", "res_samples_2"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "encoder_hidden_states": {0: "batch_size"},
+        },
+    },
+    "down_blocks.2": {
+        "dummy_input": {
+            "hidden_states": (2, 640, 32, 32),
+            "temb": (2, 1280),
+            "encoder_hidden_states": (2, 77, 2048),
+        },
+        "output_names": ["sample", "res_samples_0", "res_samples_1"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "encoder_hidden_states": {0: "batch_size"},
+        },
+    },
+    "mid_block": {
+        "dummy_input": {
+            "hidden_states": (2, 1280, 32, 32),
+            "temb": (2, 1280),
+            "encoder_hidden_states": (2, 77, 2048),
+        },
+        "output_names": ["sample"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "encoder_hidden_states": {0: "batch_size"},
+        },
+    },
+    "up_blocks.0": {
+        "dummy_input": {
+            "hidden_states": (2, 1280, 32, 32),
+            "res_hidden_states_0": (2, 640, 32, 32),
+            "res_hidden_states_1": (2, 1280, 32, 32),
+            "res_hidden_states_2": (2, 1280, 32, 32),
+            "temb": (2, 1280),
+            "encoder_hidden_states": (2, 77, 2048),
+        },
+        "output_names": ["sample"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "encoder_hidden_states": {0: "batch_size"},
+            "res_hidden_states_0": {0: "batch_size"},
+            "res_hidden_states_1": {0: "batch_size"},
+            "res_hidden_states_2": {0: "batch_size"},
+        },
+    },
+    "up_blocks.1": {
+        "dummy_input": {
+            "hidden_states": (2, 1280, 64, 64),
+            "res_hidden_states_0": (2, 320, 64, 64),
+            "res_hidden_states_1": (2, 640, 64, 64),
+            "res_hidden_states_2": (2, 640, 64, 64),
+            "temb": (2, 1280),
+            "encoder_hidden_states": (2, 77, 2048),
+        },
+        "output_names": ["sample"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "encoder_hidden_states": {0: "batch_size"},
+            "res_hidden_states_0": {0: "batch_size"},
+            "res_hidden_states_1": {0: "batch_size"},
+            "res_hidden_states_2": {0: "batch_size"},
+        },
+    },
+    "up_blocks.2": {
+        "dummy_input": {
+            "hidden_states": (2, 640, 128, 128),
+            "res_hidden_states_0": (2, 320, 128, 128),
+            "res_hidden_states_1": (2, 320, 128, 128),
+            "res_hidden_states_2": (2, 320, 128, 128),
+            "temb": (2, 1280),
+        },
+        "output_names": ["sample"],
+        "dynamic_axes": {
+            "hidden_states": {0: "batch_size"},
+            "temb": {0: "steps"},
+            "res_hidden_states_0": {0: "batch_size"},
+            "res_hidden_states_1": {0: "batch_size"},
+            "res_hidden_states_2": {0: "batch_size"},
+        },
+    },
+}

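Each entry in the new config above pairs a UNet sub-block with the dummy input shapes, output names, and dynamic axes needed to export it to ONNX individually. As a rough sketch of how one entry could drive such an export (the helper below is illustrative, not the repository's actual exporter; dtype and device are assumptions):

```python
import torch

def export_block(block, name, cfg, onnx_dir="onnx"):
    # Build dummy tensors matching the configured shapes (fp16 on GPU is an assumption)
    dummy_inputs = tuple(
        torch.randn(shape, dtype=torch.float16, device="cuda")
        for shape in cfg["dummy_input"].values()
    )
    # Export with the recorded input/output names and dynamic batch/step axes
    torch.onnx.export(
        block,
        dummy_inputs,
        f"{onnx_dir}/{name}.onnx",
        input_names=list(cfg["dummy_input"].keys()),
        output_names=cfg["output_names"],
        dynamic_axes=cfg["dynamic_axes"],
    )

# e.g. export_block(pipe.unet.mid_block, "mid_block", SDXL_ONNX_CONFIG["mid_block"])
```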