
Commit 54cd4a3

cjluo-nv and kevalmorabia97 authored and committed
nvidia-modelopt 0.13 examples release
1 parent 06a1553 commit 54cd4a3


67 files changed: +2,401 additions, −471 deletions

README.md

Lines changed: 10 additions & 3 deletions
@@ -16,6 +16,7 @@

## Latest News

+- \[2024/06/03\] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
- \[2024/05/08\] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
- \[2024/03/27\] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
- \[2024/03/18\] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)
@@ -45,7 +46,7 @@ Model Optimizer is available for free for all developers on [NVIDIA PyPI](https:

### [PIP](https://pypi.org/project/nvidia-modelopt/)

```bash
-pip install "nvidia-modelopt[all]~=0.11.0" --extra-index-url https://pypi.nvidia.com
+pip install "nvidia-modelopt[all]~=0.13.0" --extra-index-url https://pypi.nvidia.com
```

See the [installation guide](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html) for more fine-grained control over the installation.
@@ -67,6 +68,8 @@ docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_example
python -c "import modelopt"
```

+Alternatively, for PyTorch you can also use the [NVIDIA NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags), which comes with Model Optimizer pre-installed starting from the 24.06 PyTorch container. Make sure to update Model Optimizer to the latest version if it is not already.
+
## Techniques

### Quantization
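For reference, a minimal shell sketch of that container workflow (an editor's illustration, not part of the commit; the `nvcr.io/nvidia/pytorch:24.06-py3` tag follows standard NGC naming and should be adjusted to the latest release):

```bash
# Start the NGC PyTorch container (24.06 is the first tag with Model Optimizer pre-installed)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.06-py3
# Inside the container, update Model Optimizer to the latest release
pip install -U "nvidia-modelopt[all]~=0.13.0" --extra-index-url https://pypi.nvidia.com
```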
@@ -79,8 +82,12 @@ Sparsity is a technique to further reduce the memory footprint of deep learning

## Examples

-- [PTQ for LLMs](./llm_ptq/README.md) covers how to use Post-training quantization (PTQ) for popular pre-trained [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Hugging Face](https://huggingface.co/docs/hub/en/models-the-hub) models, export to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) for deployment.
-- [PTQ for Diffusers](./diffusers/README.md) walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with [TensorRT](https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion). The Diffusers example in this repo is complementary to the [demoDiffusion example in TensorRT repo](https://github.com/NVIDIA/TensorRT/tree/release/9.3/demo/Diffusion#introduction) and includes FP8 plugins as well as the latest updates on INT8 quantization.
+- [PTQ for LLMs](./llm_ptq/README.md) covers how to use Post-training quantization (PTQ) and export to [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) for deployment for popular pre-trained models from frameworks like
+  - [Hugging Face](https://huggingface.co/docs/hub/en/models-the-hub)
+  - [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
+  - [NVIDIA Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+  - [Medusa](https://github.com/FasterDecoding/Medusa)
+- [PTQ for Diffusers](./diffusers/quantization/README.md) walks through how to quantize a diffusion model with FP8 or INT8, export to ONNX, and deploy with [TensorRT](https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion). The Diffusers example in this repo is complementary to the [demoDiffusion example in TensorRT repo](https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion#introduction) and includes FP8 plugins as well as the latest updates on INT8 quantization.
- [QAT for LLMs](./llm_qat/README.md) demonstrates the recipe and workflow for Quantization-aware Training (QAT), which can further preserve model accuracy at low precisions (e.g., INT4, or 4-bit in [NVIDIA Blackwell platform](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)).
- [Sparsity for LLMs](./llm_sparsity/README.md) shows how to perform Post-training Sparsification and Sparsity-aware fine-tuning on a pre-trained Hugging Face model.
- [ONNX PTQ](./onnx_ptq/README.md) shows how to quantize the ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Cache Diffusion

## News

- [Utilizing DeepCache to Accelerate Stable Diffusion-XL Benchmarks in MLPerf Yields Leading Results](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)

## Introduction

| Supported Framework | Supported Models |
|----------|----------|
| **PyTorch** | [**PixArt-α**](https://huggingface.co/PixArt-alpha/PixArt-XL-2-1024-MS), [**Stable Diffusion - XL**](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), [**SVD**](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) |
| **TensorRT** | **WIP** |

Cache Diffusion methods, such as [DeepCache](https://arxiv.org/abs/2312.00858), [Block Caching](https://arxiv.org/abs/2312.03209) and [T-Gate](https://arxiv.org/abs/2404.02747), optimize performance by reusing cached outputs from previous steps instead of recalculating them. This **training-free** caching approach is compatible with a variety of models, like **DiT** and **UNet**, enabling considerable acceleration without compromising quality.

<p align="center">
<img src="./assets/sdxl_cache.png" width="900"/>
</p>
<p align="center">
This diagram shows the default SDXL cache compute graph used in this example.
A significant speedup is achieved by skipping certain blocks at specific steps.
</p>
## Quick Start

1. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

2. Refer to the provided [example.ipynb](./example.ipynb) for detailed instructions on using cache diffusion.

Using our API, users can create various compute graphs by simply adjusting the parameters. For instance, the default configuration for SDXL is:

```python
SDXL_DEFAULT_CONFIG = [
    {
        "wildcard_or_filter_func": lambda name: "up_blocks.2" not in name,
        "select_cache_step_func": lambda step: (step % 2) != 0,
    }
]

cachify.prepare(pipe, num_inference_steps, SDXL_DEFAULT_CONFIG)
```
Two parameters are essential: `wildcard_or_filter_func` and `select_cache_step_func`.

`wildcard_or_filter_func`: This can be a **str** or a **function**. If a module matches the given string or filter function, the cache operation is applied to it. For example, the string `*up_blocks*` matches every module name containing `up_blocks` (strings are matched with `fnmatch`), so all of those modules will be cached. If you pass a function instead, each module name is passed to your function, and the module is cached if the function returns True.

`select_cache_step_func`: During inference, the code checks at each step whether to perform the cache operation, based on the `select_cache_step_func` you provided. If `select_cache_step_func(current_step)` returns True, the module is cached at that step; otherwise, it is not.

Multiple configurations can be set up, but make sure each `wildcard_or_filter_func` matches what you intend. If you pass more than one configuration with the same `wildcard_or_filter_func`, the later one in the list overwrites the earlier ones. A combined example is sketched below.
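For illustration, a minimal sketch of a two-entry configuration mixing the string and function forms. This is an editor's sketch, not part of the commit: the `cache_diffusion.cachify` import path, the `mid_block` pattern, and the step counts are assumptions.

```python
import torch
from diffusers import StableDiffusionXLPipeline

from cache_diffusion import cachify  # assumed import path for this example

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

num_inference_steps = 30
CUSTOM_CONFIG = [
    {
        # str form: an fnmatch wildcard; matches every module name containing "up_blocks"
        "wildcard_or_filter_func": "*up_blocks*",
        # cache on odd steps, recompute on even steps
        "select_cache_step_func": lambda step: (step % 2) != 0,
    },
    {
        # function form: match mid-block modules (hypothetical pattern), cache 2 of every 3 steps
        "wildcard_or_filter_func": lambda name: "mid_block" in name,
        "select_cache_step_func": lambda step: (step % 3) != 0,
    },
]

cachify.prepare(pipe, num_inference_steps, CUSTOM_CONFIG)
image = pipe("a photo of a corgi", num_inference_steps=num_inference_steps).images[0]
```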
## Demo

The following demo images were generated with `PyTorch==2.3.0` on a single Ada 6000 GPU. TensorRT support will be available in the next ModelOpt release.

Compared with naively reducing the number of generation steps, cache diffusion achieves the same speedup with much better image quality, close to that of the reference image. If the image quality does not meet your needs or product requirements, you can replace the default configuration with your own customized settings.

### Stable Diffusion - XL

<p align="center">
<img src="./assets/SDXL_Cache_Diffusion_Img.png" />
</p>
Binary image assets added (1.52 MB and 445 KB).
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

import fnmatch

from diffusers.models.attention import FeedForward
from diffusers.models.attention_processor import Attention
from diffusers.models.resnet import ResnetBlock2D, TemporalResnetBlock
from diffusers.pipelines.pixart_alpha.pipeline_pixart_alpha import PixArtAlphaPipeline
from diffusers.pipelines.stable_diffusion_xl.pipeline_stable_diffusion_xl import (
    StableDiffusionXLPipeline,
)
from diffusers.pipelines.stable_video_diffusion.pipeline_stable_video_diffusion import (
    StableVideoDiffusionPipeline,
)

from .module import CachedModule
from .utils import replace_module

SUPPORTED_METHODS = {PixArtAlphaPipeline, StableDiffusionXLPipeline, StableVideoDiffusionPipeline}


def cachify(model, num_inference_steps, config_list):
    # Wrap every matching Attention/ResNet/FeedForward block in a CachedModule.
    for name, module in model.named_modules():
        for config in config_list:
            if _pass(name, config["wildcard_or_filter_func"]) and isinstance(
                module, (Attention, ResnetBlock2D, TemporalResnetBlock, FeedForward)
            ):
                replace_module(
                    model,
                    name,
                    CachedModule(module, num_inference_steps, config["select_cache_step_func"]),
                )


def disable(pipe):
    # Turn caching off for every CachedModule in the pipeline's model.
    model = get_model(pipe)
    for _, module in model.named_modules():
        if isinstance(module, CachedModule):
            module.disable_cache()


def enable(pipe):
    # Turn caching back on for every CachedModule in the pipeline's model.
    model = get_model(pipe)
    for _, module in model.named_modules():
        if isinstance(module, CachedModule):
            module.enable_cache()


def _pass(name, wildcard_or_filter_func):
    # A str is treated as an fnmatch wildcard; a callable is applied to the module name.
    if isinstance(wildcard_or_filter_func, str):
        return fnmatch.fnmatch(name, wildcard_or_filter_func)
    elif callable(wildcard_or_filter_func):
        return wildcard_or_filter_func(name)
    else:
        raise NotImplementedError(f"Unsupported type {type(wildcard_or_filter_func)}")


def get_model(pipe):
    # UNet-based pipelines expose `unet`; DiT-based pipelines expose `transformer`.
    if hasattr(pipe, "unet"):
        model = pipe.unet
    elif hasattr(pipe, "transformer"):
        model = pipe.transformer
    else:
        raise KeyError

    return model


def prepare(pipe, num_inference_steps, config_list):
    assert pipe.__class__ in SUPPORTED_METHODS, f"{pipe.__class__} is not supported!"

    model = get_model(pipe)

    cachify(model, num_inference_steps, config_list)
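Editor's usage note, not part of the commit: after `prepare`, caching can be toggled on a prepared pipeline with the `enable`/`disable` helpers defined above. A minimal sketch, assuming `pipe` and `SDXL_DEFAULT_CONFIG` from the README example and a `cache_diffusion.cachify` import path:

```python
from cache_diffusion import cachify  # assumed import path

cachify.prepare(pipe, 30, SDXL_DEFAULT_CONFIG)

cachify.disable(pipe)  # every CachedModule recomputes each step (and resets its step counter)
reference = pipe("a photo of a corgi", num_inference_steps=30).images[0]

cachify.enable(pipe)  # restore cached execution on the configured steps
accelerated = pipe("a photo of a corgi", num_inference_steps=30).images[0]
```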
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

from torch import nn


class CachedModule(nn.Module):
    def __init__(self, block, num_inference_steps, select_cache_step_func) -> None:
        super().__init__()
        self.block = block
        self.num_inference_steps = num_inference_steps
        self.select_cache_step_func = select_cache_step_func
        self.cur_step = 0
        self.cached_results = None
        self.enabled = True

    def __getattr__(self, name):
        # Fall back to the wrapped block for attributes nn.Module does not own.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.block, name)

    def if_cache(self):
        return self.select_cache_step_func(self.cur_step) and self.enabled

    def enable_cache(self):
        self.enabled = True

    def disable_cache(self):
        self.enabled = False
        self.cur_step = 0

    def reset_num_inference_steps(self, new_step):
        self.num_inference_steps = new_step

    def forward(self, *args, **kwargs):
        # On non-cache steps, run the wrapped block and refresh the cache;
        # on cache steps, skip the computation and return the stored result.
        if not self.if_cache():
            self.cached_results = self.block(*args, **kwargs)
        if self.enabled:
            self.cur_step = (self.cur_step + 1) % self.num_inference_steps
        return self.cached_results
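To make the stepping logic concrete, a small editor's sketch (not part of the commit) that counts how often the wrapped block actually runs under the default odd-step cache schedule; `CountingBlock` is a hypothetical stand-in for a diffusion block:

```python
import torch
from torch import nn


class CountingBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return x + 1


# Cache on odd steps; recompute on even steps (mirrors SDXL_DEFAULT_CONFIG).
cached = CachedModule(CountingBlock(), 4, lambda step: (step % 2) != 0)

x = torch.zeros(1)
for _ in range(4):
    cached(x)

print(cached.block.calls)  # 2: steps 0 and 2 recompute; steps 1 and 3 reuse the cache
```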
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: MIT
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

import re

SDXL_DEFAULT_CONFIG = [
    {
        "wildcard_or_filter_func": lambda name: "up_blocks.2" not in name,
        "select_cache_step_func": lambda step: (step % 2) != 0,
    }
]

PIXART_DEFAULT_CONFIG = [
    {
        "wildcard_or_filter_func": lambda name: not re.search(
            r"transformer_blocks\.(2[1-7])\.", name
        ),
        "select_cache_step_func": lambda step: (step % 3) != 0,
    }
]

SVD_DEFAULT_CONFIG = [
    {
        "wildcard_or_filter_func": lambda name: "up_blocks.3" not in name,
        "select_cache_step_func": lambda step: (step % 2) != 0,
    }
]


def replace_module(parent, name_path, new_module):
    # Walk the dotted module path and swap the leaf submodule in place.
    path_parts = name_path.split(".")
    for part in path_parts[:-1]:
        parent = getattr(parent, part)
    setattr(parent, path_parts[-1], new_module)
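A quick editor's sketch (not part of the commit) of how `replace_module` swaps a nested submodule by its dotted name path:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Sequential(nn.ReLU(), nn.Linear(4, 2)))
# Submodule names here are "0", "1.0", "1.1"; swap the inner Linear for an Identity.
replace_module(model, "1.1", nn.Identity())
assert isinstance(model[1][1], nn.Identity)
```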
