
Commit ecbd907

Merge branch 'main' into requirements-custom-blocks
2 parents d159ae0 + 5e181ed commit ecbd907

File tree

136 files changed: +8078 -339 lines


.github/workflows/nightly_tests.yml

Lines changed: 3 additions & 0 deletions
@@ -340,6 +340,9 @@ jobs:
           - backend: "optimum_quanto"
             test_location: "quanto"
             additional_deps: []
+          - backend: "nvidia_modelopt"
+            test_location: "modelopt"
+            additional_deps: []
     runs-on:
       group: aws-g6e-xlarge-plus
     container:

docs/source/en/_toctree.yml

Lines changed: 7 additions & 7 deletions
@@ -24,12 +24,14 @@
     title: Reproducibility
   - local: using-diffusers/schedulers
     title: Load schedulers and models
+  - local: using-diffusers/models
+    title: Models
   - local: using-diffusers/scheduler_features
     title: Scheduler features
   - local: using-diffusers/other-formats
     title: Model files and layouts
   - local: using-diffusers/push_to_hub
-    title: Push files to the Hub
+    title: Sharing pipelines and models

 - title: Adapters
   isExpanded: false
@@ -58,12 +60,6 @@
     title: Batch inference
   - local: training/distributed_inference
     title: Distributed inference
-  - local: using-diffusers/scheduler_features
-    title: Scheduler features
-  - local: using-diffusers/callback
-    title: Pipeline callbacks
-  - local: using-diffusers/image_quality
-    title: Controlling image quality

 - title: Inference optimization
   isExpanded: false
@@ -92,6 +88,8 @@
     title: xDiT
   - local: optimization/para_attn
     title: ParaAttention
+  - local: using-diffusers/image_quality
+    title: FreeU

 - title: Hybrid Inference
   isExpanded: false
@@ -188,6 +186,8 @@
     title: torchao
   - local: quantization/quanto
     title: quanto
+  - local: quantization/modelopt
+    title: NVIDIA ModelOpt

 - title: Model accelerators and hardware
   isExpanded: false

docs/source/en/api/image_processor.md

Lines changed: 6 additions & 0 deletions
@@ -20,6 +20,12 @@ All pipelines with [`VaeImageProcessor`] accept PIL Image, PyTorch tensor, or Nu

 [[autodoc]] image_processor.VaeImageProcessor

+## InpaintProcessor
+
+The [`InpaintProcessor`] accepts `mask` and `image` inputs and processes them together. Optionally, it can accept `padding_mask_crop` and apply a mask overlay.
+
+[[autodoc]] image_processor.InpaintProcessor
+
 ## VaeImageProcessorLDM3D

 The [`VaeImageProcessorLDM3D`] accepts RGB and depth inputs and returns RGB and depth outputs.

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ from diffusers.utils import export_to_video
 pipeline_quant_config = PipelineQuantizationConfig(
     quant_backend="torchao",
     quant_kwargs={"quant_type": "int8wo"},
-    components_to_quantize=["transformer"]
+    components_to_quantize="transformer"
 )

 # fp8 layerwise weight-casting

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 3 additions & 3 deletions
@@ -54,7 +54,7 @@ pipeline_quant_config = PipelineQuantizationConfig(
         "bnb_4bit_quant_type": "nf4",
         "bnb_4bit_compute_dtype": torch.bfloat16
     },
-    components_to_quantize=["transformer"]
+    components_to_quantize="transformer"
 )

 pipeline = HunyuanVideoPipeline.from_pretrained(
@@ -91,7 +91,7 @@ pipeline_quant_config = PipelineQuantizationConfig(
         "bnb_4bit_quant_type": "nf4",
         "bnb_4bit_compute_dtype": torch.bfloat16
     },
-    components_to_quantize=["transformer"]
+    components_to_quantize="transformer"
 )

 pipeline = HunyuanVideoPipeline.from_pretrained(
@@ -139,7 +139,7 @@ export_to_video(video, "output.mp4", fps=15)
         "bnb_4bit_quant_type": "nf4",
         "bnb_4bit_compute_dtype": torch.bfloat16
     },
-    components_to_quantize=["transformer"]
+    components_to_quantize="transformer"
 )

 pipeline = HunyuanVideoPipeline.from_pretrained(

docs/source/en/optimization/memory.md

Lines changed: 46 additions & 3 deletions
@@ -291,13 +291,53 @@ Group offloading moves groups of internal layers ([torch.nn.ModuleList](https://
 > [!WARNING]
 > Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.

-Call [`~ModelMixin.enable_group_offload`] to enable it for standard Diffusers model components that inherit from [`ModelMixin`]. For other model components that don't inherit from [`ModelMixin`], such as a generic [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), use [`~hooks.apply_group_offloading`] instead.
-
-The `offload_type` parameter can be set to `block_level` or `leaf_level`.
+Enable group offloading by configuring the `offload_type` parameter to `block_level` or `leaf_level`.

 - `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter. For example, if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads). This drastically reduces memory requirements.
 - `leaf_level` offloads individual layers at the lowest level and is equivalent to [CPU offloading](#cpu-offloading). But it can be made faster if you use streams without giving up inference speed.

+Group offloading is supported for entire pipelines or individual models. Applying group offloading to the entire pipeline is the easiest option while selectively applying it to individual models gives users more flexibility to use different offloading techniques for different models.
+
+<hfoptions id="group-offloading">
+<hfoption id="pipeline">
+
+Call [`~DiffusionPipeline.enable_group_offload`] on a pipeline.
+
+```py
+import torch
+from diffusers import CogVideoXPipeline
+from diffusers.hooks import apply_group_offloading
+from diffusers.utils import export_to_video
+
+onload_device = torch.device("cuda")
+offload_device = torch.device("cpu")
+
+pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
+pipeline.enable_group_offload(
+    onload_device=onload_device,
+    offload_device=offload_device,
+    offload_type="leaf_level",
+    use_stream=True
+)
+
+prompt = (
+    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
+    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
+    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
+    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
+    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
+    "atmosphere of this unique musical performance."
+)
+video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
+print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
+export_to_video(video, "output.mp4", fps=8)
+```
+
+</hfoption>
+<hfoption id="model">
+
+Call [`~ModelMixin.enable_group_offload`] on standard Diffusers model components that inherit from [`ModelMixin`]. For other model components that don't inherit from [`ModelMixin`], such as a generic [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), use [`~hooks.apply_group_offloading`] instead.
+
 ```py
 import torch
 from diffusers import CogVideoXPipeline
@@ -328,6 +368,9 @@ print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} G
 export_to_video(video, "output.mp4", fps=8)
 ```

+</hfoption>
+</hfoptions>
+
 #### CUDA stream

 The `use_stream` parameter can be activated for CUDA devices that support asynchronous data transfer streams to reduce overall execution time compared to [CPU offloading](#cpu-offloading). It overlaps data transfer and computation by using layer prefetching. The next layer to be executed is loaded onto the GPU while the current layer is still being executed. It can increase CPU memory significantly so ensure you have 2x the amount of memory as the model size.
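For orientation, here is a minimal sketch of the model-level option from the hunk above, which the diff truncates. It reuses the parameter names shown in this commit (`onload_device`, `offload_device`, `offload_type`, `num_blocks_per_group`, `use_stream`); the component names (`transformer`, `text_encoder`) are assumptions about the CogVideoX pipeline rather than part of this diff.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# ModelMixin components expose enable_group_offload directly
pipeline.transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)

# non-ModelMixin components (the T5 text encoder here) go through the functional helper
apply_group_offloading(
    pipeline.text_encoder,
    onload_device=onload_device,
    offload_type="block_level",
    num_blocks_per_group=2,
)
```

Leaf-level offloading with a stream keeps the large transformer's peak VRAM low, while coarser block-level groups are typically enough for the smaller text encoder.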

docs/source/en/quantization/modelopt.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. -->
+
+# NVIDIA ModelOpt
+
+[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
+
+Before you begin, make sure you have `nvidia_modelopt` installed.
+
+```bash
+pip install -U "nvidia_modelopt[hf]"
+```
+
+Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.
+
+The example below only quantizes the weights to FP8.
+
+```python
+import torch
+from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig
+
+model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
+dtype = torch.bfloat16
+
+quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
+transformer = AutoModel.from_pretrained(
+    model_id,
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=dtype,
+)
+pipe = SanaPipeline.from_pretrained(
+    model_id,
+    transformer=transformer,
+    torch_dtype=dtype,
+)
+pipe.to("cuda")
+
+print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
+
+prompt = "A cat holding a sign that says hello world"
+image = pipe(
+    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
+).images[0]
+image.save("output.png")
+```
+
+> **Note:**
+>
+> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
+>
+> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).
+
+## NVIDIAModelOptConfig
+
+The `NVIDIAModelOptConfig` class accepts the following parameters:
+- `quant_type`: A string naming one of the quantization types below.
+- `modules_to_not_convert`: A list of full or partial module names for which quantization should not be performed. For example, to not perform any quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, one would specify: `modules_to_not_convert=["pos_embed.proj.weight"]`.
+- `disable_conv_quantization`: A boolean value which, when set to `True`, disables quantization for all convolutional layers in the model. This is useful because channel and block quantization (used with INT4, NF4, NVFP4) generally don't work well with convolutional layers. If you want to disable quantization for specific convolutional layers, use `modules_to_not_convert` instead.
+- `algorithm`: The algorithm to use for determining scale, defaults to `"max"`. Check the modelopt documentation for more algorithms and details.
+- `forward_loop`: The forward loop function to use for calibrating activations during quantization. If not provided, it relies on static scale values computed using the weights only.
+- `kwargs`: A dict of keyword arguments to pass to the underlying quantization method, which is invoked based on `quant_type`.
+
+## Supported quantization types
+
+ModelOpt supports weight-only, channel, and block quantization in int8, fp8, int4, nf4, and nvfp4. The quantization methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference.
+
+Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.
+
+The quantization methods supported are as follows:
+
+| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
+|-----------------------|-----------------------|---------------------|----------------------|
+| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
+| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
+| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now` |
+| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | `channel_quantize = -1 and scale_channel_quantize = -1 are only supported for now` |
+| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now` |
+
+Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options.
+
+## Serializing and Deserializing quantized models
+
+To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.
+
+```python
+import torch
+from diffusers import AutoModel, NVIDIAModelOptConfig
+from modelopt.torch.opt import enable_huggingface_checkpointing
+
+enable_huggingface_checkpointing()
+
+model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
+quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
+quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
+model = AutoModel.from_pretrained(
+    model_id,
+    subfolder="transformer",
+    quantization_config=quant_config_fp8,
+    torch_dtype=torch.bfloat16,
+)
+model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
+```
+
+To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.
+
+```python
+import torch
+from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
+from modelopt.torch.opt import enable_huggingface_checkpointing
+
+enable_huggingface_checkpointing()
+
+quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
+transformer = AutoModel.from_pretrained(
+    "path/to/sana_fp8",
+    subfolder="transformer",
+    quantization_config=quantization_config,
+    torch_dtype=torch.bfloat16,
+)
+pipe = SanaPipeline.from_pretrained(
+    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
+    transformer=transformer,
+    torch_dtype=torch.bfloat16,
+)
+pipe.to("cuda")
+prompt = "A cat holding a sign that says hello world"
+image = pipe(
+    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
+).images[0]
+image.save("output.png")
+```
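To tie the `NVIDIAModelOptConfig` options above together, here is a minimal sketch that combines `modules_to_not_convert` and `disable_conv_quantization` with the FP8 setup used earlier on this page. The `pos_embed.proj.weight` entry is reused from the parameter list purely as an illustration; whether it matches a given model's module names depends on the architecture.

```py
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig

# FP8 weight-only quantization that skips the pos_embed projection and all conv layers
quant_config = NVIDIAModelOptConfig(
    quant_type="FP8",
    quant_method="modelopt",
    modules_to_not_convert=["pos_embed.proj.weight"],  # full or partial module names
    disable_conv_quantization=True,  # conv layers tend to quantize poorly with channel/block schemes
)

transformer = AutoModel.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```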

docs/source/en/quantization/overview.md

Lines changed: 4 additions & 1 deletion
@@ -34,7 +34,9 @@ Initialize [`~quantizers.PipelineQuantizationConfig`] with the following paramet
 > [!TIP]
 > These `quant_kwargs` arguments are different for each backend. Refer to the [Quantization API](../api/quantization) docs to view the arguments for each backend.

-- `components_to_quantize` specifies which components of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
+- `components_to_quantize` specifies which component(s) of the pipeline to quantize. Typically, you should quantize the most compute intensive components like the transformer. The text encoder is another component to consider quantizing if a pipeline has more than one such as [`FluxPipeline`]. The example below quantizes the T5 text encoder in [`FluxPipeline`] while keeping the CLIP model intact.
+
+`components_to_quantize` accepts either a list for multiple models or a string for a single model.

 The example below loads the bitsandbytes backend with the following arguments from [`~quantizers.quantization_config.BitsAndBytesConfig`], `load_in_4bit`, `bnb_4bit_quant_type`, and `bnb_4bit_compute_dtype`.

@@ -62,6 +64,7 @@ pipe = DiffusionPipeline.from_pretrained(
 image = pipe("photo of a cute dog").images[0]
 ```

+
 ### Advanced quantization

 The `quant_mapping` argument provides more options for how to quantize each individual component in a pipeline, like combining different quantization backends.
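To make the list-versus-string behavior of `components_to_quantize` concrete, here is a small sketch. The torchao backend and `int8wo` kwargs are borrowed from the CogVideoX hunk earlier in this commit; the component names and the `diffusers.quantizers` import path are assumptions based on the surrounding docs.

```py
from diffusers.quantizers import PipelineQuantizationConfig

# single component: a plain string is now accepted
single = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8wo"},
    components_to_quantize="transformer",
)

# multiple components: pass a list instead
multiple = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8wo"},
    components_to_quantize=["transformer", "text_encoder"],
)
```

Either form yields the same per-component behavior; the string simply saves the brackets when only one model needs quantizing.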
