
Commit 693db7d

Merge branch 'main' into formats
2 parents: 438ca08 + c2e5ece

370 files changed: +4941 additions, -866 deletions

.github/workflows/nightly_tests.yml

Lines changed: 3 additions & 0 deletions
@@ -340,6 +340,9 @@ jobs:
   - backend: "optimum_quanto"
     test_location: "quanto"
     additional_deps: []
+  - backend: "nvidia_modelopt"
+    test_location: "modelopt"
+    additional_deps: []
 runs-on:
   group: aws-g6e-xlarge-plus
 container:

docs/source/en/_toctree.yml

Lines changed: 3 additions & 1 deletion
@@ -29,7 +29,7 @@
   - local: using-diffusers/other-formats
     title: Model formats
   - local: using-diffusers/push_to_hub
-    title: Push files to the Hub
+    title: Sharing pipelines and models

 - title: Adapters
   isExpanded: false
@@ -188,6 +188,8 @@
       title: torchao
     - local: quantization/quanto
       title: quanto
+    - local: quantization/modelopt
+      title: NVIDIA ModelOpt

 - title: Model accelerators and hardware
   isExpanded: false

docs/source/en/api/pipelines/qwenimage.md

Lines changed: 6 additions & 0 deletions
@@ -120,6 +120,12 @@ The `guidance_scale` parameter in the pipeline is there to support future guidan
   - all
   - __call__

+## QwenImageEditInpaintPipeline
+
+[[autodoc]] QwenImageEditInpaintPipeline
+  - all
+  - __call__
+
 ## QwenImaggeControlNetPipeline
   - all
   - __call__

docs/source/en/modular_diffusers/components_manager.md

Lines changed: 3 additions & 3 deletions
@@ -51,10 +51,10 @@ t2i_pipeline = t2i_blocks.init_pipeline(modular_repo_id, components_manager=comp
 </hfoption>
 </hfoptions>

-Components are only loaded and registered when using [`~ModularPipeline.load_components`] or [`~ModularPipeline.load_default_components`]. The example below uses [`~ModularPipeline.load_default_components`] to create a second pipeline that reuses all the components from the first one, and assigns it to a different collection
+Components are only loaded and registered when using [`~ModularPipeline.load_components`] or [`~ModularPipeline.load_components`]. The example below uses [`~ModularPipeline.load_components`] to create a second pipeline that reuses all the components from the first one, and assigns it to a different collection

 ```py
-pipe.load_default_components()
+pipe.load_components()
 pipe2 = ModularPipeline.from_pretrained("YiYiXu/modular-demo-auto", components_manager=comp, collection="test2")
 ```

@@ -187,4 +187,4 @@ comp.enable_auto_cpu_offload(device="cuda")

 All models begin on the CPU and [`ComponentsManager`] moves them to the appropriate device right before they're needed, and moves other models back to the CPU when GPU memory is low.

-You can set your own rules for which models to offload first.
+You can set your own rules for which models to offload first.

docs/source/en/modular_diffusers/guiders.md

Lines changed: 3 additions & 3 deletions
@@ -75,13 +75,13 @@ Guiders that are already saved on the Hub with a `modular_model_index.json` file
 }
 ```

-The guider is only created after calling [`~ModularPipeline.load_default_components`] based on the loading specification in `modular_model_index.json`.
+The guider is only created after calling [`~ModularPipeline.load_components`] based on the loading specification in `modular_model_index.json`.

 ```py
 t2i_pipeline = t2i_blocks.init_pipeline("YiYiXu/modular-doc-guider")
 # not created during init
 assert t2i_pipeline.guider is None
-t2i_pipeline.load_default_components()
+t2i_pipeline.load_components()
 # loaded as PAG guider
 t2i_pipeline.guider
 ```
@@ -172,4 +172,4 @@ t2i_pipeline.push_to_hub("YiYiXu/modular-doc-guider")
 ```

 </hfoption>
-</hfoptions>
+</hfoptions>

docs/source/en/modular_diffusers/modular_pipeline.md

Lines changed: 7 additions & 7 deletions
@@ -29,7 +29,7 @@ blocks = SequentialPipelineBlocks.from_blocks_dict(TEXT2IMAGE_BLOCKS)
 modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
 pipeline = blocks.init_pipeline(modular_repo_id)

-pipeline.load_default_components(torch_dtype=torch.float16)
+pipeline.load_components(torch_dtype=torch.float16)
 pipeline.to("cuda")

 image = pipeline(prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", output="images")[0]
@@ -49,7 +49,7 @@ blocks = SequentialPipelineBlocks.from_blocks_dict(IMAGE2IMAGE_BLOCKS)
 modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
 pipeline = blocks.init_pipeline(modular_repo_id)

-pipeline.load_default_components(torch_dtype=torch.float16)
+pipeline.load_components(torch_dtype=torch.float16)
 pipeline.to("cuda")

 url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
@@ -73,7 +73,7 @@ blocks = SequentialPipelineBlocks.from_blocks_dict(INPAINT_BLOCKS)
 modular_repo_id = "YiYiXu/modular-loader-t2i-0704"
 pipeline = blocks.init_pipeline(modular_repo_id)

-pipeline.load_default_components(torch_dtype=torch.float16)
+pipeline.load_components(torch_dtype=torch.float16)
 pipeline.to("cuda")

 img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
@@ -176,15 +176,15 @@ diffdiff_pipeline = ModularPipeline.from_pretrained(modular_repo_id, trust_remot

 ## Loading components

-A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load all components with [`~ModularPipeline.load_default_components`] or only load specific components with [`~ModularPipeline.load_components`].
+A [`ModularPipeline`] doesn't automatically instantiate with components. It only loads the configuration and component specifications. You can load all components with [`~ModularPipeline.load_components`] or only load specific components with [`~ModularPipeline.load_components`].

 <hfoptions id="load">
-<hfoption id="load_default_components">
+<hfoption id="load_components">

 ```py
 import torch

-t2i_pipeline.load_default_components(torch_dtype=torch.float16)
+t2i_pipeline.load_components(torch_dtype=torch.float16)
 t2i_pipeline.to("cuda")
 ```

@@ -355,4 +355,4 @@ The [config.json](https://huggingface.co/YiYiXu/modular-diffdiff-0704/blob/main/
     "ModularPipelineBlocks": "block.DiffDiffBlocks"
   }
 }
-```
+```

docs/source/en/modular_diffusers/quickstart.md

Lines changed: 9 additions & 9 deletions
@@ -173,9 +173,9 @@ print(dd_blocks)

 ## ModularPipeline

-Convert the [`SequentialPipelineBlocks`] into a [`ModularPipeline`] with the [`ModularPipeline.init_pipeline`] method. This initializes the expected components to load from a `modular_model_index.json` file. Explicitly load the components by calling [`ModularPipeline.load_default_components`].
+Convert the [`SequentialPipelineBlocks`] into a [`ModularPipeline`] with the [`ModularPipeline.init_pipeline`] method. This initializes the expected components to load from a `modular_model_index.json` file. Explicitly load the components by calling [`ModularPipeline.load_components`].

-It is a good idea to initialize the [`ComponentManager`] with the pipeline to help manage the different components. Once you call [`~ModularPipeline.load_default_components`], the components are registered to the [`ComponentManager`] and can be shared between workflows. The example below uses the `collection` argument to assign the components a `"diffdiff"` label for better organization.
+It is a good idea to initialize the [`ComponentManager`] with the pipeline to help manage the different components. Once you call [`~ModularPipeline.load_components`], the components are registered to the [`ComponentManager`] and can be shared between workflows. The example below uses the `collection` argument to assign the components a `"diffdiff"` label for better organization.

 ```py
 from diffusers.modular_pipelines import ComponentsManager
@@ -209,11 +209,11 @@ Use the [`sub_blocks.insert`] method to insert it into the [`ModularPipeline`].
 dd_blocks.sub_blocks.insert("ip_adapter", ip_adapter_block, 0)
 ```

-Call [`~ModularPipeline.init_pipeline`] to initialize a [`ModularPipeline`] and use [`~ModularPipeline.load_default_components`] to load the model components. Load and set the IP-Adapter to run the pipeline.
+Call [`~ModularPipeline.init_pipeline`] to initialize a [`ModularPipeline`] and use [`~ModularPipeline.load_components`] to load the model components. Load and set the IP-Adapter to run the pipeline.

 ```py
 dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
-dd_pipeline.load_default_components(torch_dtype=torch.float16)
+dd_pipeline.load_components(torch_dtype=torch.float16)
 dd_pipeline.loader.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
 dd_pipeline.loader.set_ip_adapter_scale(0.6)
 dd_pipeline = dd_pipeline.to(device)
@@ -260,14 +260,14 @@ class SDXLDiffDiffControlNetDenoiseStep(StableDiffusionXLDenoiseLoopWrapper):
 controlnet_denoise_block = SDXLDiffDiffControlNetDenoiseStep()
 ```

-Insert the `controlnet_input` block and replace the `denoise` block with the new `controlnet_denoise_block`. Initialize a [`ModularPipeline`] and [`~ModularPipeline.load_default_components`] into it.
+Insert the `controlnet_input` block and replace the `denoise` block with the new `controlnet_denoise_block`. Initialize a [`ModularPipeline`] and [`~ModularPipeline.load_components`] into it.

 ```py
 dd_blocks.sub_blocks.insert("controlnet_input", control_input_block, 7)
 dd_blocks.sub_blocks["denoise"] = controlnet_denoise_block

 dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
-dd_pipeline.load_default_components(torch_dtype=torch.float16)
+dd_pipeline.load_components(torch_dtype=torch.float16)
 dd_pipeline = dd_pipeline.to(device)

 control_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/diffdiff_tomato_canny.jpeg")
@@ -320,7 +320,7 @@ Call [`SequentialPipelineBlocks.from_blocks_dict`] to create a [`SequentialPipel
 ```py
 dd_auto_blocks = SequentialPipelineBlocks.from_blocks_dict(DIFFDIFF_AUTO_BLOCKS)
 dd_pipeline = dd_auto_blocks.init_pipeline("YiYiXu/modular-demo-auto", collection="diffdiff")
-dd_pipeline.load_default_components(torch_dtype=torch.float16)
+dd_pipeline.load_components(torch_dtype=torch.float16)
 ```

 ## Share
@@ -340,5 +340,5 @@ from diffusers.modular_pipelines import ModularPipeline, ComponentsManager
 components = ComponentsManager()

 diffdiff_pipeline = ModularPipeline.from_pretrained("YiYiXu/modular-diffdiff-0704", trust_remote_code=True, components_manager=components, collection="diffdiff")
-diffdiff_pipeline.load_default_components(torch_dtype=torch.float16)
-```
+diffdiff_pipeline.load_components(torch_dtype=torch.float16)
+```

docs/source/en/quantization/modelopt.md (new file)

Lines changed: 141 additions & 0 deletions

@@ -0,0 +1,141 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# NVIDIA ModelOpt

[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, and speculative decoding. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

Before you begin, make sure you have `nvidia_modelopt` installed.

```bash
pip install -U "nvidia_modelopt[hf]"
```

Quantize a model by passing [`NVIDIAModelOptConfig`] to [`~ModelMixin.from_pretrained`] (you can also load pre-quantized models). This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

The example below only quantizes the weights to FP8.

```python
import torch
from diffusers import AutoModel, SanaPipeline, NVIDIAModelOptConfig

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = SanaPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

> **Note:**
>
> The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
>
> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).

## NVIDIAModelOptConfig

The `NVIDIAModelOptConfig` class accepts the following parameters:

- `quant_type`: A string naming one of the quantization types listed below.
- `modules_to_not_convert`: A list of full or partial module names that should not be quantized. For example, to skip quantization of the [`SD3Transformer2DModel`]'s pos_embed projection blocks, specify `modules_to_not_convert=["pos_embed.proj.weight"]`.
- `disable_conv_quantization`: A boolean which, when set to `True`, disables quantization for all convolutional layers in the model. This is useful because channel and block quantization (used with INT4, NF4, and NVFP4) generally don't work well with convolutional layers. To disable quantization only for specific convolutional layers, use `modules_to_not_convert` instead.
- `algorithm`: The algorithm used to determine the scale, defaults to `"max"`. Check the modelopt documentation for more algorithms and details.
- `forward_loop`: The forward loop function used to calibrate activations during quantization. If not provided, calibration relies on static scale values computed from the weights only.
- `kwargs`: A dict of keyword arguments passed to the underlying quantization method, which is invoked based on `quant_type`.
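
To make these options concrete, here is a minimal sketch (with illustrative values, not recommendations) of a config that quantizes weights to FP8 while leaving a projection module and all convolutional layers unquantized:

```python
from diffusers import NVIDIAModelOptConfig

# Minimal sketch with illustrative values: FP8 weights, but skip the
# pos_embed projection and every convolutional layer.
quantization_config = NVIDIAModelOptConfig(
    quant_type="FP8",
    quant_method="modelopt",
    modules_to_not_convert=["pos_embed.proj"],  # full/partial module names to keep in higher precision
    disable_conv_quantization=True,             # skip all conv layers
    algorithm="max",                            # default scale-calibration algorithm
)
```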

## Supported quantization types

ModelOpt supports weight-only, channel, and block quantization in int8, fp8, int4, nf4, and nvfp4. These methods are designed to reduce the memory footprint of the model weights while maintaining the performance of the model during inference.

Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`. This lowers the memory requirements from model weights but retains the memory peaks for activation computation.

The supported quantization methods are as follows:

| **Quantization Type** | **Supported Schemes** | **Required Kwargs** | **Additional Notes** |
|-----------------------|-----------------------|---------------------|----------------------|
| **INT8** | `int8 weight only`, `int8 channel quantization`, `int8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **FP8** | `fp8 weight only`, `fp8 channel quantization`, `fp8 block quantization` | `quant_type`, `quant_type + channel_quantize`, `quant_type + channel_quantize + block_quantize` | |
| **INT4** | `int4 weight only`, `int4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
| **NF4** | `nf4 weight only`, `nf4 double block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize + scale_channel_quantize + scale_block_quantize` | Only `channel_quantize = -1` and `scale_channel_quantize = -1` are supported for now |
| **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | Only `channel_quantize = -1` is supported for now |
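
As a rough sketch of how the extra kwargs from the table are used, the example below requests INT4 block quantization, assuming these kwargs are accepted directly by the config; the block size of 128 is an illustrative choice, not a recommendation:

```python
from diffusers import NVIDIAModelOptConfig

# Sketch only: INT4 block quantization. channel_quantize and block_quantize
# come from the "Required Kwargs" column above; the values are illustrative.
int4_config = NVIDIAModelOptConfig(
    quant_type="INT4",
    quant_method="modelopt",
    channel_quantize=-1,  # only -1 is currently supported for INT4
    block_quantize=128,   # hypothetical block size
)
```
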
Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and an exhaustive list of configuration options.

## Serializing and Deserializing quantized models

To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the [`~ModelMixin.save_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
quant_config_fp8 = {"quant_type": "FP8", "quant_method": "modelopt"}
quant_config_fp8 = NVIDIAModelOptConfig(**quant_config_fp8)
model = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quant_config_fp8,
    torch_dtype=torch.bfloat16,
)
model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
```

To load a serialized quantized model, use the [`~ModelMixin.from_pretrained`] method.

```python
import torch
from diffusers import AutoModel, NVIDIAModelOptConfig, SanaPipeline
from modelopt.torch.opt import enable_huggingface_checkpointing

enable_huggingface_checkpointing()

quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")
transformer = AutoModel.from_pretrained(
    "path/to/sana_fp8",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px_diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")
```

docs/source/en/training/distributed_inference.md

Lines changed: 1 addition & 1 deletion
@@ -223,7 +223,7 @@ from diffusers.image_processor import VaeImageProcessor
 import torch

 vae = AutoencoderKL.from_pretrained(ckpt_id, subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")
-vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
+vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
 image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

 with torch.no_grad():
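
For context, the `- 1` matters because an `AutoencoderKL` downsamples once per block except for the last one, so the spatial scale factor is `2 ** (len(block_out_channels) - 1)`. A quick sanity check, assuming the typical 4-entry `block_out_channels` list used by the SD/SDXL VAEs (the values shown are illustrative):

```python
# Hypothetical config values for illustration; SD/SDXL VAEs use a 4-entry list.
block_out_channels = [128, 256, 512, 512]

old_factor = 2 ** len(block_out_channels)        # 16 -- overestimates the downsampling
new_factor = 2 ** (len(block_out_channels) - 1)  # 8  -- a 1024px image maps to a 128px latent

print(old_factor, new_factor)  # 16 8
```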
