
Commit 895428b

Merge branch 'main' into repr-quant-config
2 parents: 17cdc98 + 393aefc

File tree

76 files changed (+9273 / -128 lines)


.github/workflows/pr_tests.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ on:
       - "tests/**.py"
       - ".github/**.yml"
       - "utils/**.py"
+      - "setup.py"
   push:
     branches:
       - ci-*
```

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
```diff
@@ -295,6 +295,8 @@
       title: CogView4Transformer2DModel
     - local: api/models/consisid_transformer3d
       title: ConsisIDTransformer3DModel
+    - local: api/models/cosmos_transformer3d
+      title: CosmosTransformer3DModel
     - local: api/models/dit_transformer2d
       title: DiTTransformer2DModel
     - local: api/models/easyanimate_transformer3d
@@ -363,6 +365,8 @@
       title: AutoencoderKLAllegro
     - local: api/models/autoencoderkl_cogvideox
       title: AutoencoderKLCogVideoX
+    - local: api/models/autoencoderkl_cosmos
+      title: AutoencoderKLCosmos
     - local: api/models/autoencoder_kl_hunyuan_video
       title: AutoencoderKLHunyuanVideo
     - local: api/models/autoencoderkl_ltx_video
@@ -433,6 +437,8 @@
       title: ControlNet-XS with Stable Diffusion XL
     - local: api/pipelines/controlnet_union
       title: ControlNetUnion
+    - local: api/pipelines/cosmos
+      title: Cosmos
     - local: api/pipelines/dance_diffusion
       title: Dance Diffusion
     - local: api/pipelines/ddim
```

docs/source/en/api/models/autoencoderkl_cosmos.md (new file)

Lines changed: 40 additions & 0 deletions

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCosmos

[Cosmos Tokenizers](https://github.com/NVIDIA/Cosmos-Tokenizer).

Supported models:
- [nvidia/Cosmos-1.0-Tokenizer-CV8x8x8](https://huggingface.co/nvidia/Cosmos-1.0-Tokenizer-CV8x8x8)

The model can be loaded with the following code snippet.

```python
from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")
```
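
For orientation, the tokenizer maps videos to and from a compressed latent space. Below is a minimal encode/decode sketch, assuming the usual diffusers VAE conventions (`encode(...).latent_dist`, `decode(...).sample`); the input shape is purely illustrative.

```python
import torch
from diffusers import AutoencoderKLCosmos

vae = AutoencoderKLCosmos.from_pretrained("nvidia/Cosmos-1.0-Tokenizer-CV8x8x8", subfolder="vae")

# Dummy video batch: (batch, channels, num_frames, height, width) — shape chosen only for illustration.
video = torch.randn(1, 3, 9, 256, 256)

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # spatio-temporally compressed latents
    reconstruction = vae.decode(latents).sample       # back to pixel space

print(latents.shape, reconstruction.shape)
```
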
## AutoencoderKLCosmos

[[autodoc]] AutoencoderKLCosmos
  - decode
  - encode
  - all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput

docs/source/en/api/models/cosmos_transformer3d.md (new file)

Lines changed: 30 additions & 0 deletions

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CosmosTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import CosmosTransformer3DModel

transformer = CosmosTransformer3DModel.from_pretrained("nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16)
```
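
When loaded standalone like this, the transformer is usually handed back to a pipeline. A minimal sketch, assuming the standard diffusers pattern of overriding a pipeline component via a keyword argument (the pipeline class is documented in the Cosmos pipeline page added in this same commit):

```python
import torch
from diffusers import CosmosTextToWorldPipeline, CosmosTransformer3DModel

# Load only the transformer in bfloat16...
transformer = CosmosTransformer3DModel.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", subfolder="transformer", torch_dtype=torch.bfloat16
)
# ...then reuse it; the remaining components load from the same checkpoint.
pipe = CosmosTextToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", transformer=transformer, torch_dtype=torch.bfloat16
)
```
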
## CosmosTransformer3DModel

[[autodoc]] CosmosTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

docs/source/en/api/pipelines/cosmos.md (new file)

Lines changed: 41 additions & 0 deletions

<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Cosmos

[Cosmos World Foundation Model Platform for Physical AI](https://huggingface.co/papers/2501.03575) by NVIDIA.

*Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
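
Generation follows the usual diffusers text-to-video pattern. A minimal sketch (the prompt, output filename, and fps below are illustrative and not taken from this commit):

```python
import torch
from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

pipe = CosmosTextToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "A robot arm picks up a red cube from a cluttered workbench."
video = pipe(prompt=prompt).frames[0]  # list of PIL frames for the first (only) video
export_to_video(video, "cosmos_text2world.mp4", fps=30)
```
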
## CosmosTextToWorldPipeline

[[autodoc]] CosmosTextToWorldPipeline
  - all
  - __call__

## CosmosVideoToWorldPipeline

[[autodoc]] CosmosVideoToWorldPipeline
  - all
  - __call__

## CosmosPipelineOutput

[[autodoc]] pipelines.cosmos.pipeline_output.CosmosPipelineOutput

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -52,6 +52,7 @@ The following models are available for the image-to-video pipeline:
 | [`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution with `guidance_scale=1.0`, `true_cfg_scale=6.0`, and a negative prompt. |
 | [`hunyuanvideo-community/HunyuanVideo-I2V-33ch`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo 33-channel I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
 | [`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo 16-channel I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
+| [`lllyasviel/FramePackI2V_HY`](https://huggingface.co/lllyasviel/FramePackI2V_HY) | lllyasviel's model implementing [FramePack](https://arxiv.org/abs/2504.12626), a technique for long-context video generation. |
 
 ## Quantization
```
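
Since the table's `shift` recommendation applies when the scheduler is instantiated, here is a minimal sketch of that setup; it assumes the standard diffusers pattern of rebuilding a scheduler from its config, and the `shift=7.0` value is only one of the suggested 7–20 range:

```python
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoImageToVideoPipeline

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.bfloat16
)
# The table recommends a higher flow-matching shift for the official I2V checkpoints (7-20).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=7.0)
```
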

docs/source/en/quantization/bitsandbytes.md

Lines changed: 9 additions & 3 deletions
````diff
@@ -48,7 +48,7 @@ For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bf
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
 from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
-
+import torch
 from diffusers import AutoModel
 from transformers import T5EncoderModel
 
@@ -88,6 +88,8 @@ Setting `device_map="auto"` automatically fills all available space on the GPU(s
 CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
 
 ```py
+from diffusers import FluxPipeline
+
 pipe = FluxPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     transformer=transformer_8bit,
@@ -132,7 +134,7 @@ For Ada and higher-series GPUs. we recommend changing `torch_dtype` to `torch.bf
 ```py
 from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
 from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
-
+import torch
 from diffusers import AutoModel
 from transformers import T5EncoderModel
 
@@ -171,6 +173,8 @@ Let's generate an image using our quantized models.
 Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.
 
 ```py
+from diffusers import FluxPipeline
+
 pipe = FluxPipeline.from_pretrained(
     "black-forest-labs/FLUX.1-dev",
     transformer=transformer_4bit,
@@ -214,6 +218,8 @@ Check your memory footprint with the `get_memory_footprint` method:
 print(model.get_memory_footprint())
 ```
 
+Note that this only tells you the memory footprint of the model params and does _not_ estimate the inference memory requirements.
+
 Quantized models can be loaded from the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameters:
 
 ```py
@@ -413,4 +419,4 @@ transformer_4bit.dequantize()
 ## Resources
 
 * [End-to-end notebook showing Flux.1 Dev inference in a free-tier Colab](https://gist.github.com/sayakpaul/c76bd845b48759e11687ac550b99d8b4)
-* [Training](https://gist.github.com/sayakpaul/05afd428bc089b47af7c016e42004527)
+* [Training](https://github.com/huggingface/diffusers/blob/8c661ea586bf11cb2440da740dd3c4cf84679b85/examples/dreambooth/README_hidream.md#using-quantization)
````
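
The hunks above reference `transformer_8bit` / `transformer_4bit`, which that guide creates earlier. For context, a minimal end-to-end sketch of the 8-bit variant, assuming the standard bitsandbytes integration; the model IDs and dtypes mirror the guide, but treat the exact values as illustrative:

```python
import torch
from diffusers import AutoModel, FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import T5EncoderModel

# 8-bit weights for the text encoder (a transformers model)...
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=TransformersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)
# ...and for the diffusion transformer (a diffusers model).
transformer_8bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.bfloat16,
)

# Parameter memory only; this does not estimate activation/inference memory.
print(transformer_8bit.get_memory_footprint())
```
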

docs/source/en/quantization/torchao.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -85,7 +85,7 @@ The quantization methods supported are as follows:
 | **Category** | **Full Function Names** | **Shorthands** |
 |--------------|-------------------------|----------------|
 | **Integer quantization** | `int4_weight_only`, `int8_dynamic_activation_int4_weight`, `int8_weight_only`, `int8_dynamic_activation_int8_weight` | `int4wo`, `int4dq`, `int8wo`, `int8dq` |
-| **Floating point 8-bit quantization** | `float8_weight_only`, `float8_dynamic_activation_float8_weight`, `float8_static_activation_float8_weight` | `float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`, `float8_e4m3_tensor`, `float8_e4m3_row` |
+| **Floating point 8-bit quantization** | `float8_weight_only`, `float8_dynamic_activation_float8_weight`, `float8_static_activation_float8_weight` | `float8wo`, `float8wo_e5m2`, `float8wo_e4m3`, `float8dq`, `float8dq_e4m3`, `float8dq_e4m3_tensor`, `float8dq_e4m3_row` |
 | **Floating point X-bit quantization** | `fpx_weight_only` | `fpX_eAwB` where `X` is the number of bits (1-7), `A` is exponent bits, and `B` is mantissa bits. Constraint: `X == A + B + 1` |
 | **Unsigned Integer quantization** | `uintx_weight_only` | `uint1wo`, `uint2wo`, `uint3wo`, `uint4wo`, `uint5wo`, `uint6wo`, `uint7wo` |
```
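
Each shorthand in the table can be passed to `TorchAoConfig` when loading a model. A minimal sketch using one of the corrected names; the model ID is illustrative:

```python
import torch
from diffusers import AutoModel, TorchAoConfig

# "float8dq_e4m3_row" is one of the corrected shorthands from the table above.
quantization_config = TorchAoConfig("float8dq_e4m3_row")
transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
```
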

examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -430,6 +430,9 @@ def parse_args(input_args=None):
         default=4,
         help=("The dimension of the LoRA update matrices."),
     )
+
+    parser.add_argument("--lora_dropout", type=float, default=0.0, help="Dropout probability for LoRA layers")
+
     parser.add_argument(
         "--with_prior_preservation",
         default=False,
@@ -1554,6 +1557,7 @@ def main(args):
     transformer_lora_config = LoraConfig(
         r=args.rank,
         lora_alpha=args.rank,
+        lora_dropout=args.lora_dropout,
         init_lora_weights="gaussian",
         target_modules=target_modules,
     )
@@ -1562,6 +1566,7 @@ def main(args):
         text_lora_config = LoraConfig(
             r=args.rank,
             lora_alpha=args.rank,
+            lora_dropout=args.lora_dropout,
             init_lora_weights="gaussian",
             target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
         )
```
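
The new flag simply threads a dropout probability through to `peft`. A minimal sketch of the resulting adapter config with illustrative values and target modules (the real script derives `target_modules` elsewhere):

```python
from peft import LoraConfig

# Roughly what the script builds when launched with `--rank 4 --lora_dropout 0.1` (illustrative values).
transformer_lora_config = LoraConfig(
    r=4,
    lora_alpha=4,
    lora_dropout=0.1,  # randomly zeroes LoRA activations during training for regularization
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
```
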

examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py

Lines changed: 18 additions & 1 deletion
```diff
@@ -658,6 +658,8 @@ def parse_args(input_args=None):
         default=4,
         help=("The dimension of the LoRA update matrices."),
     )
+    parser.add_argument("--lora_dropout", type=float, default=0.0, help="Dropout probability for LoRA layers")
+
     parser.add_argument(
         "--use_dora",
         action="store_true",
@@ -673,6 +675,15 @@
         default=False,
         help="Cache the VAE latents",
     )
+    parser.add_argument(
+        "--image_interpolation_mode",
+        type=str,
+        default="lanczos",
+        choices=[
+            f.lower() for f in dir(transforms.InterpolationMode) if not f.startswith("__") and not f.endswith("__")
+        ],
+        help="The image interpolation method to use for resizing images.",
+    )
 
     if input_args is not None:
         args = parser.parse_args(input_args)
@@ -907,6 +918,10 @@ def __init__(
         self.num_instance_images = len(self.instance_images)
         self._length = self.num_instance_images
 
+        interpolation = getattr(transforms.InterpolationMode, args.image_interpolation_mode.upper(), None)
+        if interpolation is None:
+            raise ValueError(f"Unsupported interpolation mode {interpolation=}.")
+
         if class_data_root is not None:
             self.class_data_root = Path(class_data_root)
             self.class_data_root.mkdir(parents=True, exist_ok=True)
@@ -921,7 +936,7 @@ def __init__(
 
         self.image_transforms = transforms.Compose(
             [
-                transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR),
+                transforms.Resize(size, interpolation=interpolation),
                 transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),
                 transforms.ToTensor(),
                 transforms.Normalize([0.5], [0.5]),
@@ -1235,6 +1250,7 @@ def main(args):
     unet_lora_config = LoraConfig(
         r=args.rank,
         lora_alpha=args.rank,
+        lora_dropout=args.lora_dropout,
         use_dora=args.use_dora,
         init_lora_weights="gaussian",
         target_modules=["to_k", "to_q", "to_v", "to_out.0"],
@@ -1247,6 +1263,7 @@ def main(args):
         text_lora_config = LoraConfig(
             r=args.rank,
             lora_alpha=args.rank,
+            lora_dropout=args.lora_dropout,
             use_dora=args.use_dora,
             init_lora_weights="gaussian",
             target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
```

0 commit comments
