
Commit 73d8144

Merge branch 'main' into Add-AnyText

2 parents 936c2ff + 2dad462

File tree

58 files changed: +8575 −87 lines

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -239,6 +239,8 @@
     title: VQModel
   - local: api/models/autoencoderkl
     title: AutoencoderKL
+  - local: api/models/autoencoderkl_cogvideox
+    title: AutoencoderKLCogVideoX
   - local: api/models/asymmetricautoencoderkl
     title: AsymmetricAutoencoderKL
   - local: api/models/stable_cascade_unet
@@ -263,6 +265,8 @@
     title: FluxTransformer2DModel
   - local: api/models/latte_transformer3d
     title: LatteTransformer3DModel
+  - local: api/models/cogvideox_transformer3d
+    title: CogVideoXTransformer3DModel
   - local: api/models/lumina_nextdit2d
     title: LuminaNextDiT2DModel
   - local: api/models/transformer_temporal
@@ -302,6 +306,8 @@
     title: AutoPipeline
   - local: api/pipelines/blip_diffusion
     title: BLIP-Diffusion
+  - local: api/pipelines/cogvideox
+    title: CogVideoX
   - local: api/pipelines/consistency_models
     title: Consistency Models
   - local: api/pipelines/controlnet

docs/source/en/api/loaders/single_file.md

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:

 ## Supported pipelines

+- [`CogVideoXPipeline`]
 - [`StableDiffusionPipeline`]
 - [`StableDiffusionImg2ImgPipeline`]
 - [`StableDiffusionInpaintPipeline`]
@@ -49,6 +50,7 @@ The [`~loaders.FromSingleFileMixin.from_single_file`] method allows you to load:
 - [`UNet2DConditionModel`]
 - [`StableCascadeUNet`]
 - [`AutoencoderKL`]
+- [`AutoencoderKLCogVideoX`]
 - [`ControlNetModel`]
 - [`SD3Transformer2DModel`]
 - [`FluxTransformer2DModel`]
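
Not part of the diff — for context, loading one of the listed classes via `from_single_file` looks like the minimal sketch below; the checkpoint path is a placeholder, not a file referenced by this commit.

```python
import torch
from diffusers import StableDiffusionPipeline

# "path/to/checkpoint.safetensors" is a hypothetical single-file checkpoint path.
pipe = StableDiffusionPipeline.from_single_file(
    "path/to/checkpoint.safetensors",
    torch_dtype=torch.float16,
).to("cuda")
```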
docs/source/en/api/models/autoencoderkl_cogvideox.md (new file)

Lines changed: 37 additions & 0 deletions

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLCogVideoX

The 3D variational autoencoder (VAE) model with KL loss used in [CogVideoX](https://github.com/THUDM/CogVideo) was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```

## AutoencoderKLCogVideoX

[[autodoc]] AutoencoderKLCogVideoX
- decode
- encode
- all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
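
Not part of the diff — a minimal sketch of round-tripping a clip through the `encode`/`decode` methods listed above; the tensor shape and dtype are illustrative assumptions, not values taken from this commit.

```python
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy video batch with layout (batch, channels, frames, height, width).
video = torch.randn(1, 3, 9, 256, 256, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # AutoencoderKLOutput -> sampled latents
    reconstruction = vae.decode(latents).sample       # DecoderOutput -> reconstructed frames
```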
docs/source/en/api/models/cogvideox_transformer3d.md (new file)

Lines changed: 30 additions & 0 deletions

<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# CogVideoXTransformer3DModel

A Diffusion Transformer model for 3D data, used in [CogVideoX](https://github.com/THUDM/CogVideo), was introduced in [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) by Tsinghua University & ZhipuAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```

## CogVideoXTransformer3DModel

[[autodoc]] CogVideoXTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
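
Not part of the diff — a sketch of plugging a separately loaded transformer into the full pipeline via the standard `from_pretrained` component override; the dtype choice here is an assumption.

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer", torch_dtype=torch.float16
)

# Reuse the pre-loaded transformer instead of letting the pipeline load its own copy.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", transformer=transformer, torch_dtype=torch.float16
).to("cuda")
```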
docs/source/en/api/pipelines/cogvideox.md (new file)

Lines changed: 91 additions & 0 deletions

<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogVideoX

<!-- TODO: update paper with ArXiv link when ready. -->

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf) from Tsinghua University & ZhipuAI.

The abstract from the paper is:

*We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at https://github.com/THUDM/CogVideo.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found at [THUDM/CogVideo](https://github.com/THUDM/CogVideo), and the original weights under [hf.co/THUDM](https://huggingface.co/THUDM).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b").to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```

Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

# CogVideoX works very well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```

The [benchmark](TODO: link) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```

## CogVideoXPipeline

[[autodoc]] CogVideoXPipeline
- all
- __call__

## CogVideoXPipelineOutput

[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput
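
Not part of the diff — if the full pipeline does not fit on the GPU, the generic `DiffusionPipeline` offloading hook can be used instead of `.to("cuda")`; this is a sketch under that assumption, not a benchmarked configuration from this commit.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Move submodules to the GPU only while they run, trading speed for memory.
pipeline.enable_model_cpu_offload()

prompt = "A panda, dressed in a small, red jacket and a tiny hat, plays a guitar in a bamboo forest."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```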

docs/source/en/api/pipelines/flux.md

Lines changed: 84 additions & 3 deletions
@@ -37,7 +37,7 @@ Both checkpoints have slightly difference usage which we detail below.

 ```python
 import torch
-from diffusers import FluxPipeline
+from diffusers import FluxPipeline

 pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
 pipe.enable_model_cpu_offload()
@@ -61,7 +61,7 @@ out.save("image.png")

 ```python
 import torch
-from diffusers import FluxPipeline
+from diffusers import FluxPipeline

 pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
 pipe.enable_model_cpu_offload()
@@ -77,8 +77,89 @@ out = pipe(
 out.save("image.png")
 ```

+## Running FP16 inference
+
+Flux can generate high-quality images with FP16 (for example, to accelerate inference on Turing/Volta GPUs), but it produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing the text encoders to run in FP32 removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.
+
+FP16 inference code:
+
+```python
+import torch
+from diffusers import FluxPipeline
+
+pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)  # can replace schnell with dev
+# to run on low-VRAM GPUs (i.e. between 4 and 32 GB VRAM)
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+
+pipe.to(torch.float16)  # casting here instead of in the pipeline constructor because doing so in the constructor loads all models into CPU memory at once
+
+prompt = "A cat holding a sign that says hello world"
+out = pipe(
+    prompt=prompt,
+    guidance_scale=0.,
+    height=768,
+    width=1360,
+    num_inference_steps=4,
+    max_sequence_length=256,
+).images[0]
+out.save("image.png")
+```
+
+## Single File Loading for the `FluxTransformer2DModel`
+
+The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
+
+<Tip>
+`FP8` inference can be brittle depending on the GPU type, CUDA version, and `torch` version that you are using. It is recommended that you use the `optimum-quanto` library in order to run FP8 inference on your machine.
+</Tip>
+
+The following example demonstrates how to run Flux with less than 16GB of VRAM.
+
+First install `optimum-quanto`:
+
+```shell
+pip install optimum-quanto
+```
+
+Then run the following example:
+
+```python
+import torch
+from diffusers import FluxTransformer2DModel, FluxPipeline
+from transformers import T5EncoderModel, CLIPTextModel
+from optimum.quanto import freeze, qfloat8, quantize
+
+bfl_repo = "black-forest-labs/FLUX.1-dev"
+dtype = torch.bfloat16
+
+transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype)
+quantize(transformer, weights=qfloat8)
+freeze(transformer)
+
+text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
+quantize(text_encoder_2, weights=qfloat8)
+freeze(text_encoder_2)
+
+pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
+pipe.transformer = transformer
+pipe.text_encoder_2 = text_encoder_2
+
+pipe.enable_model_cpu_offload()
+
+prompt = "A cat holding a sign that says hello world"
+image = pipe(
+    prompt,
+    guidance_scale=3.5,
+    output_type="pil",
+    num_inference_steps=20,
+    generator=torch.Generator("cpu").manual_seed(0)
+).images[0]
+
+image.save("flux-fp8-dev.png")
+```
+
 ## FluxPipeline

 [[autodoc]] FluxPipeline
 - all
-- __call__
+- __call__

docs/source/en/api/pipelines/pag.md

Lines changed: 11 additions & 0 deletions
@@ -43,6 +43,11 @@ Since RegEx is supported as a way for matching layer identifiers, it is crucial
 - all
 - __call__

+## KolorsPAGPipeline
+[[autodoc]] KolorsPAGPipeline
+- all
+- __call__
+
 ## StableDiffusionPAGPipeline
 [[autodoc]] StableDiffusionPAGPipeline
 - all
@@ -74,6 +79,12 @@ Since RegEx is supported as a way for matching layer identifiers, it is crucial
 - __call__


+## StableDiffusion3PAGPipeline
+[[autodoc]] StableDiffusion3PAGPipeline
+- all
+- __call__
+
+
 ## PixArtSigmaPAGPipeline
 [[autodoc]] PixArtSigmaPAGPipeline
 - all
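
Not part of the diff — the PAG pipelines documented in this file are typically constructed through `AutoPipelineForText2Image` with `enable_pag=True`; the sketch below uses an SDXL checkpoint and layer identifier as illustrative choices rather than anything added by this commit.

```python
import torch
from diffusers import AutoPipelineForText2Image

# `pag_applied_layers` takes layer identifiers; regex matching is supported, as the doc notes.
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16,
).to("cuda")

image = pipeline(
    "an insect robot preparing a delicious meal",
    num_inference_steps=25,
    guidance_scale=7.0,
    pag_scale=3.0,
).images[0]
image.save("pag_output.png")
```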

examples/dreambooth/train_dreambooth_lora_sd3.py

Lines changed: 1 addition & 1 deletion
@@ -1271,7 +1271,7 @@ def load_model_hook(models, input_dir):
         lora_state_dict = StableDiffusion3Pipeline.lora_state_dict(input_dir)

         transformer_state_dict = {
-            f'{k.replace("transformer.", "")}': v for k, v in lora_state_dict.items() if k.startswith("unet.")
+            f'{k.replace("transformer.", "")}': v for k, v in lora_state_dict.items() if k.startswith("transformer.")
         }
         transformer_state_dict = convert_unet_state_dict_to_peft(transformer_state_dict)
         incompatible_keys = set_peft_model_state_dict(transformer_, transformer_state_dict, adapter_name="default")
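
For context (not part of the diff): the keys returned by `lora_state_dict` here are prefixed with `transformer.`, so the old `unet.` filter matched nothing and produced an empty state dict. A minimal illustration with made-up keys:

```python
# Made-up keys, shaped like the real ones only for illustration.
lora_state_dict = {
    "transformer.transformer_blocks.0.attn.to_q.lora_A.weight": "…",
    "text_encoder.text_model.encoder.layers.0.lora_A.weight": "…",
}

# Old filter: nothing starts with "unet.", so the result is empty.
old = {k.replace("transformer.", ""): v for k, v in lora_state_dict.items() if k.startswith("unet.")}

# Fixed filter: keep the transformer keys and strip the prefix.
new = {k.replace("transformer.", ""): v for k, v in lora_state_dict.items() if k.startswith("transformer.")}

print(list(old))  # []
print(list(new))  # ['transformer_blocks.0.attn.to_q.lora_A.weight']
```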
