
Commit fc2f124

Merge branch 'main' into hunyuan-video
2 parents: ce7b0b9 + 8957324

37 files changed: +3965 −76 lines

.github/workflows/push_tests.yml

Lines changed: 2 additions & 1 deletion
@@ -165,7 +165,8 @@ jobs:
       group: gcp-ct5lp-hightpu-8t
     container:
       image: diffusers/diffusers-flax-tpu
-      options: --shm-size "16gb" --ipc host --privileged ${{ vars.V5_LITEPOD_8_ENV}} -v /mnt/hf_cache:/mnt/hf_cache defaults:
+      options: --shm-size "16gb" --ipc host --privileged ${{ vars.V5_LITEPOD_8_ENV}} -v /mnt/hf_cache:/mnt/hf_cache
+    defaults:
       run:
         shell: bash
     steps:

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -286,6 +286,8 @@
       title: PriorTransformer
     - local: api/models/sd3_transformer2d
       title: SD3Transformer2DModel
+    - local: api/models/sana_transformer2d
+      title: SanaTransformer2DModel
     - local: api/models/stable_audio_transformer
       title: StableAudioDiTModel
     - local: api/models/transformer2d
@@ -440,6 +442,8 @@
       title: PixArt-α
     - local: api/pipelines/pixart_sigma
       title: PixArt-Σ
+    - local: api/pipelines/sana
+      title: Sana
     - local: api/pipelines/self_attention_guidance
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
docs/source/en/api/models/sana_transformer2d.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# SanaTransformer2DModel

A Diffusion Transformer model for 2D data, introduced in [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) by NVIDIA and MIT HAN Lab (Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han).

The abstract from the paper is:

*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import SanaTransformer2DModel

transformer = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", torch_dtype=torch.float16
)
```
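
The transformer loaded this way can also be handed to a pipeline. A minimal sketch using the standard `from_pretrained` component override (the checkpoint name is taken from the snippet above; everything else is ordinary diffusers composition, not something this doc prescribes):

```python
import torch
from diffusers import SanaPipeline, SanaTransformer2DModel

# Sketch: load the transformer on its own, then pass it to the pipeline
# so the remaining components (VAE, text encoder, scheduler) come from the same repo.
transformer = SanaTransformer2DModel.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers", subfolder="transformer", torch_dtype=torch.float16
)
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers", transformer=transformer, torch_dtype=torch.float16
)
```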

## SanaTransformer2DModel

[[autodoc]] SanaTransformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

docs/source/en/api/pipelines/ltx_video.md

Lines changed: 18 additions & 6 deletions
@@ -31,26 +31,38 @@ import torch
 from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel

 single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-transformer = LTXVideoTransformer3DModel.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
+transformer = LTXVideoTransformer3DModel.from_single_file(
+    single_file_url, torch_dtype=torch.bfloat16
+)
 vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
-pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
+pipe = LTXImageToVideoPipeline.from_pretrained(
+    "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
+)

 # ... inference code ...
 ```

-Alternatively, the pipeline can be used to load the weights with [~FromSingleFileMixin.from_single_file`].
+Alternatively, the pipeline can be used to load the weights with [`~FromSingleFileMixin.from_single_file`].

 ```python
 import torch
 from diffusers import LTXImageToVideoPipeline
 from transformers import T5EncoderModel, T5Tokenizer

 single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
-text_encoder = T5EncoderModel.from_pretrained("Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16)
-tokenizer = T5Tokenizer.from_pretrained("Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16)
-pipe = LTXImageToVideoPipeline.from_single_file(single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16)
+text_encoder = T5EncoderModel.from_pretrained(
+    "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
+)
+tokenizer = T5Tokenizer.from_pretrained(
+    "Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
+)
+pipe = LTXImageToVideoPipeline.from_single_file(
+    single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
+)
 ```

+Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
+
 ## LTXPipeline

 [[autodoc]] LTXPipeline
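
The memory-consumption note added in this hunk only links out to the generic diffusers levers; a rough sketch of applying the usual ones to an LTX pipeline might look like the following (this is an assumption based on the standard `DiffusionPipeline` API, and it assumes the LTX VAE exposes tiling — verify before relying on it):

```python
import torch
from diffusers import LTXImageToVideoPipeline

pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)

# Keep only the active sub-model on the GPU; idle components hop back to CPU.
pipe.enable_model_cpu_offload()

# Decode the video latents in tiles to bound peak VAE memory (assumes tiling is supported by this VAE).
pipe.vae.enable_tiling()
```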
docs/source/en/api/pipelines/sana.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# SanaPipeline

[SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.

The abstract from the paper is:

*We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj) and [chenjy2003](https://github.com/chenjy2003). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model).

Available models:

| Model | Recommended dtype |
|:-----:|:-----------------:|
| [`Efficient-Large-Model/Sana_1600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_MultiLing_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers) | `torch.bfloat16` |
| [`Efficient-Large-Model/Sana_1600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_MultiLing_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_600M_1024px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_diffusers) | `torch.float16` |
| [`Efficient-Large-Model/Sana_600M_512px_diffusers`](https://huggingface.co/Efficient-Large-Model/Sana_600M_512px_diffusers) | `torch.float16` |

Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-673efba2a57ed99843f11f9e) collection for more information.

<Tip>

Make sure to pass the `variant` argument for downloaded checkpoints to use lower disk space. Set it to `"fp16"` for models with recommended dtype as `torch.float16`, and `"bf16"` for models with recommended dtype as `torch.bfloat16`. By default, `torch.float32` weights are downloaded, which use twice the amount of disk storage. Additionally, `torch.float32` weights can be downcasted on-the-fly by specifying the `torch_dtype` argument. Read about it in the [docs](https://huggingface.co/docs/diffusers/v0.31.0/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).

</Tip>

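As a quick sketch of what the tip above amounts to in practice, here is one of the `torch.bfloat16` checkpoints from the table loaded with its matching `variant` and run once (the prompt and output filename are only illustrative):

```python
import torch
from diffusers import SanaPipeline

# Download only the bf16 weights by passing the matching variant, and keep them in bfloat16.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_BF16_diffusers",
    variant="bf16",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Illustrative prompt; any text-to-image prompt works here.
image = pipe(prompt="a tiny astronaut hatching from an egg on the moon").images[0]
image.save("sana_output.png")
```
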
## SanaPipeline

[[autodoc]] SanaPipeline
  - all
  - __call__

## SanaPAGPipeline

[[autodoc]] SanaPAGPipeline
  - all
  - __call__

## SanaPipelineOutput

[[autodoc]] pipelines.sana.pipeline_output.SanaPipelineOutput

examples/flux-control/README.md

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,7 @@ accelerate launch train_control_lora_flux.py \
   --max_train_steps=5000 \
   --validation_image="openpose.png" \
   --validation_prompt="A couple, 4k photo, highly detailed" \
+  --offload \
   --seed="0" \
   --push_to_hub
 ```
@@ -154,6 +155,7 @@ accelerate launch --config_file=accelerate_ds2.yaml train_control_flux.py \
   --validation_steps=200 \
   --validation_image "2_pose_1024.jpg" "3_pose_1024.jpg" \
   --validation_prompt "two friends sitting by each other enjoying a day at the park, full hd, cinematic" "person enjoying a day at the park, full hd, cinematic" \
+  --offload \
   --seed="0" \
   --push_to_hub
 ```

examples/flux-control/train_control_flux.py

Lines changed: 10 additions & 3 deletions
@@ -541,6 +541,11 @@ def parse_args(input_args=None):
         default=1.29,
         help="Scale of mode weighting scheme. Only effective when using the `'mode'` as the `weighting_scheme`.",
     )
+    parser.add_argument(
+        "--offload",
+        action="store_true",
+        help="Whether to offload the VAE and the text encoders to CPU when they are not used.",
+    )

     if input_args is not None:
         args = parser.parse_args(input_args)
@@ -999,8 +1004,9 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 control_latents = encode_images(
                     batch["conditioning_pixel_values"], vae.to(accelerator.device), weight_dtype
                 )
-                # offload vae to CPU.
-                vae.cpu()
+                if args.offload:
+                    # offload vae to CPU.
+                    vae.cpu()

                 # Sample a random timestep for each image
                 # for weighting schemes where we sample timesteps non-uniformly
@@ -1064,7 +1070,8 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 if args.proportion_empty_prompts and random.random() < args.proportion_empty_prompts:
                     prompt_embeds.zero_()
                     pooled_prompt_embeds.zero_()
-                text_encoding_pipeline = text_encoding_pipeline.to("cpu")
+                if args.offload:
+                    text_encoding_pipeline = text_encoding_pipeline.to("cpu")

                 # Predict.
                 model_pred = flux_transformer(

examples/flux-control/train_control_lora_flux.py

Lines changed: 11 additions & 3 deletions
@@ -573,6 +573,11 @@ def parse_args(input_args=None):
         default=1.29,
         help="Scale of mode weighting scheme. Only effective when using the `'mode'` as the `weighting_scheme`.",
     )
+    parser.add_argument(
+        "--offload",
+        action="store_true",
+        help="Whether to offload the VAE and the text encoders to CPU when they are not used.",
+    )

     if input_args is not None:
         args = parser.parse_args(input_args)
@@ -1140,8 +1145,10 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 control_latents = encode_images(
                     batch["conditioning_pixel_values"], vae.to(accelerator.device), weight_dtype
                 )
-                # offload vae to CPU.
-                vae.cpu()
+
+                if args.offload:
+                    # offload vae to CPU.
+                    vae.cpu()

                 # Sample a random timestep for each image
                 # for weighting schemes where we sample timesteps non-uniformly
@@ -1205,7 +1212,8 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                 if args.proportion_empty_prompts and random.random() < args.proportion_empty_prompts:
                     prompt_embeds.zero_()
                     pooled_prompt_embeds.zero_()
-                text_encoding_pipeline = text_encoding_pipeline.to("cpu")
+                if args.offload:
+                    text_encoding_pipeline = text_encoding_pipeline.to("cpu")

                 # Predict.
                 model_pred = flux_transformer(
