Commit b3dc1b9

committed
resolve conflicts.
2 parents a982bdd + a9cb08a commit b3dc1b9

49 files changed

+3302
-272
lines changed

docs/source/en/_toctree.yml

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -373,6 +373,8 @@
373373
title: QwenImageTransformer2DModel
374374
- local: api/models/sana_transformer2d
375375
title: SanaTransformer2DModel
376+
- local: api/models/sana_video_transformer3d
377+
title: SanaVideoTransformer3DModel
376378
- local: api/models/sd3_transformer2d
377379
title: SD3Transformer2DModel
378380
- local: api/models/skyreels_v2_transformer_3d
@@ -529,8 +531,6 @@
529531
title: Kandinsky 2.2
530532
- local: api/pipelines/kandinsky3
531533
title: Kandinsky 3
532-
- local: api/pipelines/kandinsky5
533-
title: Kandinsky 5
534534
- local: api/pipelines/kolors
535535
title: Kolors
536536
- local: api/pipelines/latent_consistency_models
@@ -565,6 +565,8 @@
565565
title: Sana
566566
- local: api/pipelines/sana_sprint
567567
title: Sana Sprint
568+
- local: api/pipelines/sana_video
569+
title: Sana Video
568570
- local: api/pipelines/self_attention_guidance
569571
title: Self-Attention Guidance
570572
- local: api/pipelines/semantic_stable_diffusion
@@ -638,6 +640,8 @@
638640
title: HunyuanVideo
639641
- local: api/pipelines/i2vgenxl
640642
title: I2VGen-XL
643+
- local: api/pipelines/kandinsky5_video
644+
title: Kandinsky 5.0 Video
641645
- local: api/pipelines/latte
642646
title: Latte
643647
- local: api/pipelines/ltx_video
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License. -->
11+
12+
# SanaVideoTransformer3DModel
13+
14+
A Diffusion Transformer model for 3D data (video) from [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.
15+
16+
The abstract from the paper is:
17+
18+
*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.*
19+
20+
The model can be loaded with the following code snippet.
21+
22+
```python
23+
from diffusers import SanaVideoTransformer3DModel
24+
import torch
25+
26+
transformer = SanaVideoTransformer3DModel.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
27+
```
28+
29+
## SanaVideoTransformer3DModel
30+
31+
[[autodoc]] SanaVideoTransformer3DModel
32+
33+
## Transformer2DModelOutput
34+
35+
[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
36+

docs/source/en/api/pipelines/kandinsky5.md renamed to docs/source/en/api/pipelines/kandinsky5_video.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,9 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
77
specific language governing permissions and limitations under the License.
88
-->
99

10-
# Kandinsky 5.0
10+
# Kandinsky 5.0 Video
1111

12-
Kandinsky 5.0 is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
12+
Kandinsky 5.0 Video is created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov
1313

1414

1515
Kandinsky 5.0 is a family of diffusion models for Video & Image generation. Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.
@@ -92,7 +92,7 @@ pipe = pipe.to("cuda")
9292

9393
pipe.transformer.set_attention_backend(
9494
"flex"
95-
) # <--- Set attention backend to Flex
95+
) # <--- Set attention backend to Flex
9696
pipe.transformer.compile(
9797
mode="max-autotune-no-cudagraphs",
9898
dynamic=True
@@ -115,7 +115,7 @@ export_to_video(output, "output.mp4", fps=24, quality=9)
115115
```
116116

117117
### Diffusion Distilled model
118-
**⚠️ Warning!** all nocfg and diffusion distilled models should be inferred without CFG (```guidance_scale=1.0```):
118+
**⚠️ Warning!** all nocfg and diffusion distilled models should be inferred without CFG (```guidance_scale=1.0```):
119119

120120
```python
121121
model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"

docs/source/en/api/pipelines/sana_sprint.md

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,6 @@ The abstract from the paper is:
2424

2525
*This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4. We introduce three key innovations: (1) We propose a training-free approach that transforms a pre-trained flow-matching model for continuous-time consistency distillation (sCM), eliminating costly training from scratch and achieving high training efficiency. Our hybrid distillation strategy combines sCM with latent adversarial distillation (LADD): sCM ensures alignment with the teacher model, while LADD enhances single-step generation fidelity. (2) SANA-Sprint is a unified step-adaptive model that achieves high-quality generation in 1-4 steps, eliminating step-specific training and improving efficiency. (3) We integrate ControlNet with SANA-Sprint for real-time interactive image generation, enabling instant visual feedback for user interaction. SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, achieving state-of-the-art performance with 7.59 FID and 0.74 GenEval in only 1 step — outperforming FLUX-schnell (7.94 FID / 0.71 GenEval) while being 10× faster (0.1s vs 1.1s on H100). It also achieves 0.1s (T2I) and 0.25s (ControlNet) latency for 1024×1024 images on H100, and 0.31s (T2I) on an RTX 4090, showcasing its exceptional efficiency and potential for AI-powered consumer applications (AIPC). Code and pre-trained models will be open-sourced.*
2626

27-
> [!TIP]
28-
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
29-
3027
This pipeline was contributed by [lawrence-cj](https://github.com/lawrence-cj), [shuchen Xue](https://github.com/scxue) and [Enze Xie](https://github.com/xieenze). The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://huggingface.co/Efficient-Large-Model/).
3128

3229
Available models:
Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License. -->
14+
15+
# SanaVideoPipeline
16+
17+
<div class="flex flex-wrap space-x-1">
18+
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
19+
<img alt="MPS" src="https://img.shields.io/badge/MPS-000000?style=flat&logo=apple&logoColor=white%22">
20+
</div>
21+
22+
[SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) from NVIDIA and MIT HAN Lab, by Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie.
23+
24+
The abstract from the paper is:
25+
26+
*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. [this https URL](https://github.com/NVlabs/SANA).*
27+
28+
This pipeline was contributed by the SANA Team. The original codebase can be found [here](https://github.com/NVlabs/Sana). The original weights can be found under [hf.co/Efficient-Large-Model](https://hf.co/collections/Efficient-Large-Model/sana-video).
29+
30+
Available models:
31+
32+
| Model | Recommended dtype |
33+
|:-----:|:-----------------:|
34+
| [`Efficient-Large-Model/SANA-Video_2B_480p_diffusers`](https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_480p_diffusers) | `torch.bfloat16` |
35+
36+
Refer to [this](https://huggingface.co/collections/Efficient-Large-Model/sana-video) collection for more information.
37+
38+
Note: The recommended dtype mentioned is for the transformer weights. The text encoder and VAE weights must stay in `torch.bfloat16` or `torch.float32` for the model to work correctly. Please refer to the inference example below to see how to load the model with the recommended dtype.
39+
40+
## Quantization
41+
42+
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
43+
44+
Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`SanaVideoPipeline`] for inference with bitsandbytes.
45+
46+
```py
47+
import torch
48+
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SanaVideoTransformer3DModel, SanaVideoPipeline
49+
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig, AutoModel
50+
51+
quant_config = BitsAndBytesConfig(load_in_8bit=True)
52+
text_encoder_8bit = AutoModel.from_pretrained(
53+
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
54+
subfolder="text_encoder",
55+
quantization_config=quant_config,
56+
torch_dtype=torch.float16,
57+
)
58+
59+
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
60+
transformer_8bit = SanaVideoTransformer3DModel.from_pretrained(
61+
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
62+
subfolder="transformer",
63+
quantization_config=quant_config,
64+
torch_dtype=torch.float16,
65+
)
66+
67+
pipeline = SanaVideoPipeline.from_pretrained(
68+
"Efficient-Large-Model/SANA-Video_2B_480p_diffusers",
69+
text_encoder=text_encoder_8bit,
70+
transformer=transformer_8bit,
71+
torch_dtype=torch.float16,
72+
device_map="balanced",
73+
)
74+
75+
model_score = 30
76+
prompt = "Evening, backlight, side lighting, soft light, high contrast, mid-shot, centered composition, clean solo shot, warm color. A young Caucasian man stands in a forest, golden light glimmers on his hair as sunlight filters through the leaves. He wears a light shirt, wind gently blowing his hair and collar, light dances across his face with his movements. The background is blurred, with dappled light and soft tree shadows in the distance. The camera focuses on his lifted gaze, clear and emotional."
77+
negative_prompt = "A chaotic sequence with misshapen, deformed limbs in heavy motion blur, sudden disappearance, jump cuts, jerky movements, rapid shot changes, frames out of sync, inconsistent character shapes, temporal artifacts, jitter, and ghosting effects, creating a disorienting visual experience."
78+
motion_prompt = f" motion score: {model_score}."
79+
prompt = prompt + motion_prompt
80+
81+
output = pipeline(
82+
prompt=prompt,
83+
negative_prompt=negative_prompt,
84+
height=480,
85+
width=832,
86+
num_frames=81,
87+
guidance_scale=6.0,
88+
num_inference_steps=50
89+
).frames[0]
90+
export_to_video(output, "sana-video-output.mp4", fps=16)
91+
```
92+
93+
## SanaVideoPipeline
94+
95+
[[autodoc]] SanaVideoPipeline
96+
- all
97+
- __call__
98+
99+
100+
## SanaVideoPipelineOutput
101+
102+
[[autodoc]] pipelines.sana.pipeline_sana_video.SanaVideoPipelineOutput

examples/unconditional_image_generation/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -104,6 +104,8 @@ To use your own dataset, there are 2 ways:
104104
- you can either provide your own folder as `--train_data_dir`
105105
- or you can upload your dataset to the hub (possibly as a private repo, if you prefer so), and simply pass the `--dataset_name` argument.
106106

107+
If your dataset contains 16 or 32-bit channels (for example, medical TIFFs), add the `--preserve_input_precision` flag so the preprocessing keeps the original precision while still training a 3-channel model. Precision still depends on the decoder: Pillow keeps 16-bit grayscale and float inputs, but many 16-bit RGB files are decoded as 8-bit RGB, and the flag cannot recover precision lost at load time.
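
To see why avoiding the 8-bit conversion can matter, here is a minimal sketch in plain Python. It assumes a simplified 8-bit conversion (keeping only the high byte of a 16-bit sample), which is not necessarily how Pillow converts images, and the helper names are hypothetical rather than part of the training script.

```python
# Illustrative only: these helpers are hypothetical and not part of train_unconditional.py.


def to_8bit(sample_16bit: int) -> int:
    """Simplified 16-bit -> 8-bit conversion: keep only the high byte."""
    return sample_16bit >> 8


def to_unit_float(sample_16bit: int) -> float:
    """Scale a 16-bit sample to [0, 1] without quantizing it first."""
    return sample_16bit / 65535


a, b = 1000, 1020  # two nearby 16-bit intensities

print(to_8bit(a), to_8bit(b))              # both fall into the same 8-bit bin
print(to_unit_float(a), to_unit_float(b))  # still distinct as floats
```

Under this assumption, the two intensities become indistinguishable after 8-bit conversion but stay distinct when scaled directly to floats, which is the kind of detail the flag is meant to keep.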
108+
107109
Below, we explain both in more detail.
108110

109111
#### Provide the dataset as a folder

examples/unconditional_image_generation/train_unconditional.py

Lines changed: 51 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,24 @@ def _extract_into_tensor(arr, timesteps, broadcast_shape):
5252
return res.expand(broadcast_shape)
5353

5454

55+
def _ensure_three_channels(tensor: torch.Tensor) -> torch.Tensor:
56+
"""
57+
Ensure the tensor has exactly three channels (C, H, W) by repeating or truncating channels when needed.
58+
"""
59+
if tensor.ndim == 2:
60+
tensor = tensor.unsqueeze(0)
61+
channels = tensor.shape[0]
62+
if channels == 3:
63+
return tensor
64+
if channels == 1:
65+
return tensor.repeat(3, 1, 1)
66+
if channels == 2:
67+
return torch.cat([tensor, tensor[:1]], dim=0)
68+
if channels > 3:
69+
return tensor[:3]
70+
raise ValueError(f"Unsupported number of channels: {channels}")
71+
72+
5573
def parse_args():
5674
parser = argparse.ArgumentParser(description="Simple example of a training script.")
5775
parser.add_argument(
@@ -260,6 +278,11 @@ def parse_args():
260278
parser.add_argument(
261279
"--enable_xformers_memory_efficient_attention", action="store_true", help="Whether or not to use xformers."
262280
)
281+
parser.add_argument(
282+
"--preserve_input_precision",
283+
action="store_true",
284+
help="Preserve 16/32-bit image precision by avoiding 8-bit RGB conversion while still producing 3-channel tensors.",
285+
)
263286

264287
args = parser.parse_args()
265288
env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
@@ -453,19 +476,41 @@ def load_model_hook(models, input_dir):
453476
# https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder
454477

455478
# Preprocessing the datasets and DataLoaders creation.
479+
spatial_augmentations = [
480+
transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
481+
transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
482+
transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
483+
]
484+
456485
augmentations = transforms.Compose(
457-
[
458-
transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
459-
transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
460-
transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
486+
spatial_augmentations
487+
+ [
461488
transforms.ToTensor(),
462489
transforms.Normalize([0.5], [0.5]),
463490
]
464491
)
465492

493+
precision_augmentations = transforms.Compose(
494+
[
495+
transforms.PILToTensor(),
496+
transforms.Lambda(_ensure_three_channels),
497+
transforms.ConvertImageDtype(torch.float32),
498+
]
499+
+ spatial_augmentations
500+
+ [transforms.Normalize([0.5], [0.5])]
501+
)
502+
466503
def transform_images(examples):
467-
images = [augmentations(image.convert("RGB")) for image in examples["image"]]
468-
return {"input": images}
504+
processed = []
505+
for image in examples["image"]:
506+
if not args.preserve_input_precision:
507+
processed.append(augmentations(image.convert("RGB")))
508+
else:
509+
precise_image = image
510+
if precise_image.mode == "P":
511+
precise_image = precise_image.convert("RGB")
512+
processed.append(precision_augmentations(precise_image))
513+
return {"input": processed}
469514

470515
logger.info(f"Dataset size: {len(dataset)}")
471516
