
Commit 66704ac

Commit message: up
Merge commit with 2 parents: 9db988a + 55d49d4

36 files changed: +828 additions, −95 deletions

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -525,6 +525,8 @@
       title: Kandinsky 2.2
     - local: api/pipelines/kandinsky3
       title: Kandinsky 3
+    - local: api/pipelines/kandinsky5
+      title: Kandinsky 5
     - local: api/pipelines/kolors
       title: Kolors
     - local: api/pipelines/latent_consistency_models

docs/source/en/api/models/chroma_transformer.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
 
 # ChromaTransformer2DModel
 
-A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma)
+A modified flux Transformer model from [Chroma](https://huggingface.co/lodestones/Chroma1-HD)
 
 ## ChromaTransformer2DModel
 
docs/source/en/api/pipelines/chroma.md

Lines changed: 7 additions & 6 deletions
@@ -19,20 +19,21 @@ specific language governing permissions and limitations under the License.
 
 Chroma is a text to image generation model based on Flux.
 
-Original model checkpoints for Chroma can be found [here](https://huggingface.co/lodestones/Chroma).
+Original model checkpoints for Chroma can be found here:
+* High-resolution finetune: [lodestones/Chroma1-HD](https://huggingface.co/lodestones/Chroma1-HD)
+* Base model: [lodestones/Chroma1-Base](https://huggingface.co/lodestones/Chroma1-Base)
+* Original repo with progress checkpoints: [lodestones/Chroma](https://huggingface.co/lodestones/Chroma) (loading this repo with `from_pretrained` will load a Diffusers-compatible version of the `unlocked-v37` checkpoint)
 
 > [!TIP]
 > Chroma can use all the same optimizations as Flux.
 
 ## Inference
 
-The Diffusers version of Chroma is based on the [`unlocked-v37`](https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors) version of the original model, which is available in the [Chroma repository](https://huggingface.co/lodestones/Chroma).
-
 ```python
 import torch
 from diffusers import ChromaPipeline
 
-pipe = ChromaPipeline.from_pretrained("lodestones/Chroma", torch_dtype=torch.bfloat16)
+pipe = ChromaPipeline.from_pretrained("lodestones/Chroma1-HD", torch_dtype=torch.bfloat16)
 pipe.enable_model_cpu_offload()
 
 prompt = [
@@ -63,10 +64,10 @@ Then run the following example
 import torch
 from diffusers import ChromaTransformer2DModel, ChromaPipeline
 
-model_id = "lodestones/Chroma"
+model_id = "lodestones/Chroma1-HD"
 dtype = torch.bfloat16
 
-transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma/blob/main/chroma-unlocked-v37.safetensors", torch_dtype=dtype)
+transformer = ChromaTransformer2DModel.from_single_file("https://huggingface.co/lodestones/Chroma1-HD/blob/main/Chroma1-HD.safetensors", torch_dtype=dtype)
 
 pipe = ChromaPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=dtype)
 pipe.enable_model_cpu_offload()
docs/source/en/api/pipelines/kandinsky5.md

Lines changed: 149 additions & 0 deletions (new file)

@@ -0,0 +1,149 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Kandinsky 5.0

Kandinsky 5.0 was created by the Kandinsky team: Alexey Letunovskiy, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Dmitrii Mikhailov, Anna Averchenkova, Andrey Shutkin, Julia Agafonova, Olga Kim, Anastasiia Kargapoltseva, Nikita Kiselev, Anna Dmitrienko, Anastasia Maltseva, Kirill Chernyshev, Ilia Vasiliev, Viacheslav Vasilev, Vladimir Polovnikov, Yury Kolabushin, Alexander Belykh, Mikhail Mamaev, Anastasia Aliaskina, Tatiana Nikulina, Polina Gavrilova, Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Denis Dimitrov.

Kandinsky 5.0 is a family of diffusion models for video and image generation. Kandinsky 5.0 T2V Lite is a lightweight video generation model (2B parameters) that ranks #1 among open-source models in its class. It outperforms larger models and offers the best understanding of Russian concepts in the open-source ecosystem.

The model introduces several key innovations:
- **Latent diffusion pipeline** with **Flow Matching** for improved training stability
- **Diffusion Transformer (DiT)** as the main generative backbone with cross-attention to text embeddings
- Dual text encoding using **Qwen2.5-VL** and **CLIP** for comprehensive text understanding
- **HunyuanVideo 3D VAE** for efficient video encoding and decoding
- **Sparse attention mechanisms** (NABLA) for efficient long-sequence processing

The original codebase can be found at [ai-forever/Kandinsky-5](https://github.com/ai-forever/Kandinsky-5).

> [!TIP]
> Check out the [AI Forever](https://huggingface.co/ai-forever) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.

## Available Models

Kandinsky 5.0 T2V Lite comes in several variants optimized for different use cases:

| model_id | Description | Use Cases |
|----------|-------------|-----------|
| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers** | 5-second supervised fine-tuned (SFT) model | Highest generation quality |
| **ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers** | 10-second supervised fine-tuned (SFT) model | Highest generation quality |
| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-5s-Diffusers** | 5-second classifier-free-guidance distilled model | 2× faster inference |
| **ai-forever/Kandinsky-5.0-T2V-Lite-nocfg-10s-Diffusers** | 10-second classifier-free-guidance distilled model | 2× faster inference |
| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers** | 5-second diffusion model distilled to 16 steps | 6× faster inference, minimal quality loss |
| **ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-10s-Diffusers** | 10-second diffusion model distilled to 16 steps | 6× faster inference, minimal quality loss |
| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-5s-Diffusers** | 5-second base pretrained model | Research and fine-tuning |
| **ai-forever/Kandinsky-5.0-T2V-Lite-pretrain-10s-Diffusers** | 10-second base pretrained model | Research and fine-tuning |

All models are available in 5-second and 10-second video generation versions.

## Kandinsky5T2VPipeline

[[autodoc]] Kandinsky5T2VPipeline
  - all
  - __call__

## Usage Examples

### Basic Text-to-Video Generation

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

# Load the pipeline
model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-sft-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# Generate video
prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=121,  # ~5 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### 10-Second Models
**⚠️ Warning!** All 10-second models should be used with the Flex attention backend and `max-autotune-no-cudagraphs` compilation:

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

pipe = Kandinsky5T2VPipeline.from_pretrained(
    "ai-forever/Kandinsky-5.0-T2V-Lite-sft-10s-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

pipe.transformer.set_attention_backend(
    "flex"
)  # <--- Set attention backend to Flex
pipe.transformer.compile(
    mode="max-autotune-no-cudagraphs",
    dynamic=True,
)  # <--- Compile with max-autotune-no-cudagraphs

prompt = "A cat and a dog baking a cake together in a kitchen."
negative_prompt = "Static, 2D cartoon, cartoon, 2d animation, paintings, images, worst quality, low quality, ugly, deformed, walking backwards"

output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=512,
    width=768,
    num_frames=241,  # ~10 seconds at 24fps
    num_inference_steps=50,
    guidance_scale=5.0,
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

### Diffusion Distilled Models
**⚠️ Warning!** All no-CFG and diffusion-distilled models should be run without CFG (`guidance_scale=1.0`):

```python
import torch
from diffusers import Kandinsky5T2VPipeline
from diffusers.utils import export_to_video

model_id = "ai-forever/Kandinsky-5.0-T2V-Lite-distilled16steps-5s-Diffusers"
pipe = Kandinsky5T2VPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

output = pipe(
    prompt="A beautiful sunset over mountains",
    num_inference_steps=16,  # <--- Model is distilled to 16 steps
    guidance_scale=1.0,  # <--- no CFG
).frames[0]

export_to_video(output, "output.mp4", fps=24, quality=9)
```

## Citation
```bibtex
@misc{kandinsky2025,
    author = {Alexey Letunovskiy and Maria Kovaleva and Ivan Kirillov and Lev Novitskiy and Denis Koposov and
              Dmitrii Mikhailov and Anna Averchenkova and Andrey Shutkin and Julia Agafonova and Olga Kim and
              Anastasiia Kargapoltseva and Nikita Kiselev and Vladimir Arkhipkin and Vladimir Korviakov and
              Nikolai Gerasimenko and Denis Parkhomenko and Anna Dmitrienko and Anastasia Maltseva and
              Kirill Chernyshev and Ilia Vasiliev and Viacheslav Vasilev and Vladimir Polovnikov and
              Yury Kolabushin and Alexander Belykh and Mikhail Mamaev and Anastasia Aliaskina and
              Tatiana Nikulina and Polina Gavrilova and Denis Dimitrov},
    title = {Kandinsky 5.0: A family of diffusion models for Video \& Image generation},
    howpublished = {\url{https://github.com/ai-forever/Kandinsky-5}},
    year = 2025
}
```

docs/source/en/optimization/attention_backends.md

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,7 @@ Refer to the table below for an overview of the available attention families and
 | attention family | main feature |
 |---|---|
 | FlashAttention | minimizes memory reads/writes through tiling and recomputation |
+| AI Tensor Engine for ROCm | FlashAttention implementation optimized for AMD ROCm accelerators |
 | SageAttention | quantizes attention to int8 |
 | PyTorch native | built-in PyTorch implementation using [scaled_dot_product_attention](./fp16#scaled-dot-product-attention) |
 | xFormers | memory-efficient attention with support for various attention kernels |
@@ -139,6 +140,7 @@ Refer to the table below for a complete list of available attention backends and
 | `_native_xla` | [PyTorch native](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.SDPBackend.html#torch.nn.attention.SDPBackend) | XLA-optimized attention |
 | `flash` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-2 |
 | `flash_varlen` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention |
+| `aiter` | [AI Tensor Engine for ROCm](https://github.com/ROCm/aiter) | FlashAttention for AMD ROCm |
 | `_flash_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 |
 | `_flash_varlen_3` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | Variable length FlashAttention-3 |
 | `_flash_3_hub` | [FlashAttention](https://github.com/Dao-AILab/flash-attention) | FlashAttention-3 from kernels |
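
For orientation, here is a minimal sketch of selecting the new `aiter` backend on a pipeline's transformer, using the same `set_attention_backend` API shown in the Kandinsky example above. It assumes a ROCm GPU with `aiter>=0.1.5` installed; the Flux checkpoint is only illustrative.

```python
import torch
from diffusers import FluxPipeline

# Illustrative checkpoint; any pipeline whose transformer routes attention
# through the dispatcher can be switched the same way.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # ROCm GPUs are exposed through the "cuda" device string in PyTorch

# Route attention through the AI Tensor Engine for ROCm backend added in this commit.
pipe.transformer.set_attention_backend("aiter")

image = pipe("A photo of a red panda reading a book", num_inference_steps=28).images[0]
image.save("aiter_example.png")
```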

src/diffusers/loaders/lora_conversion_utils.py

Lines changed: 25 additions & 5 deletions
@@ -1977,14 +1977,34 @@ def get_alpha_scales(down_weight, alpha_key):
             "time_projection.1.diff_b"
         )
 
-    if any("head.head" in k for k in state_dict):
-        converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop(
-            f"head.head.{lora_down_key}.weight"
-        )
-        converted_state_dict["proj_out.lora_B.weight"] = original_state_dict.pop(f"head.head.{lora_up_key}.weight")
+    if any("head.head" in k for k in original_state_dict):
+        if any(f"head.head.{lora_down_key}.weight" in k for k in state_dict):
+            converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop(
+                f"head.head.{lora_down_key}.weight"
+            )
+        if any(f"head.head.{lora_up_key}.weight" in k for k in state_dict):
+            converted_state_dict["proj_out.lora_B.weight"] = original_state_dict.pop(
+                f"head.head.{lora_up_key}.weight"
+            )
         if "head.head.diff_b" in original_state_dict:
             converted_state_dict["proj_out.lora_B.bias"] = original_state_dict.pop("head.head.diff_b")
 
+    # Notes: https://huggingface.co/lightx2v/Wan2.2-Distill-Loras
+    # This is my (sayakpaul) assumption that this particular key belongs to the down matrix.
+    # Since for this particular LoRA, we don't have the corresponding up matrix, I will use
+    # an identity.
+    if any("head.head" in k and k.endswith(".diff") for k in state_dict):
+        if f"head.head.{lora_down_key}.weight" in state_dict:
+            logger.info(
+                f"The state dict seems to have both `head.head.diff` and `head.head.{lora_down_key}.weight` keys, which is unexpected."
+            )
+        converted_state_dict["proj_out.lora_A.weight"] = original_state_dict.pop("head.head.diff")
+        down_matrix_head = converted_state_dict["proj_out.lora_A.weight"]
+        up_matrix_shape = (down_matrix_head.shape[0], converted_state_dict["proj_out.lora_B.bias"].shape[0])
+        converted_state_dict["proj_out.lora_B.weight"] = torch.eye(
+            *up_matrix_shape, dtype=down_matrix_head.dtype, device=down_matrix_head.device
+        ).T
+
     for text_time in ["text_embedding", "time_embedding"]:
         if any(text_time in k for k in original_state_dict):
             for b_n in [0, 2]:
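
As a side note, here is a minimal, self-contained sketch (with made-up shapes) of why pairing the lone `head.head.diff` tensor with an identity `lora_B` leaves the effective LoRA update equal to `diff` itself, which is the intent of the conversion above:

```python
import torch

# Hypothetical dimensions; `diff` stands in for the single `head.head.diff` tensor,
# treated here as the LoRA down matrix A, and `bias` for `proj_out.lora_B.bias`.
out_features, in_features = 16, 64
diff = torch.randn(out_features, in_features)
bias = torch.zeros(out_features)

lora_A = diff
up_matrix_shape = (lora_A.shape[0], bias.shape[0])
lora_B = torch.eye(*up_matrix_shape, dtype=diff.dtype).T  # synthesized identity up matrix

# The effective LoRA update is lora_B @ lora_A; with an identity up matrix it
# reproduces `diff` unchanged (whenever diff.shape[0] matches the output dim).
delta = lora_B @ lora_A
assert torch.equal(delta, diff)
```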

src/diffusers/models/attention_dispatch.py

Lines changed: 59 additions & 0 deletions
@@ -28,6 +28,8 @@
 
 from ..utils import (
     get_logger,
+    is_aiter_available,
+    is_aiter_version,
     is_flash_attn_3_available,
     is_flash_attn_available,
     is_flash_attn_version,
@@ -48,13 +50,15 @@
     from ._modeling_parallel import ParallelConfig
 
 _REQUIRED_FLASH_VERSION = "2.6.3"
+_REQUIRED_AITER_VERSION = "0.1.5"
 _REQUIRED_SAGE_VERSION = "2.1.1"
 _REQUIRED_FLEX_VERSION = "2.5.0"
 _REQUIRED_XLA_VERSION = "2.2"
 _REQUIRED_XFORMERS_VERSION = "0.0.29"
 
 _CAN_USE_FLASH_ATTN = is_flash_attn_available() and is_flash_attn_version(">=", _REQUIRED_FLASH_VERSION)
 _CAN_USE_FLASH_ATTN_3 = is_flash_attn_3_available()
+_CAN_USE_AITER_ATTN = is_aiter_available() and is_aiter_version(">=", _REQUIRED_AITER_VERSION)
 _CAN_USE_SAGE_ATTN = is_sageattention_available() and is_sageattention_version(">=", _REQUIRED_SAGE_VERSION)
 _CAN_USE_FLEX_ATTN = is_torch_version(">=", _REQUIRED_FLEX_VERSION)
 _CAN_USE_NPU_ATTN = is_torch_npu_available()
@@ -79,6 +83,11 @@
     flash_attn_3_func = None
     flash_attn_3_varlen_func = None
 
+if _CAN_USE_AITER_ATTN:
+    from aiter import flash_attn_func as aiter_flash_attn_func
+else:
+    aiter_flash_attn_func = None
+
 if _CAN_USE_SAGE_ATTN:
     from sageattention import (
         sageattn,
@@ -167,6 +176,9 @@ class AttentionBackendName(str, Enum):
     _FLASH_3_HUB = "_flash_3_hub"
     # _FLASH_VARLEN_3_HUB = "_flash_varlen_3_hub" # not supported yet.
 
+    # `aiter`
+    AITER = "aiter"
+
     # PyTorch native
     FLEX = "flex"
     NATIVE = "native"
@@ -418,6 +430,12 @@ def _check_attention_backend_requirements(backend: AttentionBackendName) -> None
             f"Backend '{backend.value}' is not usable because the `kernels` package isn't available. Please install it with `pip install kernels`."
         )
 
+    elif backend == AttentionBackendName.AITER:
+        if not _CAN_USE_AITER_ATTN:
+            raise RuntimeError(
+                f"Aiter attention backend '{backend.value}' is not usable because of a missing package or a too-old version. Please install `aiter>={_REQUIRED_AITER_VERSION}`."
+            )
+
     elif backend in [
         AttentionBackendName.SAGE,
         AttentionBackendName.SAGE_VARLEN,
@@ -1425,6 +1443,47 @@ def _flash_varlen_attention_3(
     return (out, lse) if return_lse else out
 
 
+@_AttentionBackendRegistry.register(
+    AttentionBackendName.AITER,
+    constraints=[_check_device_cuda, _check_qkv_dtype_bf16_or_fp16, _check_shape],
+)
+def _aiter_flash_attention(
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    dropout_p: float = 0.0,
+    is_causal: bool = False,
+    scale: Optional[float] = None,
+    return_lse: bool = False,
+    _parallel_config: Optional["ParallelConfig"] = None,
+) -> torch.Tensor:
+    if not return_lse and torch.is_grad_enabled():
+        # aiter requires return_lse=True by assertion when gradients are enabled.
+        out, lse, *_ = aiter_flash_attn_func(
+            q=query,
+            k=key,
+            v=value,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_lse=True,
+        )
+    else:
+        out = aiter_flash_attn_func(
+            q=query,
+            k=key,
+            v=value,
+            dropout_p=dropout_p,
+            softmax_scale=scale,
+            causal=is_causal,
+            return_lse=return_lse,
+        )
+        if return_lse:
+            out, lse, *_ = out
+
+    return (out, lse) if return_lse else out
+
+
 @_AttentionBackendRegistry.register(
     AttentionBackendName.FLEX,
     constraints=[_check_attn_mask_or_causal, _check_device, _check_shape],
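
For orientation, a sketch of the underlying `aiter` call that the new `_aiter_flash_attention` wrapper makes when gradients are enabled. It assumes a ROCm GPU with `aiter>=0.1.5` installed and the FlashAttention-style `(batch, seq_len, num_heads, head_dim)` tensor layout; the shapes are made up.

```python
import torch
from aiter import flash_attn_func

# Hypothetical shapes; the backend's constraints require bf16/fp16 tensors on the GPU.
batch, seq_len, num_heads, head_dim = 2, 1024, 8, 64
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Mirrors the grad-enabled branch above: aiter asserts return_lse=True in that case,
# so both the attention output and the log-sum-exp tensor come back.
out, lse, *_ = flash_attn_func(
    q=q, k=k, v=v, dropout_p=0.0, softmax_scale=None, causal=False, return_lse=True
)
print(out.shape)  # (batch, seq_len, num_heads, head_dim)
```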

0 commit comments
