Commit 41ec606

Merge branch 'main' into introduce_autopipeline_for_text2video

2 parents e53b98d + 671149e

29 files changed: +1035 −120 lines

docs/source/en/_toctree.yml

Lines changed: 6 additions & 2 deletions

```diff
@@ -401,6 +401,8 @@
         title: WanAnimateTransformer3DModel
       - local: api/models/wan_transformer_3d
         title: WanTransformer3DModel
+      - local: api/models/z_image_transformer2d
+        title: ZImageTransformer2DModel
       title: Transformers
   - sections:
       - local: api/models/stable_cascade_unet
@@ -551,6 +553,8 @@
         title: Kandinsky 2.2
       - local: api/pipelines/kandinsky3
         title: Kandinsky 3
+      - local: api/pipelines/kandinsky5_image
+        title: Kandinsky 5.0 Image
       - local: api/pipelines/kolors
         title: Kolors
       - local: api/pipelines/latent_consistency_models
@@ -646,6 +650,8 @@
         title: VisualCloze
       - local: api/pipelines/wuerstchen
         title: Wuerstchen
+      - local: api/pipelines/z_image
+        title: Z-Image
       title: Image
   - sections:
       - local: api/pipelines/allegro
@@ -664,8 +670,6 @@
         title: HunyuanVideo1.5
       - local: api/pipelines/i2vgenxl
         title: I2VGen-XL
-      - local: api/pipelines/kandinsky5_image
-        title: Kandinsky 5.0 Image
       - local: api/pipelines/kandinsky5_video
         title: Kandinsky 5.0 Video
       - local: api/pipelines/latte
```

docs/source/en/api/cache.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -34,3 +34,9 @@ Cache methods speedup diffusion transformers by storing and reusing intermediate
 [[autodoc]] FirstBlockCacheConfig

 [[autodoc]] apply_first_block_cache
+
+### TaylorSeerCacheConfig
+
+[[autodoc]] TaylorSeerCacheConfig
+
+[[autodoc]] apply_taylorseer_cache
```
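As a quick illustration of the functional entry point added here (not part of the commit): a minimal sketch, assuming `apply_taylorseer_cache` follows the `(module, config)` signature of the existing `apply_*` cache helpers such as `apply_first_block_cache`.

```python
import torch
from diffusers import FluxTransformer2DModel, TaylorSeerCacheConfig, apply_taylorseer_cache

# Load a transformer that supports caching; FLUX.1-dev stores it in the
# "transformer" subfolder of the pipeline repository.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Assumed signature (module, config), by analogy with apply_first_block_cache.
config = TaylorSeerCacheConfig(cache_interval=5, max_order=1, disable_cache_before_step=10)
apply_taylorseer_cache(transformer, config)
```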
docs/source/en/api/models/z_image_transformer2d.md

Lines changed: 19 additions & 0 deletions

```diff
@@ -0,0 +1,19 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# ZImageTransformer2DModel
+
+A Transformer model for image-like data from [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).
+
+## ZImageTransformer2DModel
+
+[[autodoc]] ZImageTransformer2DModel
```
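For context, a minimal sketch (not part of the diff) of loading the new model class on its own; the `subfolder="transformer"` layout is an assumption based on the usual diffusers pipeline repository structure.

```python
import torch
from diffusers import ZImageTransformer2DModel

# Load just the transformer from the Z-Image-Turbo checkpoint; the
# "transformer" subfolder name is assumed from standard pipeline layout.
transformer = ZImageTransformer2DModel.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
print(sum(p.numel() for p in transformer.parameters()))  # roughly 6B parameters
```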

docs/source/en/api/pipelines/kandinsky5_image.md

Lines changed: 5 additions & 1 deletion

```diff
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License.

 [Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.

-Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters)
+Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters).

 The model introduces several key innovations:
 - **Latent diffusion pipeline** with **Flow Matching** for improved training stability
@@ -21,10 +21,14 @@ The model introduces several key innovations:

 The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).

+> [!TIP]
+> Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
+

 ## Available Models

 Kandinsky 5.0 Image Lite:
+
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
 | [**kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers) | 6B image Supervised Fine-Tuned model | Highest generation quality |
```
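A minimal usage sketch for the checkpoint listed above (not part of the diff), assuming it resolves to its pipeline class through the generic `DiffusionPipeline.from_pretrained` entry point and follows the standard prompt-to-images interface.

```python
import torch
from diffusers import DiffusionPipeline

# The checkpoint id comes from the table above; everything else is the
# standard diffusers text-to-image calling convention, assumed here.
pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(prompt="A watercolor fox in a snowy forest").images[0]
image.save("kandinsky5_image.png")
```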

docs/source/en/api/pipelines/kandinsky5_video.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -30,6 +30,7 @@ The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).
 ## Available Models

 Kandinsky 5.0 T2V Pro:
+
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
 | **kandinskylab/Kandinsky-5.0-T2V-Pro-sft-5s-Diffusers** | 5 second Text-to-Video Pro model | High-quality text-to-video generation |
```
docs/source/en/api/pipelines/z_image.md

Lines changed: 33 additions & 0 deletions

```diff
@@ -0,0 +1,33 @@
+<!--Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Z-Image
+
+<div class="flex flex-wrap space-x-1">
+  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
+[Z-Image](https://huggingface.co/papers/2511.22699) is a powerful and highly efficient image generation model with 6B parameters. Currently only one model is available, with two more to be released:
+
+|Model|Hugging Face|
+|---|---|
+|Z-Image-Turbo|https://huggingface.co/Tongyi-MAI/Z-Image-Turbo|
+
+## Z-Image-Turbo
+
+Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (Number of Function Evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16GB of VRAM on consumer devices. It excels in photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
+
+## ZImagePipeline
+
+[[autodoc]] ZImagePipeline
+  - all
+  - __call__
```
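A minimal sketch of running the pipeline documented above, assuming `ZImagePipeline` follows the standard diffusers text-to-image call signature; `num_inference_steps=8` mirrors the 8 NFEs mentioned in the doc.

```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# 8 steps to match the 8 NFEs described above; argument names are assumed
# from the standard diffusers text-to-image interface.
image = pipe(
    prompt="A neon sign reading 'Z-Image' on a rainy street, photorealistic",
    num_inference_steps=8,
).images[0]
image.save("z_image_turbo.png")
```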

docs/source/en/optimization/cache.md

Lines changed: 31 additions & 0 deletions

````diff
@@ -66,4 +66,35 @@ config = FasterCacheConfig(
     tensor_format="BFCHW",
 )
 pipeline.transformer.enable_cache(config)
+```
+
+## TaylorSeer Cache
+
+[TaylorSeer Cache](https://huggingface.co/papers/2403.06923) accelerates diffusion inference by using Taylor series expansions to approximate and cache intermediate activations across denoising steps. The method predicts future outputs based on past computations, reusing them at specified intervals to reduce redundant calculations.
+
+This caching mechanism delivers strong results with minimal additional memory overhead. For detailed performance analysis, see [our findings here](https://github.com/huggingface/diffusers/pull/12648#issuecomment-3610615080).
+
+To enable TaylorSeer Cache, create a [`TaylorSeerCacheConfig`] and pass it to your pipeline's transformer:
+
+- `cache_interval`: Number of steps to reuse cached outputs before performing a full forward pass
+- `disable_cache_before_step`: Initial steps that use full computations to gather data for the approximations
+- `max_order`: Approximation accuracy; in theory, higher values improve quality but increase memory usage, so we recommend setting it to `1`
+
+```python
+import torch
+from diffusers import FluxPipeline, TaylorSeerCacheConfig
+
+pipe = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    torch_dtype=torch.bfloat16,
+)
+pipe.to("cuda")
+
+config = TaylorSeerCacheConfig(
+    cache_interval=5,
+    max_order=1,
+    disable_cache_before_step=10,
+    taylor_factors_dtype=torch.bfloat16,
+)
+pipe.transformer.enable_cache(config)
 ```
````
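To make the interplay of the two scheduling knobs concrete, here is an illustrative standalone sketch (not diffusers API) of which denoising steps would run a full forward pass under the settings above, assuming caching stays off for the first `disable_cache_before_step` steps and a full refresh then happens once every `cache_interval` steps.

```python
cache_interval = 5
disable_cache_before_step = 10

for step in range(20):
    if step < disable_cache_before_step:
        kind = "full forward (warm-up, gathers Taylor factors)"
    elif (step - disable_cache_before_step) % cache_interval == 0:
        kind = "full forward (refreshes the cache)"
    else:
        kind = "cached (Taylor-series prediction)"
    print(f"step {step:2d}: {kind}")
```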

docs/source/en/quantization/modelopt.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

 # NVIDIA ModelOpt

-[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
+[NVIDIA-ModelOpt](https://github.com/NVIDIA/Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

 Before you begin, make sure you have nvidia_modelopt installed.

@@ -57,7 +57,7 @@ image.save("output.png")
 >
 > The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
 >
-> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).
+> More details can be found [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples).

 ## NVIDIAModelOptConfig

@@ -86,7 +86,7 @@ The quantization methods supported are as follows:
 | **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now`|


-Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
+Refer to the [official modelopt documentation](https://nvidia.github.io/Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

 ## Serializing and Deserializing quantized models
```

scripts/convert_hunyuan_video1_5_to_diffusers.py

Lines changed: 27 additions & 2 deletions

```diff
@@ -69,6 +69,11 @@
         "target_size": 960,
         "task_type": "i2v",
     },
+    "480p_i2v_step_distilled": {
+        "target_size": 640,
+        "task_type": "i2v",
+        "use_meanflow": True,
+    },
 }

 SCHEDULER_CONFIGS = {
@@ -93,6 +98,9 @@
     "720p_i2v_distilled": {
         "shift": 7.0,
     },
+    "480p_i2v_step_distilled": {
+        "shift": 7.0,
+    },
 }

 GUIDANCE_CONFIGS = {
@@ -117,6 +125,9 @@
     "720p_i2v_distilled": {
         "guidance_scale": 1.0,
     },
+    "480p_i2v_step_distilled": {
+        "guidance_scale": 1.0,
+    },
 }


@@ -126,7 +137,7 @@ def swap_scale_shift(weight):
     return new_weight


-def convert_hyvideo15_transformer_to_diffusers(original_state_dict):
+def convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=None):
     """
     Convert HunyuanVideo 1.5 original checkpoint to Diffusers format.
     """
@@ -142,6 +153,20 @@ def convert_hyvideo15_transformer_to_diffusers(original_state_dict):
     )
     converted_state_dict["time_embed.timestep_embedder.linear_2.bias"] = original_state_dict.pop("time_in.mlp.2.bias")

+    if config.use_meanflow:
+        converted_state_dict["time_embed.timestep_embedder_r.linear_1.weight"] = original_state_dict.pop(
+            "time_r_in.mlp.0.weight"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_1.bias"] = original_state_dict.pop(
+            "time_r_in.mlp.0.bias"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_2.weight"] = original_state_dict.pop(
+            "time_r_in.mlp.2.weight"
+        )
+        converted_state_dict["time_embed.timestep_embedder_r.linear_2.bias"] = original_state_dict.pop(
+            "time_r_in.mlp.2.bias"
+        )
+
     # 2. context_embedder.time_text_embed.timestep_embedder <- txt_in.t_embedder
     converted_state_dict["context_embedder.time_text_embed.timestep_embedder.linear_1.weight"] = (
         original_state_dict.pop("txt_in.t_embedder.mlp.0.weight")
@@ -627,7 +652,7 @@ def convert_transformer(args):
     config = TRANSFORMER_CONFIGS[args.transformer_type]
     with init_empty_weights():
         transformer = HunyuanVideo15Transformer3DModel(**config)
-    state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict)
+    state_dict = convert_hyvideo15_transformer_to_diffusers(original_state_dict, config=transformer.config)
     transformer.load_state_dict(state_dict, strict=True, assign=True)

     return transformer
```
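The four meanflow assignments added in `convert_hyvideo15_transformer_to_diffusers` follow a single rename pattern (`time_r_in.mlp.{0,2}` → `time_embed.timestep_embedder_r.linear_{1,2}`); an equivalent compact form, shown only for illustration on a dummy state dict, could be:

```python
# Illustrative compression of the meanflow remapping in the diff above:
# mlp.0 -> linear_1 and mlp.2 -> linear_2, for both weight and bias.
original_state_dict = {
    f"time_r_in.mlp.{i}.{k}": None for i in (0, 2) for k in ("weight", "bias")
}
converted_state_dict = {}
for src_idx, dst_idx in ((0, 1), (2, 2)):
    for kind in ("weight", "bias"):
        converted_state_dict[
            f"time_embed.timestep_embedder_r.linear_{dst_idx}.{kind}"
        ] = original_state_dict.pop(f"time_r_in.mlp.{src_idx}.{kind}")
print(sorted(converted_state_dict))
```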

src/diffusers/__init__.py

Lines changed: 4 additions & 0 deletions

```diff
@@ -169,10 +169,12 @@
         "LayerSkipConfig",
         "PyramidAttentionBroadcastConfig",
         "SmoothedEnergyGuidanceConfig",
+        "TaylorSeerCacheConfig",
         "apply_faster_cache",
         "apply_first_block_cache",
         "apply_layer_skip",
         "apply_pyramid_attention_broadcast",
+        "apply_taylorseer_cache",
     ]
 )
 _import_structure["models"].extend(
@@ -900,10 +902,12 @@
         LayerSkipConfig,
         PyramidAttentionBroadcastConfig,
         SmoothedEnergyGuidanceConfig,
+        TaylorSeerCacheConfig,
         apply_faster_cache,
         apply_first_block_cache,
         apply_layer_skip,
         apply_pyramid_attention_broadcast,
+        apply_taylorseer_cache,
     )
     from .models import (
         AllegroTransformer3DModel,
```
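The practical effect of these four lines: the lazy `_import_structure` entry and the eager `TYPE_CHECKING` import mirror each other, so after this commit both new names are importable from the package root.

```python
from diffusers import TaylorSeerCacheConfig, apply_taylorseer_cache
```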
