Commit b0fcc81

Merge branch 'main' into flux2
2 parents 7587a50 + 17c0e79 commit b0fcc81

119 files changed: +6389 −438 lines changed


docs/source/en/_toctree.yml

Lines changed: 6 additions & 2 deletions
@@ -401,6 +401,8 @@
       title: WanAnimateTransformer3DModel
     - local: api/models/wan_transformer_3d
       title: WanTransformer3DModel
+    - local: api/models/z_image_transformer2d
+      title: ZImageTransformer2DModel
     title: Transformers
   - sections:
     - local: api/models/stable_cascade_unet
@@ -551,6 +553,8 @@
       title: Kandinsky 2.2
     - local: api/pipelines/kandinsky3
       title: Kandinsky 3
+    - local: api/pipelines/kandinsky5_image
+      title: Kandinsky 5.0 Image
     - local: api/pipelines/kolors
       title: Kolors
     - local: api/pipelines/latent_consistency_models
@@ -646,6 +650,8 @@
       title: VisualCloze
     - local: api/pipelines/wuerstchen
       title: Wuerstchen
+    - local: api/pipelines/z_image
+      title: Z-Image
     title: Image
   - sections:
     - local: api/pipelines/allegro
@@ -664,8 +670,6 @@
       title: HunyuanVideo1.5
     - local: api/pipelines/i2vgenxl
       title: I2VGen-XL
-    - local: api/pipelines/kandinsky5_image
-      title: Kandinsky 5.0 Image
     - local: api/pipelines/kandinsky5_video
       title: Kandinsky 5.0 Video
     - local: api/pipelines/latte

docs/source/en/api/cache.md

Lines changed: 6 additions & 0 deletions
@@ -34,3 +34,9 @@ Cache methods speedup diffusion transformers by storing and reusing intermediate
 [[autodoc]] FirstBlockCacheConfig

 [[autodoc]] apply_first_block_cache
+
+### TaylorSeerCacheConfig
+
+[[autodoc]] TaylorSeerCacheConfig
+
+[[autodoc]] apply_taylorseer_cache
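
The new config plugs into the same caching interface as the existing configs; a minimal usage sketch, assuming the Flux checkpoint and the `enable_cache` call shown in the optimization guide changes later in this commit:

```python
import torch
from diffusers import FluxPipeline, TaylorSeerCacheConfig

# Minimal sketch: attach the TaylorSeer cache to a pipeline's transformer.
# The functional apply_taylorseer_cache documented above is assumed to mirror
# apply_first_block_cache(module, config).
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("cuda")

config = TaylorSeerCacheConfig(cache_interval=5, max_order=1, disable_cache_before_step=10)
pipe.transformer.enable_cache(config)
```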

docs/source/en/api/models/z_image_transformer2d.md (new file)

Lines changed: 19 additions & 0 deletions

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# ZImageTransformer2DModel

A Transformer model for image-like data from [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).

## ZImageTransformer2DModel

[[autodoc]] ZImageTransformer2DModel

docs/source/en/api/pipelines/kandinsky5_image.md

Lines changed: 5 additions & 1 deletion
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License.

 [Kandinsky 5.0](https://arxiv.org/abs/2511.14993) is a family of diffusion models for Video & Image generation.

-Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters)
+Kandinsky 5.0 Image Lite is a lightweight image generation model (6B parameters).

 The model introduces several key innovations:
 - **Latent diffusion pipeline** with **Flow Matching** for improved training stability
@@ -21,10 +21,14 @@ The model introduces several key innovations:

 The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.com/kandinskylab/Kandinsky-5).

+> [!TIP]
+> Check out the [Kandinsky Lab](https://huggingface.co/kandinskylab) organization on the Hub for the official model checkpoints for text-to-video generation, including pretrained, SFT, no-CFG, and distilled variants.
+
 ## Available Models

 Kandinsky 5.0 Image Lite:
+
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
 | [**kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers**](https://huggingface.co/kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers) | 6B image Supervised Fine-Tuned model | Highest generation quality |
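
A minimal loading sketch for the checkpoint in the table above, assuming [`DiffusionPipeline.from_pretrained`] resolves the pipeline class from the checkpoint's `model_index.json` and that the call accepts the usual `prompt` and `num_inference_steps` arguments (check the pipeline reference for the exact signature):

```python
import torch
from diffusers import DiffusionPipeline

# Hedged sketch: the dedicated Kandinsky 5.0 Image pipeline class and its call
# signature may differ; the auto-resolving DiffusionPipeline entry point is used here.
pipe = DiffusionPipeline.from_pretrained(
    "kandinskylab/Kandinsky-5.0-T2I-Lite-sft-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    prompt="A portrait of a corgi wearing a crown, studio lighting",
    num_inference_steps=50,
).images[0]
image.save("kandinsky5_image.png")
```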

docs/source/en/api/pipelines/kandinsky5_video.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ The original codebase can be found at [kandinskylab/Kandinsky-5](https://github.
 ## Available Models

 Kandinsky 5.0 T2V Pro:
+
 | model_id | Description | Use Cases |
 |------------|-------------|-----------|
 | **kandinskylab/Kandinsky-5.0-T2V-Pro-sft-5s-Diffusers** | 5 second Text-to-Video Pro model | High-quality text-to-video generation |

docs/source/en/api/pipelines/z_image.md (new file)

Lines changed: 66 additions & 0 deletions

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Z-Image

<div class="flex flex-wrap space-x-1">
  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Z-Image](https://huggingface.co/papers/2511.22699) is a powerful and highly efficient image generation model with 6B parameters. Currently only one model is available, with two more to be released:

| Model | Hugging Face |
|---|---|
| Z-Image-Turbo | https://huggingface.co/Tongyi-MAI/Z-Image-Turbo |

## Z-Image-Turbo

Z-Image-Turbo is a distilled version of Z-Image that matches or exceeds leading competitors with only 8 NFEs (number of function evaluations). It offers sub-second inference latency on enterprise-grade H800 GPUs and fits comfortably within 16GB of VRAM on consumer devices. It excels at photorealistic image generation, bilingual text rendering (English & Chinese), and robust instruction adherence.
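
A minimal text-to-image sketch, assuming [`ZImagePipeline`] mirrors the call signature of the image-to-image example in the next section (few steps and `guidance_scale=0.0`, as suited to the distilled Turbo model):

```python
import torch
from diffusers import ZImagePipeline

# Hedged sketch: assumes ZImagePipeline accepts prompt, num_inference_steps,
# guidance_scale, and generator like the image-to-image pipeline below.
pipe = ZImagePipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    "A neon-lit street market at night, photorealistic, shop signs in English and Chinese",
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_t2i.png")
```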
## Image-to-image

Use [`ZImageImg2ImgPipeline`] to transform an existing image based on a text prompt.

```python
import torch
from diffusers import ZImageImg2ImgPipeline
from diffusers.utils import load_image

pipe = ZImageImg2ImgPipeline.from_pretrained("Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16)
pipe.to("cuda")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = load_image(url).resize((1024, 1024))

prompt = "A fantasy landscape with mountains and a river, detailed, vibrant colors"
image = pipe(
    prompt,
    image=init_image,
    strength=0.6,
    num_inference_steps=9,
    guidance_scale=0.0,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("zimage_img2img.png")
```

## ZImagePipeline

[[autodoc]] ZImagePipeline
  - all
  - __call__

## ZImageImg2ImgPipeline

[[autodoc]] ZImageImg2ImgPipeline
  - all
  - __call__

docs/source/en/optimization/cache.md

Lines changed: 31 additions & 0 deletions
@@ -66,4 +66,35 @@ config = FasterCacheConfig(
     tensor_format="BFCHW",
 )
 pipeline.transformer.enable_cache(config)
+```
+
+## TaylorSeer Cache
+
+[TaylorSeer Cache](https://huggingface.co/papers/2403.06923) accelerates diffusion inference by using Taylor series expansions to approximate and cache intermediate activations across denoising steps. The method predicts future outputs based on past computations, reusing them at specified intervals to reduce redundant calculations.
+
+This caching mechanism delivers strong results with minimal additional memory overhead. For a detailed performance analysis, see [our findings here](https://github.com/huggingface/diffusers/pull/12648#issuecomment-3610615080).
+
+To enable TaylorSeer Cache, create a [`TaylorSeerCacheConfig`] and pass it to your pipeline's transformer:
+
+- `cache_interval`: Number of steps to reuse cached outputs before performing a full forward pass
+- `disable_cache_before_step`: Number of initial steps that run full computations to gather data for the approximation
+- `max_order`: Order of the Taylor approximation (in theory, higher values improve quality at the cost of memory; we recommend setting it to `1`)
+
+```python
+import torch
+from diffusers import FluxPipeline, TaylorSeerCacheConfig
+
+pipe = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-dev",
+    torch_dtype=torch.bfloat16,
+)
+pipe.to("cuda")
+
+config = TaylorSeerCacheConfig(
+    cache_interval=5,
+    max_order=1,
+    disable_cache_before_step=10,
+    taylor_factors_dtype=torch.bfloat16,
+)
+pipe.transformer.enable_cache(config)
 ```

docs/source/en/quantization/modelopt.md

Lines changed: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

 # NVIDIA ModelOpt

-[NVIDIA-ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.
+[NVIDIA-ModelOpt](https://github.com/NVIDIA/Model-Optimizer) is a unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed.

 Before you begin, make sure you have nvidia_modelopt installed.

@@ -57,7 +57,7 @@ image.save("output.png")
 >
 > The quantization methods in NVIDIA-ModelOpt are designed to reduce the memory footprint of model weights using various QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization) techniques while maintaining model performance. However, the actual performance gain during inference depends on the deployment framework (e.g., TRT-LLM, TensorRT) and the specific hardware configuration.
 >
-> More details can be found [here](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples).
+> More details can be found [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples).

 ## NVIDIAModelOptConfig

@@ -86,7 +86,7 @@ The quantization methods supported are as follows:
 | **NVFP4** | `nvfp4 weight only`, `nvfp4 block quantization` | `quant_type`, `quant_type + channel_quantize + block_quantize` | `channel_quantize = -1 is only supported for now`|

-Refer to the [official modelopt documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
+Refer to the [official modelopt documentation](https://nvidia.github.io/Model-Optimizer/) for a better understanding of the available quantization methods and the exhaustive list of configuration options available.

 ## Serializing and Deserializing quantized models

docs/source/en/training/distributed_inference.md

Lines changed: 68 additions & 25 deletions
@@ -237,6 +237,8 @@ By selectively loading and unloading the models you need at a given stage and sh

 Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized attention backend. Refer to this [table](../optimization/attention_backends#available-backends) for a complete list of available backends.

+Most attention backends are compatible with context parallelism. Open an [issue](https://github.com/huggingface/diffusers/issues/new) if a backend is not compatible.
+
 ### Ring Attention

 Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
@@ -245,38 +247,58 @@ Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transf

 ```py
 import torch
-from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig
-
-try:
-    torch.distributed.init_process_group("nccl")
-    rank = torch.distributed.get_rank()
-    device = torch.device("cuda", rank % torch.cuda.device_count())
+from torch import distributed as dist
+from diffusers import DiffusionPipeline, ContextParallelConfig
+
+def setup_distributed():
+    if not dist.is_initialized():
+        dist.init_process_group(backend="nccl")
+    rank = dist.get_rank()
+    device = torch.device(f"cuda:{rank}")
     torch.cuda.set_device(device)
-
-    transformer = AutoModel.from_pretrained("Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16, parallel_config=ContextParallelConfig(ring_degree=2))
-    pipeline = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16, device_map="cuda")
-    pipeline.transformer.set_attention_backend("flash")
+    return device
+
+def main():
+    device = setup_distributed()
+    world_size = dist.get_world_size()
+
+    pipeline = DiffusionPipeline.from_pretrained(
+        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, device_map=device
+    )
+    pipeline.transformer.set_attention_backend("_native_cudnn")
+
+    cp_config = ContextParallelConfig(ring_degree=world_size)
+    pipeline.transformer.enable_parallelism(config=cp_config)

     prompt = """
     cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
     highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
     """

     # Must specify generator so all ranks start with same latents (or pass your own)
     generator = torch.Generator().manual_seed(42)
-    image = pipeline(prompt, num_inference_steps=50, generator=generator).images[0]
-
-    if rank == 0:
-        image.save("output.png")
-
-except Exception as e:
-    print(f"An error occurred: {e}")
-    torch.distributed.breakpoint()
-    raise
-
-finally:
-    if torch.distributed.is_initialized():
-        torch.distributed.destroy_process_group()
+    image = pipeline(
+        prompt,
+        guidance_scale=3.5,
+        num_inference_steps=50,
+        generator=generator,
+    ).images[0]
+
+    if dist.get_rank() == 0:
+        image.save("output.png")
+
+    if dist.is_initialized():
+        dist.destroy_process_group()
+
+
+if __name__ == "__main__":
+    main()
+```
+
+The script above must be launched with a PyTorch-compatible distributed launcher, such as [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html). Set `--nproc-per-node` to the number of available GPUs.
+
+```shell
+torchrun --nproc-per-node 2 above_script.py
 ```

@@ -288,5 +310,26 @@ finally:
 Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].

 ```py
+# Depending on the number of GPUs available.
 pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
-```
+```
+
+### parallel_config
+
+Pass `parallel_config` during model initialization to enable context parallelism.
+
+```py
+CKPT_ID = "black-forest-labs/FLUX.1-dev"
+
+cp_config = ContextParallelConfig(ring_degree=2)
+transformer = AutoModel.from_pretrained(
+    CKPT_ID,
+    subfolder="transformer",
+    torch_dtype=torch.bfloat16,
+    parallel_config=cp_config
+)
+
+pipeline = DiffusionPipeline.from_pretrained(
+    CKPT_ID, transformer=transformer, torch_dtype=torch.bfloat16,
+).to(device)
+```
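
The `parallel_config` snippet above omits its imports and the `device` setup; a self-contained sketch, assuming two GPUs and a `torchrun` launch as shown earlier (swap `ring_degree` for `ulysses_degree` to use Ulysses Attention instead):

```python
import torch
from torch import distributed as dist
from diffusers import AutoModel, DiffusionPipeline, ContextParallelConfig

# Hedged sketch combining the snippets above into one runnable script.
# Launch with: torchrun --nproc-per-node 2 this_script.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)

CKPT_ID = "black-forest-labs/FLUX.1-dev"
cp_config = ContextParallelConfig(ring_degree=dist.get_world_size())  # or ulysses_degree=...

# Passing parallel_config at initialization enables context parallelism for this model.
transformer = AutoModel.from_pretrained(
    CKPT_ID,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    parallel_config=cp_config,
)
pipeline = DiffusionPipeline.from_pretrained(
    CKPT_ID, transformer=transformer, torch_dtype=torch.bfloat16
).to(device)

# Same seed on every rank so all ranks start from identical latents.
generator = torch.Generator().manual_seed(42)
image = pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California",
    num_inference_steps=50,
    generator=generator,
).images[0]

if rank == 0:
    image.save("output.png")
dist.destroy_process_group()
```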

examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@
 import wandb

 # Will error if the minimal version of diffusers is not installed. Remove at your own risks.
-check_min_version("0.36.0.dev0")
+check_min_version("0.37.0.dev0")

 logger = get_logger(__name__)
