
Commit a663473

Merge branch 'main' into feature/ltx

2 parents 166e42c + c372615

File tree: 52 files changed (+3294 additions, -236 deletions)


.github/workflows/nightly_tests.yml

Lines changed: 55 additions & 0 deletions
```diff
@@ -142,6 +142,7 @@ jobs:
           HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
           # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
           CUBLAS_WORKSPACE_CONFIG: :16:8
+          RUN_COMPILE: yes
         run: |
           python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
             -s -v -k "not Flax and not Onnx" \
@@ -525,6 +526,60 @@ jobs:
           pip install slack_sdk tabulate
           python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
 
+  run_nightly_pipeline_level_quantization_tests:
+    name: Torch quantization nightly tests
+    strategy:
+      fail-fast: false
+      max-parallel: 2
+    runs-on:
+      group: aws-g6e-xlarge-plus
+    container:
+      image: diffusers/diffusers-pytorch-cuda
+      options: --shm-size "20gb" --ipc host --gpus 0
+    steps:
+      - name: Checkout diffusers
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 2
+      - name: NVIDIA-SMI
+        run: nvidia-smi
+      - name: Install dependencies
+        run: |
+          python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
+          python -m uv pip install -e [quality,test]
+          python -m uv pip install -U bitsandbytes optimum_quanto
+          python -m uv pip install pytest-reportlog
+      - name: Environment
+        run: |
+          python utils/print_env.py
+      - name: Pipeline-level quantization tests on GPU
+        env:
+          HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
+          # https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms
+          CUBLAS_WORKSPACE_CONFIG: :16:8
+          BIG_GPU_MEMORY: 40
+        run: |
+          python -m pytest -n 1 --max-worker-restart=0 --dist=loadfile \
+            --make-reports=tests_pipeline_level_quant_torch_cuda \
+            --report-log=tests_pipeline_level_quant_torch_cuda.log \
+            tests/quantization/test_pipeline_level_quantization.py
+      - name: Failure short reports
+        if: ${{ failure() }}
+        run: |
+          cat reports/tests_pipeline_level_quant_torch_cuda_stats.txt
+          cat reports/tests_pipeline_level_quant_torch_cuda_failures_short.txt
+      - name: Test suite reports artifacts
+        if: ${{ always() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: torch_cuda_pipeline_level_quant_reports
+          path: reports
+      - name: Generate Report and Notify Channel
+        if: always()
+        run: |
+          pip install slack_sdk tabulate
+          python utils/log_reports.py >> $GITHUB_STEP_SUMMARY
+
   # M1 runner currently not well supported
   # TODO: (Dhruv) add these back when we setup better testing for Apple Silicon
   # run_nightly_tests_apple_m1:
```

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -457,6 +457,8 @@
       title: Flux
     - local: api/pipelines/control_flux_inpaint
       title: FluxControlInpaint
+    - local: api/pipelines/framepack
+      title: Framepack
     - local: api/pipelines/hidream
       title: HiDream-I1
     - local: api/pipelines/hunyuandit
```
docs/source/en/api/pipelines/framepack.md (new file)

Lines changed: 209 additions & 0 deletions
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Framepack

<div class="flex flex-wrap space-x-1">
  <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</div>

[Packing Input Frame Context in Next-Frame Prediction Models for Video Generation](https://arxiv.org/abs/2504.12626) by Lvmin Zhang and Maneesh Agrawala.

*We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

## Available models

| Model name | Description |
|:---|:---|
| [`lllyasviel/FramePackI2V_HY`](https://huggingface.co/lllyasviel/FramePackI2V_HY) | Trained with the "inverted anti-drifting" strategy as described in the paper. Inference requires setting `sampling_type="inverted_anti_drifting"` when running the pipeline. |
| [`lllyasviel/FramePack_F1_I2V_HY_20250503`](https://huggingface.co/lllyasviel/FramePack_F1_I2V_HY_20250503) | Trained with a novel anti-drifting strategy, but inference is performed with the "vanilla" strategy as described in the paper. Inference requires setting `sampling_type="vanilla"` when running the pipeline. |

## Usage

Refer to the pipeline documentation for basic usage examples. The following section contains examples of offloading, different sampling methods, quantization, and more.

### First and last frame to video

The following example shows how to use Framepack with start and end image controls, using the inverted anti-drifting sampling model.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePackI2V_HY", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
first_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
)
last_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
)
output = pipe(
    image=first_image,
    last_image=last_image,
    prompt=prompt,
    height=512,
    width=512,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="inverted_anti_drifting",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

### Vanilla sampling

The following example shows how to use Framepack with the F1 model, which was trained with vanilla sampling but a new regulation approach for anti-drifting.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
export_to_video(output, "output.mp4", fps=30)
```

### Group offloading

Group offloading ([`~hooks.apply_group_offloading`]) provides aggressive memory optimizations by offloading internal parts of any model to the CPU, with possibly no additional overhead to generation time. If you have very low VRAM available, this approach may be suitable for you depending on the amount of CPU RAM available.

```python
import torch
from diffusers import HunyuanVideoFramepackPipeline, HunyuanVideoFramepackTransformer3DModel
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import SiglipImageProcessor, SiglipVisionModel

transformer = HunyuanVideoFramepackTransformer3DModel.from_pretrained(
    "lllyasviel/FramePack_F1_I2V_HY_20250503", torch_dtype=torch.bfloat16
)
feature_extractor = SiglipImageProcessor.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="feature_extractor"
)
image_encoder = SiglipVisionModel.from_pretrained(
    "lllyasviel/flux_redux_bfl", subfolder="image_encoder", torch_dtype=torch.float16
)
pipe = HunyuanVideoFramepackPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
)

# Enable group offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
list(map(
    lambda x: apply_group_offloading(x, onload_device, offload_device, offload_type="leaf_level", use_stream=True, low_cpu_mem_usage=True),
    [pipe.text_encoder, pipe.text_encoder_2, pipe.transformer]
))
pipe.image_encoder.to(onload_device)
pipe.vae.to(onload_device)
pipe.vae.enable_tiling()

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
)
output = pipe(
    image=image,
    prompt="A penguin dancing in the snow",
    height=832,
    width=480,
    num_frames=91,
    num_inference_steps=30,
    guidance_scale=9.0,
    generator=torch.Generator().manual_seed(0),
    sampling_type="vanilla",
).frames[0]
print(f"Max memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f} GB")
export_to_video(output, "output.mp4", fps=30)
```

## HunyuanVideoFramepackPipeline

[[autodoc]] HunyuanVideoFramepackPipeline
  - all
  - __call__

## HunyuanVideoPipelineOutput

[[autodoc]] pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -52,7 +52,6 @@ The following models are available for the image-to-video pipeline:
 | [`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution, `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
 | [`hunyuanvideo-community/HunyuanVideo-I2V-33ch`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo 33-channel I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
 | [`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo 16-channel I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
-| [`lllyasviel/FramePackI2V_HY`](https://huggingface.co/lllyasviel/FramePackI2V_HY) | lllyasviel's paper introducing a new technique for long-context video generation called [Framepack](https://arxiv.org/abs/2504.12626). |
 
 ## Quantization
 
```
docs/source/en/api/quantization.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -13,16 +13,17 @@ specific language governing permissions and limitations under the License.
 
 # Quantization
 
-Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index).
-
-Quantization techniques that aren't supported in Transformers can be added with the [`DiffusersQuantizer`] class.
+Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference.
 
 <Tip>
 
 Learn how to quantize models in the [Quantization](../quantization/overview) guide.
 
 </Tip>
 
+## PipelineQuantizationConfig
+
+[[autodoc]] quantizers.PipelineQuantizationConfig
 
 ## BitsAndBytesConfig
 
```
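The `PipelineQuantizationConfig` documented above appears to be what the new `run_nightly_pipeline_level_quantization_tests` job exercises via `tests/quantization/test_pipeline_level_quantization.py`. A minimal usage sketch follows; the constructor arguments (`quant_backend`, `quant_kwargs`, `components_to_quantize`) and the checkpoint are illustrative assumptions, not taken from this commit:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Quantize only selected pipeline components with a single, pipeline-level config.
# Argument values below are assumptions for illustration.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder_2"],
)

# The config is passed at load time; components not listed keep their full precision.
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # illustrative checkpoint, not from this commit
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe("photo of a cat wearing sunglasses", num_inference_steps=28).images[0]
```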