Commit f6cb65c

add other vid models

1 parent 704dbb4 · commit f6cb65c

File tree: 6 files changed, +198 −10 lines

docs/source/en/api/pipelines/aura_flow.md

Lines changed: 42 additions & 1 deletion
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

# AuraFlow

-AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3.md) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.
+AuraFlow is inspired by [Stable Diffusion 3](../pipelines/stable_diffusion/stable_diffusion_3) and is by far the largest text-to-image generation model that comes with an Apache 2.0 license. This model achieves state-of-the-art results on the [GenEval](https://github.com/djghosh13/geneval) benchmark.

It was developed by the Fal team and more details about it can be found in [this blog post](https://blog.fal.ai/auraflow/).

@@ -22,6 +22,47 @@ AuraFlow can be quite expensive to run on consumer hardware devices. However, yo

</Tip>

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on output quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`AuraFlowPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AuraFlowTransformer2DModel, AuraFlowPipeline
from transformers import BitsAndBytesConfig, UMT5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = UMT5EncoderModel.from_pretrained(
    "fal/AuraFlow",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AuraFlowTransformer2DModel.from_pretrained(
    "fal/AuraFlow",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A refreshing scene where a glass of freshly squeezed orange juice stands prominently at the center, bathed in warm, golden sunlight that highlights the vibrant, citrus hues of the juice. The glass is intricately detailed, showing condensation droplets that glisten like tiny jewels. Surrounding the base of the glass, scattered orange slices and lush green leaves add a touch of natural beauty and freshness. Above the glass, a dynamic splash of orange juice is captured mid-air, forming the word 'Orange' in a fluid, playful script. The splash is so vivid and realistic that each droplet seems to dance in the air, creating a sense of movement and energy. In the background, a serene orchard with rows of orange trees stretches out under a clear blue sky, their branches heavy with ripe oranges ready for harvest. Rays of sunlight filter through the leaves, casting dappled shadows on the ground. A gentle breeze rustles the leaves, adding a sense of calm and tranquility to the scene. The entire scene evokes a sense of purity, freshness, and vitality, inviting viewers to experience the simple joy of a glass of fresh orange juice."
image = pipeline(prompt).images[0]
image.save("auraflow.png")
```

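The same approach works with the other backends listed above. A minimal sketch with the torchao backend instead of bitsandbytes (this assumes the `torchao` package is installed, and `"int8wo"` is only an illustrative weight-only setting):

```py
# Sketch: weight-only int8 quantization of the AuraFlow transformer with torchao.
# Assumes the `torchao` package is installed; "int8wo" is an illustrative setting.
import torch
from diffusers import AuraFlowPipeline, AuraFlowTransformer2DModel, TorchAoConfig

quant_config = TorchAoConfig("int8wo")
transformer_int8 = AuraFlowTransformer2DModel.from_pretrained(
    "fal/AuraFlow",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = AuraFlowPipeline.from_pretrained(
    "fal/AuraFlow",
    transformer=transformer_int8,
    torch_dtype=torch.bfloat16,
).to("cuda")
```
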
## AuraFlowPipeline
[[autodoc]] AuraFlowPipeline

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 38 additions & 5 deletions
@@ -112,13 +112,46 @@ CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds o

- With cpu offloading and tiling enabled, memory usage is `11 GB`
- `pipe.vae.enable_slicing()`

-### Quantized inference
-[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or lower VRAM GPUs!
-It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below.
-- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
-- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`CogVideoXPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=8)
```

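The VAE is not quantized in the example above, so the tiling and slicing memory savers mentioned at the top of this section can still be stacked on top of it; a short sketch continuing from the `pipeline` object above:

```py
# Sketch: combine the 8-bit text encoder/transformer with the usual VAE memory savers.
pipeline.vae.enable_tiling()   # decode the latent video in tiles
pipeline.vae.enable_slicing()  # decode the batch one slice at a time

video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "ship_low_memory.mp4", fps=8)
```
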
## CogVideoXPipeline

docs/source/en/api/pipelines/flux.md

Lines changed: 42 additions & 0 deletions
@@ -334,6 +334,48 @@ out = pipe(

```py
out.save("image.png")
```

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on output quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`FluxPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A refreshing scene where a glass of freshly squeezed orange juice stands prominently at the center, bathed in warm, golden sunlight that highlights the vibrant, citrus hues of the juice. The glass is intricately detailed, showing condensation droplets that glisten like tiny jewels. Surrounding the base of the glass, scattered orange slices and lush green leaves add a touch of natural beauty and freshness. Above the glass, a dynamic splash of orange juice is captured mid-air, forming the word 'Orange' in a fluid, playful script. The splash is so vivid and realistic that each droplet seems to dance in the air, creating a sense of movement and energy. In the background, a serene orchard with rows of orange trees stretches out under a clear blue sky, their branches heavy with ripe oranges ready for harvest. Rays of sunlight filter through the leaves, casting dappled shadows on the ground. A gentle breeze rustles the leaves, adding a sense of calm and tranquility to the scene. The entire scene evokes a sense of purity, freshness, and vitality, inviting viewers to experience the simple joy of a glass of fresh orange juice."

image = pipeline(prompt, guidance_scale=3.5, height=768, width=1360, num_inference_steps=50).images[0]
image.save("flux.png")
```

## Single File Loading for the `FluxTransformer2DModel`
The `FluxTransformer2DModel` supports loading checkpoints in the original format shipped by Black Forest Labs. This is also useful when trying to load finetunes or quantized versions of the models that have been published by the community.
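
GGUF checkpoints are one example of such community-published quantized versions, and they can be routed through this single-file path as well. A minimal sketch (the repository and file name below are illustrative community exports, not official artifacts):

```py
# Sketch: load a community GGUF export of the Flux transformer via single-file loading.
# The checkpoint URL is an illustrative example of a community export.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q4_K_S.gguf"
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipeline.enable_model_cpu_offload()
```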

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 31 additions & 0 deletions
@@ -32,6 +32,37 @@ Recommendations for inference:

- For smaller resolution images, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`HunyuanVideoPipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
    "tencent/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=15)
```

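The `shift` recommendation above can be combined with this pipeline by re-creating the scheduler after loading; a minimal sketch (`5.0` is only an illustrative value for smaller resolutions):

```py
# Sketch: lower the scheduler `shift` for smaller resolutions, as recommended above.
from diffusers import FlowMatchEulerDiscreteScheduler

pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, shift=5.0
)
```
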
## HunyuanVideoPipeline
[[autodoc]] HunyuanVideoPipeline

docs/source/en/api/pipelines/mochi.md

Lines changed: 4 additions & 4 deletions
@@ -27,9 +27,9 @@

## Quantization

-Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. Refer to the [Quantization](../../quantization/overview) to learn more about supported quantization backends and selecting a quantization backend that supports your use case.
+Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on video quality depending on the video model.

-The example below demonstrates how to load a quantized [`MochiPipeline`] for inference with bitsandbytes.
+Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`MochiPipeline`] for inference with bitsandbytes.

```py
import torch
```

@@ -61,12 +61,12 @@ pipeline = MochiPipeline.from_pretrained(

```py
    device_map="balanced",
)

-frames = pipeline(
+video = pipeline(
    "Close-up of a cats eye, with the galaxy reflected in the cats eye. Ultra high resolution 4k.",
    num_inference_steps=28,
    guidance_scale=3.5
).frames[0]
-export_to_video(frames, "cat.mp4")
+export_to_video(video, "cat.mp4")
```

## Generating videos with Mochi-1 Preview

docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_3.md

Lines changed: 41 additions & 0 deletions
@@ -268,6 +268,47 @@ image.save("sd3_hello_world.png")

Check out the full script [here](https://gist.github.com/sayakpaul/508d89d7aad4f454900813da5d42ca97).

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on output quality depending on the model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about the supported quantization backends (bitsandbytes, torchao, gguf) and how to pick one that fits your use case. The example below demonstrates how to load a quantized [`StableDiffusion3Pipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="text_encoder_3",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    text_encoder_3=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A refreshing scene where a glass of freshly squeezed orange juice stands prominently at the center, bathed in warm, golden sunlight that highlights the vibrant, citrus hues of the juice. The glass is intricately detailed, showing condensation droplets that glisten like tiny jewels. Surrounding the base of the glass, scattered orange slices and lush green leaves add a touch of natural beauty and freshness. Above the glass, a dynamic splash of orange juice is captured mid-air, forming the word 'Orange' in a fluid, playful script. The splash is so vivid and realistic that each droplet seems to dance in the air, creating a sense of movement and energy. In the background, a serene orchard with rows of orange trees stretches out under a clear blue sky, their branches heavy with ripe oranges ready for harvest. Rays of sunlight filter through the leaves, casting dappled shadows on the ground. A gentle breeze rustles the leaves, adding a sense of calm and tranquility to the scene. The entire scene evokes a sense of purity, freshness, and vitality, inviting viewers to experience the simple joy of a glass of fresh orange juice."
image = pipeline(prompt, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("sd3.png")
```

## Using Long Prompts with the T5 Text Encoder
By default, the T5 Text Encoder prompt uses a maximum sequence length of `256`. This can be adjusted by setting the `max_sequence_length` to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, such as during batch inference.
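
A short sketch of that knob, reusing the `pipeline` and `prompt` from the quantization example above (`512` is only an illustrative value):

```py
# Sketch: raise the T5 token limit for a long prompt; 512 is an illustrative value,
# and longer sequences cost more memory and time.
image = pipeline(
    prompt=prompt,
    max_sequence_length=512,
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
```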
