
Commit 91008aa

stevhliu and sayakpaul authored

[docs] Video generation update (#10272)

* update
* update
* feedback
* fix videos
* use previous checkpoint

Co-authored-by: Sayak Paul <[email protected]>
1 parent 0744378 commit 91008aa

File tree

2 files changed

+124 -112 lines changed


docs/source/en/_toctree.yml

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@
 - local: using-diffusers/inpaint
   title: Inpainting
 - local: using-diffusers/text-img2vid
-  title: Text or image-to-video
+  title: Video generation
 - local: using-diffusers/depth2img
   title: Depth-to-image
 title: Generative tasks

docs/source/en/using-diffusers/text-img2vid.md

Lines changed: 123 additions & 111 deletions
@@ -1,4 +1,4 @@
-<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.

 Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at
@@ -10,44 +10,34 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Text or image-to-video
+# Video generation

-Driven by the success of text-to-image diffusion models, generative video models are able to generate short clips of video from a text prompt or an initial image. These models extend a pretrained diffusion model to generate videos by adding some type of temporal and/or spatial convolution layer to the architecture. A mixed dataset of images and videos are used to train the model which learns to output a series of video frames based on the text or image conditioning.
+Video generation models include a temporal dimension to bring images, or frames, together to create a video. These models are trained on large-scale datasets of high-quality text-video pairs to learn how to combine the modalities to ensure the generated video is coherent and realistic.

-This guide will show you how to generate videos, how to configure video model parameters, and how to control video generation.
+[Explore](https://huggingface.co/models?other=video-generation) some of the more popular open-source video generation models available from Diffusers below.

-## Popular models
+<hfoptions id="popular-models">
+<hfoption id="CogVideoX">

-> [!TIP]
-> Discover other cool and trending video generation models on the Hub [here](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending)!
-
-[Stable Video Diffusions (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/), [AnimateDiff](https://huggingface.co/guoyww/animatediff), and [ModelScopeT2V](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.
+[CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) uses a 3D causal Variational Autoencoder (VAE) to compress videos along the spatial and temporal dimensions, and it includes a stack of expert transformer blocks with a 3D full attention mechanism to better capture visual, semantic, and motion information in the data.

-[CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) is another popular video generation model. The model is a multidimensional transformer that integrates text, time, and space. It employs full attention in the attention module and includes an expert block at the layer level to spatially align text and video.
+The CogVideoX family also includes models capable of generating videos from images and videos in addition to text. The image-to-video models are indicated by **I2V** in the checkpoint name, and they should be used with the [`CogVideoXImageToVideoPipeline`]. The regular checkpoints support video-to-video through the [`CogVideoXVideoToVideoPipeline`].

-### CogVideoX
-
-[CogVideoX](../api/pipelines/cogvideox) uses a 3D Variational Autoencoder (VAE) to compress videos along the spatial and temporal dimensions.
-
-Begin by loading the [`CogVideoXPipeline`] and passing an initial text or image to generate a video.
-<Tip>
-
-CogVideoX is available for image-to-video and text-to-video. [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) uses the [`CogVideoXImageToVideoPipeline`] for image-to-video. [THUDM/CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b) and [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) are available for text-to-video with the [`CogVideoXPipeline`].
-
-</Tip>
+The example below demonstrates how to generate a video from an image and text prompt with [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V).

 ```py
 import torch
 from diffusers import CogVideoXImageToVideoPipeline
 from diffusers.utils import export_to_video, load_image

 prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
-image = load_image(image="cogvideox_rocket.png")
+image = load_image(image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cogvideox/cogvideox_rocket.png")
 pipe = CogVideoXImageToVideoPipeline.from_pretrained(
     "THUDM/CogVideoX-5b-I2V",
     torch_dtype=torch.bfloat16
 )
-
+
+# reduce memory requirements
 pipe.vae.enable_tiling()
 pipe.vae.enable_slicing()

@@ -60,7 +50,6 @@ video = pipe(
     guidance_scale=6,
     generator=torch.Generator(device="cuda").manual_seed(42),
 ).frames[0]
-
 export_to_video(video, "output.mp4", fps=8)
 ```
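
> [!TIP]
> The regular (non-I2V) CogVideoX checkpoints referenced above also support video-to-video through the [`CogVideoXVideoToVideoPipeline`]. A minimal sketch, assuming the THUDM/CogVideoX-5b checkpoint and an illustrative input clip URL, prompt, and `strength` value:

```py
import torch
from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# regular (non-I2V) checkpoint, used here for video-to-video
pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# reduce memory requirements
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.to("cuda")

# illustrative input clip; substitute any short video
input_video = load_video("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4")
prompt = "A vibrant animated rendition of the scene, soft painterly style, smooth camera motion"

video = pipe(
    video=input_video,
    prompt=prompt,
    strength=0.8,  # how far the output may deviate from the input clip
    guidance_scale=6,
    num_inference_steps=50,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```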

@@ -75,102 +64,148 @@ export_to_video(video, "output.mp4", fps=8)
   </div>
 </div>

-
-### Stable Video Diffusion
+</hfoption>
+<hfoption id="HunyuanVideo">

-[SVD](../api/pipelines/svd) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd) guide.
+> [!TIP]
+> HunyuanVideo is a 13B parameter model and requires a lot of memory. Refer to the HunyuanVideo [Quantization](../api/pipelines/hunyuan_video#quantization) guide to learn how to quantize the model. CogVideoX and LTX-Video are more lightweight options that can still generate high-quality videos.

-Begin by loading the [`StableVideoDiffusionPipeline`] and passing an initial image to generate a video from.
+[HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo) features a dual-stream to single-stream diffusion transformer (DiT) for learning video and text tokens separately, and then subsequently concatenating the video and text tokens to combine their information. A single multimodal large language model (MLLM) serves as the text encoder, and videos are also spatio-temporally compressed with a 3D causal VAE.

 ```py
 import torch
-from diffusers import StableVideoDiffusionPipeline
-from diffusers.utils import load_image, export_to_video
+from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
+from diffusers.utils import export_to_video

-pipeline = StableVideoDiffusionPipeline.from_pretrained(
-    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    "tencent/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
+)
+pipe = HunyuanVideoPipeline.from_pretrained(
+    "tencent/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
 )
-pipeline.enable_model_cpu_offload()

-image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
-image = image.resize((1024, 576))
+# reduce memory requirements
+pipe.vae.enable_tiling()
+pipe.to("cuda")

-generator = torch.manual_seed(42)
-frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
-export_to_video(frames, "generated.mp4", fps=7)
+video = pipe(
+    prompt="A cat walks on the grass, realistic",
+    height=320,
+    width=512,
+    num_frames=61,
+    num_inference_steps=30,
+).frames[0]
+export_to_video(video, "output.mp4", fps=15)
 ```

-<div class="flex gap-4">
-  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
-  </div>
-  <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
-  </div>
+<div class="flex justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hunyuan-video-output.gif"/>
 </div>

-### I2VGen-XL
-
-[I2VGen-XL](../api/pipelines/i2vgenxl) is a diffusion model that can generate higher resolution videos than SVD and it is also capable of accepting text prompts in addition to images. The model is trained with two hierarchical encoders (detail and global encoder) to better capture low and high-level details in images. These learned details are used to train a video diffusion model which refines the video resolution and details in the generated video.
+</hfoption>
+<hfoption id="LTX-Video">

-You can use I2VGen-XL by loading the [`I2VGenXLPipeline`], and passing a text and image prompt to generate a video.
+[LTX-Video (LTXV)](https://huggingface.co/Lightricks/LTX-Video) is a diffusion transformer (DiT) with a focus on speed. It generates 768x512 resolution videos at 24 frames per second (fps), enabling near real-time generation of high-quality videos. LTXV is relatively lightweight compared to other modern video generation models, making it possible to run on consumer GPUs.

 ```py
 import torch
-from diffusers import I2VGenXLPipeline
-from diffusers.utils import export_to_gif, load_image
-
-pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
-pipeline.enable_model_cpu_offload()
-
-image_url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
-image = load_image(image_url).convert("RGB")
+from diffusers import LTXPipeline
+from diffusers.utils import export_to_video

-prompt = "Papers were floating in the air on a table in the library"
-negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
-generator = torch.manual_seed(8888)
+pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

-frames = pipeline(
+prompt = "A man walks towards a window, looks out, and then turns around. He has short, dark hair, dark skin, and is wearing a brown coat over a red and gray scarf. He walks from left to right towards a window, his gaze fixed on something outside. The camera follows him from behind at a medium distance. The room is brightly lit, with white walls and a large window covered by a white curtain. As he approaches the window, he turns his head slightly to the left, then back to the right. He then turns his entire body to the right, facing the window. The camera remains stationary as he stands in front of the window. The scene is captured in real-life footage."
+video = pipe(
     prompt=prompt,
-    image=image,
+    width=704,
+    height=480,
+    num_frames=161,
     num_inference_steps=50,
-    negative_prompt=negative_prompt,
-    guidance_scale=9.0,
-    generator=generator
 ).frames[0]
-export_to_gif(frames, "i2v.gif")
+export_to_video(video, "output.mp4", fps=24)
+```
+
+<div class="flex justify-center">
+  <img src="https://huggingface.co/Lightricks/LTX-Video/resolve/main/media/ltx-video_example_00014.gif"/>
+</div>
+
+</hfoption>
+<hfoption id="Mochi-1">
+
+> [!TIP]
+> Mochi-1 is a 10B parameter model and requires a lot of memory. Refer to the Mochi [Quantization](../api/pipelines/mochi#quantization) guide to learn how to quantize the model. CogVideoX and LTX-Video are more lightweight options that can still generate high-quality videos.
+
+[Mochi-1](https://huggingface.co/genmo/mochi-1-preview) introduces the Asymmetric Diffusion Transformer (AsymmDiT) and Asymmetric Variational Autoencoder (AsymmVAE) to reduce memory requirements. AsymmVAE causally compresses videos 128x to improve memory efficiency, and AsymmDiT jointly attends to the compressed video tokens and user text tokens. This model is noted for generating videos with high-quality motion dynamics and strong prompt adherence.
+
+```py
+import torch
+from diffusers import MochiPipeline
+from diffusers.utils import export_to_video
+
+pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16)
+
+# reduce memory requirements
+pipe.enable_model_cpu_offload()
+pipe.enable_vae_tiling()
+
+prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
+video = pipe(prompt, num_frames=84).frames[0]
+export_to_video(video, "output.mp4", fps=30)
+```
+
+<div class="flex justify-center">
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/mochi-video-output.gif"/>
+</div>
+
+</hfoption>
+<hfoption id="StableVideoDiffusion">
+
+[StableVideoDiffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image.
+
+```py
+import torch
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipeline = StableVideoDiffusionPipeline.from_pretrained(
+    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+
+# reduce memory requirements
+pipeline.enable_model_cpu_offload()
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
 ```

 <div class="flex gap-4">
   <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"/>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
     <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
   </div>
   <div>
-    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"/>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
     <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
   </div>
 </div>

-### AnimateDiff
+</hfoption>
+<hfoption id="AnimateDiff">

-[AnimateDiff](../api/pipelines/animatediff) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into "video models".
+[AnimateDiff](https://huggingface.co/guoyww/animatediff) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion which is used to condition the generation process to create a video. It is faster and easier to only train the adapter and it can be loaded into most diffusion models, effectively turning them into video models.

-Start by loading a [`MotionAdapter`].
+Load a `MotionAdapter` and pass it to the [`AnimateDiffPipeline`].

 ```py
 import torch
 from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
 from diffusers.utils import export_to_gif

 adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
-```
-
-Then load a finetuned Stable Diffusion model with the [`AnimateDiffPipeline`].
-
-```py
 pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
 scheduler = DDIMScheduler.from_pretrained(
     "emilianJR/epiCRealism",
@@ -181,13 +216,11 @@ scheduler = DDIMScheduler.from_pretrained(
     steps_offset=1,
 )
 pipeline.scheduler = scheduler
+
+# reduce memory requirements
 pipeline.enable_vae_slicing()
 pipeline.enable_model_cpu_offload()
-```

-Create a prompt and generate the video.
-
-```py
 output = pipeline(
     prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
     negative_prompt="bad quality, worse quality, low resolution",
@@ -201,38 +234,11 @@ export_to_gif(frames, "animation.gif")
 ```

 <div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff.gif"/>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff.gif"/>
 </div>

-### ModelscopeT2V
-
-[ModelscopeT2V](../api/pipelines/text_to_video) adds spatial and temporal convolutions and attention to a UNet, and it is trained on image-text and video-text datasets to enhance what it learns during training. The model takes a prompt, encodes it and creates text embeddings which are denoised by the UNet, and then decoded by a VQGAN into a video.
-
-<Tip>
-
-ModelScopeT2V generates watermarked videos due to the datasets it was trained on. To use a watermark-free model, try the [cerspense/zeroscope_v2_76w](https://huggingface.co/cerspense/zeroscope_v2_576w) model with the [`TextToVideoSDPipeline`] first, and then upscale it's output with the [cerspense/zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) checkpoint using the [`VideoToVideoSDPipeline`].
-
-</Tip>
-
-Load a ModelScopeT2V checkpoint into the [`DiffusionPipeline`] along with a prompt to generate a video.
-
-```py
-import torch
-from diffusers import DiffusionPipeline
-from diffusers.utils import export_to_video
-
-pipeline = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
-pipeline.enable_model_cpu_offload()
-pipeline.enable_vae_slicing()
-
-prompt = "Confident teddy bear surfer rides the wave in the tropics"
-video_frames = pipeline(prompt).frames[0]
-export_to_video(video_frames, "modelscopet2v.mp4", fps=10)
-```
-
-<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/modelscopet2v.gif" />
-</div>
+</hfoption>
+</hfoptions>

 ## Configure model parameters

@@ -548,3 +554,9 @@ If memory is not an issue and you want to optimize for speed, try wrapping the U
 + pipeline.to("cuda")
 + pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
 ```
+
+## Quantization
+
+Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
+
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends (bitsandbytes, torchao, gguf) and how to select a quantization backend that supports your use case.
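
> [!TIP]
> As a sketch of how a quantization backend plugs into a video pipeline, assuming the bitsandbytes backend and the HunyuanVideo checkpoint used earlier (the 4-bit settings are illustrative), the transformer can be quantized at load time and passed to the pipeline:

```py
import torch
from diffusers import BitsAndBytesConfig, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# illustrative 4-bit bitsandbytes config; only the transformer is quantized here
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "tencent/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)

# reduce memory requirements
pipe.vae.enable_tiling()
pipe.enable_model_cpu_offload()

video = pipe(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

The text encoder and VAE are left in their regular precision; quantizing only the transformer trades a small quality risk for a large reduction in the memory footprint of the largest component.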
