Commit 73c640f ("feedback"), 1 parent ac6939f

1 file changed: docs/source/en/using-diffusers/text-img2vid.md (89 additions, 5 deletions)
@@ -12,18 +12,18 @@ specific language governing permissions and limitations under the License.

# Video generation

-Video generation models add a temporal dimension to image generation models to bring the images, or frames, together to create a video. These models are trained on large-scale datasets of high-quality text-video pairs to learn how to combine the modalities to ensure the generated video is coherent and realistic.
+Video generation models add a temporal dimension to bring images, or frames, together to create a video. These models are trained on large-scale datasets of high-quality text-video pairs to learn how to combine the modalities to ensure the generated video is coherent and realistic.

-Explore some of the more popular open-source video generation models available from Diffusers below.
+[Explore](https://huggingface.co/models?other=video-generation) some of the more popular open-source video generation models available from Diffusers below.

<hfoptions id="popular-models">
<hfoption id="CogVideoX">

[CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) uses a 3D causal Variational Autoencoder (VAE) to compress videos along the spatial and temporal dimensions, and it includes a stack of expert transformer blocks with a 3D full attention mechanism to better capture visual, semantic, and motion information in the data.

-The CogVideoX family also includes models capable of generating videos from images in addition to text. These models are indicated by **I2V** in the checkpoint name, and they should be used with the [`CogVideoXImageToVideoPipeline`].
+The CogVideoX family also includes models capable of generating videos from images and videos in addition to text. The image-to-video models are indicated by **I2V** in the checkpoint name, and they should be used with the [`CogVideoXImageToVideoPipeline`]. The regular checkpoints support video-to-video through the [`CogVideoXVideoToVideoPipeline`].

-The example below demonstrates how to generate a video from an image and text prompt with [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V).
+The example below demonstrates how to generate a video from an image and text prompt with [THUDM/CogVideoX1.5-5B-I2V](https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V).

```py
import torch
@@ -33,7 +33,7 @@ from diffusers.utils import export_to_video, load_image
prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cogvideox/cogvideox_rocket.png")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
-    "THUDM/CogVideoX-5b-I2V",
+    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.bfloat16
)
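
For the video-to-video support mentioned above, a minimal sketch with `CogVideoXVideoToVideoPipeline` could look like the following (assuming the regular THUDM/CogVideoX-5b checkpoint, a placeholder input clip, and an illustrative prompt):

```py
import torch
from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

pipe = CogVideoXVideoToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)

# reduce memory requirements
pipe.vae.enable_tiling()
pipe.to("cuda")

# placeholder clip; replace with any short video you want to restyle
video = load_video("input.mp4")
prompt = "A rocket lifting off from a launch pad at sunset, cinematic lighting"

# strength controls how far the output may deviate from the input video
video = pipe(video=video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Lower `strength` values keep the output closer to the input clip.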

@@ -67,6 +67,9 @@ export_to_video(video, "output.mp4", fps=8)
</hfoption>
<hfoption id="HunyuanVideo">

+> [!TIP]
+> HunyuanVideo is a 13B parameter model and requires a lot of memory. Refer to the HunyuanVideo [Quantization](../api/pipelines/hunyuan_video#quantization) guide to learn how to quantize the model. CogVideoX and LTX-Video are more lightweight options that can still generate high-quality videos.
+
[HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo) features a dual-stream to single-stream diffusion transformer (DiT) for learning video and text tokens separately, and then concatenating the video and text tokens to combine their information. A single multimodal large language model (MLLM) serves as the text encoder, and videos are also spatio-temporally compressed with a 3D causal VAE.

```py
@@ -80,6 +83,8 @@ transformer = HunyuanVideoTransformer3DModel.from_pretrained(
pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)
+
+# reduce memory requirements
pipe.vae.enable_tiling()
pipe.to("cuda")
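
The TIP above points to the HunyuanVideo quantization guide for shrinking the 13B transformer. A minimal sketch of that route (assuming the bitsandbytes backend is installed; 4-bit NF4 shown here, and the example prompt is illustrative) might look like this. The same pattern applies to other large video transformers such as Mochi-1.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# quantize only the transformer, which holds most of the parameters
quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16
)
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "tencent/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
pipe = HunyuanVideoPipeline.from_pretrained(
    "tencent/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16, device_map="balanced"
)

# reduce memory requirements
pipe.vae.enable_tiling()

video = pipe(prompt="A cat walks on the grass, realistic style.", num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```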

@@ -127,6 +132,9 @@ export_to_video(video, "output.mp4", fps=24)
</hfoption>
<hfoption id="Mochi-1">

+> [!TIP]
+> Mochi-1 is a 10B parameter model and requires a lot of memory. Refer to the Mochi [Quantization](../api/pipelines/mochi#quantization) guide to learn how to quantize the model. CogVideoX and LTX-Video are more lightweight options that can still generate high-quality videos.
+
[Mochi-1](https://huggingface.co/genmo/mochi-1-preview) introduces the Asymmetric Diffusion Transformer (AsymmDiT) and Asymmetric Variational Autoencoder (AsymmVAE) to reduce memory requirements. AsymmVAE causally compresses videos 128x to improve memory efficiency, and AsymmDiT jointly attends to the compressed video tokens and user text tokens. This model is noted for generating videos with high-quality motion dynamics and strong prompt adherence.

```py
@@ -149,6 +157,82 @@ export_to_video(video, "output.mp4", fps=30)
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/mochi-video-output.gif"/>
</div>

+</hfoption>
+<hfoption id="StableVideoDiffusion">
+
+[StableVideoDiffusion (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) is based on the Stable Diffusion 2.1 model and is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image.
+
+```py
+import torch
+from diffusers import StableVideoDiffusionPipeline
+from diffusers.utils import load_image, export_to_video
+
+pipeline = StableVideoDiffusionPipeline.from_pretrained(
+    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
+)
+
+# reduce memory requirements
+pipeline.enable_model_cpu_offload()
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
+image = image.resize((1024, 576))
+
+generator = torch.manual_seed(42)
+frames = pipeline(image, decode_chunk_size=8, generator=generator).frames[0]
+export_to_video(frames, "generated.mp4", fps=7)
+```
+
+<div class="flex gap-4">
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
+  </div>
+  <div>
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
+  </div>
+</div>
+
+</hfoption>
+<hfoption id="AnimateDiff">
+
+[AnimateDiff](https://huggingface.co/guoyww/animatediff) is an adapter model that inserts a motion module into a pretrained diffusion model to animate an image. The adapter is trained on video clips to learn motion, which is used to condition the generation process to create a video. Training only the adapter is faster and easier, and it can be loaded into most diffusion models, effectively turning them into “video models”.
+
+Load a `MotionAdapter` and pass it to the [`AnimateDiffPipeline`].
+
+```py
+import torch
+from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
+from diffusers.utils import export_to_gif
+
+adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
+pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16)
+scheduler = DDIMScheduler.from_pretrained(
+    "emilianJR/epiCRealism",
+    subfolder="scheduler",
+    clip_sample=False,
+    timestep_spacing="linspace",
+    beta_schedule="linear",
+    steps_offset=1,
+)
+pipeline.scheduler = scheduler
+
+# reduce memory requirements
+pipeline.enable_vae_slicing()
+pipeline.enable_model_cpu_offload()
+
+output = pipeline(
+    prompt="A space rocket with trails of smoke behind it launching into space from the desert, 4k, high resolution",
+    negative_prompt="bad quality, worse quality, low resolution",
+    num_frames=16,
+    guidance_scale=7.5,
+    num_inference_steps=50,
+    generator=torch.Generator("cpu").manual_seed(49),
+)
+frames = output.frames[0]
+export_to_gif(frames, "animation.gif")
+```
+
</hfoption>
</hfoptions>