
Commit c70d203: CogVideoX docs
1 parent 99f6082

File tree

7 files changed: +451 -0 lines changed


docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -75,6 +75,8 @@
      title: Outpainting
    title: Advanced inference
  - sections:
+   - local: using-diffusers/cogvideox
+     title: CogVideoX
    - local: using-diffusers/sdxl
      title: Stable Diffusion XL
    - local: using-diffusers/sdxl_turbo
@@ -129,6 +131,8 @@
      title: T2I-Adapters
    - local: training/instructpix2pix
      title: InstructPix2Pix
+   - local: training/cogvideox
+     title: CogVideoX
    title: Models
  - isExpanded: false
    sections:
3 image files added (5.32 MB, 1.34 MB, 376 KB)

docs/source/en/training/cogvideox.md

Lines changed: 245 additions & 0 deletions
Large diffs are not rendered by default.
docs/source/en/using-diffusers/cogvideox.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# CogVideoX

CogVideoX is an open-source version of the video generation model that powers QingYing. It is available as text-to-video checkpoints ([THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b) and [THUDM/CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b)) and an image-to-video checkpoint ([THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V)).

## Load model checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the `from_pretrained()` method:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline

# text-to-video checkpoint
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# image-to-video checkpoint
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)
```
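
If a checkpoint has already been downloaded to disk, `from_pretrained()` also accepts a local directory path. A minimal sketch, assuming the hypothetical local folder `./CogVideoX-2b`:

```py
import torch
from diffusers import CogVideoXPipeline

# assumes the checkpoint was saved locally beforehand, for example with
# pipe.save_pretrained("./CogVideoX-2b") or
# `huggingface-cli download THUDM/CogVideoX-2b --local-dir ./CogVideoX-2b`
pipe = CogVideoXPipeline.from_pretrained(
    "./CogVideoX-2b",  # hypothetical local directory
    torch_dtype=torch.float16
)
```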

## Text-to-Video

For text-to-video, pass a text prompt. By default, CogVideoX generates a 720 x 480 video for the best results.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex justify-center">
  <img src="docs/source/en/imgs/cogvideox_out.gif" alt="generated video of an elderly artist painting at the water's edge"/>
</div>

## Image-to-Video

CogVideoX also has an image-to-video variant, [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V), which generates a video from an initial image and a text prompt. The example below generates a 49-frame video at 8 fps.

You'll use the CogVideoX-5b-I2V checkpoint for this guide.

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="cogvideox_rocket.png")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_rocket.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_outrocket.gif"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
  </div>
</div>

## Reduce memory usage

During testing with the diffusers library, all of the memory optimizations available in diffusers were enabled. Actual memory usage has not been verified on devices other than **NVIDIA A100 / H100** GPUs, but the same scheme should generally work on any **NVIDIA Ampere architecture** or newer device. If the optimizations are disabled, peak memory usage grows to roughly 3 times the measured value, while inference speed increases by about 3-4 times. You can selectively enable or disable the following optimizations:

```py
pipe.enable_sequential_cpu_offload()  # keep submodules on the CPU and move them to the GPU only when needed
pipe.vae.enable_slicing()             # decode the latents in slices to lower peak memory
pipe.vae.enable_tiling()              # decode the latents tile by tile so large frames fit in memory
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference. The trade-off accommodates GPUs with less memory while keeping the loss in video quality minimal, at the cost of a significant drop in inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend running inference in the precision the model was trained in.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX, allowing the model to run on a free T4 Colab or on GPUs with less memory. TorchAO quantization is also fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8` precision requires an NVIDIA H100 or newer device, along with source installations of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages; CUDA 12.4 is recommended. A quantization sketch is shown after this list.
+ The inference speed tests also used the memory optimization scheme above. Without memory optimization, inference speed increases by about 10%. Only the `diffusers` versions of the checkpoints support quantization.
+ The model only supports English input; prompts in other languages can be translated into English, for example with a large language model, before use.
+ Memory usage for model fine-tuning was tested in an `8 * H100` environment, where the training script automatically uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number of GPUs or more must be used for fine-tuning.
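
As a concrete illustration of the PytorchAO point, the sketch below quantizes the CogVideoX-5b transformer with int8 weight-only quantization before building the pipeline. The checkpoint, the `int8_weight_only` config, and the CPU offload call are illustrative choices rather than the only supported setup, and the exact `torchao` API may vary between versions:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# quantize only the transformer here; the text encoder and VAE can be handled the same way
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# build the pipeline around the quantized transformer
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

# optional: TorchAO quantization is reported to work with torch.compile
# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```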

docs/source/en/using-diffusers/text-img2vid.md

Lines changed: 53 additions & 0 deletions
@@ -23,6 +23,59 @@ This guide will show you how to generate videos, how to configure video model pa

[Stable Video Diffusions (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/), [AnimateDiff](https://huggingface.co/guoyww/animatediff), and [ModelScopeT2V](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.

[CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) is another popular model used for video generation. It is a multi-dimensional transformer that integrates text, time, and space. Instead of a traditional cross-attention mechanism, its attention module uses full attention, and an expert block at the layer level spatially aligns the two modalities of text and video.

### CogVideoX

[CogVideoX](../api/pipelines/cogvideox) uses a 3D Variational Autoencoder (VAE) to compress videos along both the spatial and temporal dimensions.

Begin by loading the [`CogVideoXPipeline`] and passing an initial text prompt or image to generate a video.

<Tip>

CogVideoX offers several checkpoints. The image-to-video checkpoint, [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V), uses the [`CogVideoXImageToVideoPipeline`], while the text-to-video checkpoints, [THUDM/CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b) and [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b), use the [`CogVideoXPipeline`].

</Tip>

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="cogvideox_rocket.png")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_rocket.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_outrocket.gif"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
  </div>
</div>
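
The Tip above also mentions the text-to-video checkpoints. A minimal sketch of using one of them with the [`CogVideoXPipeline`] (the prompt and generation settings here are only illustrative) could look like this:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# illustrative prompt; any descriptive English prompt works
prompt = "A rocket lifts off from a seaside launch pad at sunset, smoke billowing around it."

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]
export_to_video(video, "text_to_video.mp4", fps=8)
```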

### Stable Video Diffusion

[SVD](../api/pipelines/svd) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about the model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd) guide.
