Merged

27 commits
c70d203
CogVideoX docs
glide-the Oct 4, 2024
3b8bea2
mv images to https://huggingface.co/datasets/huggingface/documentatio…
glide-the Oct 7, 2024
58b6157
Update docs/source/en/_toctree.yml
glide-the Oct 11, 2024
7c621b7
Update docs/source/en/using-diffusers/text-img2vid.md
glide-the Oct 11, 2024
1040fe2
Update docs/source/en/using-diffusers/text-img2vid.md
glide-the Oct 11, 2024
b681aa5
Update docs/source/en/using-diffusers/text-img2vid.md
glide-the Oct 11, 2024
aeb52ed
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
6731754
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
1ed46ff
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
96d673f
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
0159b43
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
087fa97
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
e8b377e
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
72aebcf
Update docs/source/en/training/cogvideox.md
glide-the Oct 11, 2024
4fd19f6
Update docs/source/en/using-diffusers/cogvideox.md
glide-the Oct 11, 2024
d8a9a8f
Update docs/source/en/using-diffusers/cogvideox.md
glide-the Oct 11, 2024
6107be1
Update docs/source/en/using-diffusers/cogvideox.md
glide-the Oct 11, 2024
a940038
Update docs/source/en/using-diffusers/cogvideox.md
glide-the Oct 11, 2024
b0d4146
Update CogVideoX training documentation
glide-the Oct 11, 2024
23232f7
Reduce memory usage and update training documentation
glide-the Oct 11, 2024
ab169be
update cogvideoxmd
glide-the Oct 11, 2024
1034de0
Update docs/source/en/training/cogvideox.md
glide-the Oct 13, 2024
0c31092
Update docs/source/en/training/cogvideox.md
glide-the Oct 13, 2024
7149a16
Update docs/source/en/training/cogvideox.md
glide-the Oct 13, 2024
4b10b0c
Update docs/source/en/training/cogvideox.md
glide-the Oct 13, 2024
4badd47
Update CogVideoX documentation with improved text-to-video generation…
glide-the Oct 13, 2024
e454c95
Merge branch 'main' into doc_cogvideox
yiyixuxu Oct 15, 2024
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -75,6 +75,8 @@
title: Outpainting
title: Advanced inference
- sections:
- local: using-diffusers/cogvideox
title: CogVideoX
- local: using-diffusers/sdxl
title: Stable Diffusion XL
- local: using-diffusers/sdxl_turbo
@@ -129,6 +131,8 @@
title: T2I-Adapters
- local: training/instructpix2pix
title: InstructPix2Pix
- local: training/cogvideox
title: CogVideoX
title: Models
- isExpanded: false
sections:
Binary file added docs/source/en/imgs/cogvideox_out.gif
Binary file added docs/source/en/imgs/cogvideox_outrocket.gif
Binary file added docs/source/en/imgs/cogvideox_rocket.png
245 changes: 245 additions & 0 deletions docs/source/en/training/cogvideox.md

Large diffs are not rendered by default.

149 changes: 149 additions & 0 deletions docs/source/en/using-diffusers/cogvideox.md
@@ -0,0 +1,149 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CogVideoX
CogVideoX is an open-source version of the video generation model originating from QingYing. The sections below describe the available CogVideoX checkpoints and how to use them for text-to-video and image-to-video generation.

Member:

It would be nice to briefly describe the technical aspects of CogVideoX so users have a better idea of how it works and what makes it different from other models (check out the Stable Diffusion XL doc as an example).

Member:

Maybe something like (feel free to copy/reuse in the training doc as well):

CogVideoX is a text-to-video generation model focused on creating more coherent videos aligned with a prompt. It achieves this using several methods.

  • a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy.

  • an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.



## Load model checkpoints
Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the `from_pretrained()` method:


```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline

# text-to-video checkpoint
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# image-to-video checkpoint
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)
```
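
If you want to see how the components described in the review comment above (the 3D VAE, the expert transformer, and the text encoder) map onto the pipeline object, a quick optional sketch is to print the component classes of the loaded `pipe`; the class names in the comments reflect current diffusers naming and may change between releases:

```py
# Inspect the main pipeline components: the 3D causal VAE, the expert
# transformer, and the T5 text encoder used to embed the prompt.
print(type(pipe.vae).__name__)           # e.g. AutoencoderKLCogVideoX
print(type(pipe.transformer).__name__)   # e.g. CogVideoXTransformer3DModel
print(type(pipe.text_encoder).__name__)  # e.g. T5EncoderModel
```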

## Text-to-Video
For text-to-video, pass a text prompt. By default, CogVideoX generates a 720x480 video for the best results.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```


<div class="flex justify-center">
<img src="docs/source/en/imgs/cogvideox_out.gif" alt="generated image of an astronaut in a jungle"/>
</div>


## Image-to-Video


CogVideoX also has an image-to-video variant, [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V), which conditions generation on an initial image in addition to a text prompt. The example below generates a 49-frame video.

You'll use the CogVideoX-5b-I2V checkpoint for this guide.

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="cogvideox_rocket.png")

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex gap-4">
<div>
<img class="rounded-xl" src="docs/source/en/imgs/cogvideox_rocket.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
</div>
<div>
<img class="rounded-xl" src="docs/source/en/imgs/cogvideox_outrocket.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
</div>
</div>


## Reduce memory usage

While testing with the diffusers library, all of the memory optimizations available in diffusers were enabled. This
scheme has not been tested for actual memory usage on devices outside of **NVIDIA A100 / H100** architectures.
Generally, it can be adapted to all **NVIDIA Ampere architecture** and newer devices. If the optimizations are
disabled, memory consumption multiplies, with peak memory usage roughly 3 times higher, while speed increases by about
3-4 times. You can selectively disable some optimizations, including:

Member:

I think it's clearer to explicitly list which optimizations were included in the testing. Also, I don't see the table with these testing values.

```py
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference. This is done to accommodate lower-memory GPUs while keeping the loss in video
quality minimal.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision.
We recommend using the precision in which the model was trained for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
allows the model to run on a free T4 Colab or on GPUs with less memory! Also, note that TorchAO quantization is fully
compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision must be used on
NVIDIA H100 and above devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python
packages from source. CUDA 12.4 is recommended. A quantization sketch is shown after this list.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed
increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages should be translated into English (for example,
with a large language model) before use.
+ The memory usage of model fine-tuning was tested in an `8 * H100` environment, and the program automatically
uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used
for fine-tuning.
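
As an illustration of the PytorchAO route mentioned above, a hedged sketch using int8 weight-only quantization of the transformer (assuming a recent `torchao` is installed; the prompt is a placeholder and other quantization schemes are available) could look like this:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize the transformer weights to int8 to lower the memory footprint,
# then enable CPU offloading and VAE tiling as in the examples above.
quantize_(pipe.transformer, int8_weight_only())
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A rocket slowly lifting off from a launch pad at dawn",  # placeholder prompt
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```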
53 changes: 53 additions & 0 deletions docs/source/en/using-diffusers/text-img2vid.md
@@ -23,6 +23,59 @@ This guide will show you how to generate videos, how to configure video model pa

[Stable Video Diffusions (SVD)](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid), [I2VGen-XL](https://huggingface.co/ali-vilab/i2vgen-xl/), [AnimateDiff](https://huggingface.co/guoyww/animatediff), and [ModelScopeT2V](https://huggingface.co/ali-vilab/text-to-video-ms-1.7b) are popular models used for video diffusion. Each model is distinct. For example, AnimateDiff inserts a motion modeling module into a frozen text-to-image model to generate personalized animated images, whereas SVD is entirely pretrained from scratch with a three-stage training process to generate short high-quality videos.

[CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) is another popular video generation model. The model uses a multi-dimensional transformer architecture that integrates text, time, and space. Unlike traditional cross-attention mechanisms, this architecture employs full attention in the attention module and includes an expert block at the layer level to align the two different modalities of text and video.

### CogVideoX

[CogVideoX](../api/pipelines/cogvideox) uses a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions.

Begin by loading the [`CogVideoXPipeline`] or [`CogVideoXImageToVideoPipeline`] and passing an initial text prompt or image to generate a video from.
<Tip>

CogVideoX offers several checkpoints. The image-to-video checkpoint [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) uses the [`CogVideoXImageToVideoPipeline`], while the text-to-video checkpoints, [THUDM/CogVideoX-5b](https://huggingface.co/THUDM/CogVideoX-5b) and [THUDM/CogVideoX-2b](https://huggingface.co/THUDM/CogVideoX-2b), use the [`CogVideoXPipeline`].

</Tip>

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="cogvideox_rocket.png")

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex gap-4">
<div>
<img class="rounded-xl" src="docs/source/en/imgs/cogvideox_rocket.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
</div>
<div>
<img class="rounded-xl" src="docs/source/en/imgs/cogvideox_outrocket.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
</div>
</div>
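
For the text-to-video checkpoints mentioned in the tip above, usage is analogous; a minimal sketch with the [`CogVideoXPipeline`] (same generation parameters as the image-to-video example, with a placeholder prompt and no input image) might look like:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A rocket slowly lifting off from a launch pad at dawn, smoke billowing around it",  # placeholder prompt
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```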


### Stable Video Diffusion

[SVD](../api/pipelines/svd) is based on the Stable Diffusion 2.1 model and it is trained on images, then low-resolution videos, and finally a smaller dataset of high-resolution videos. This model generates a short 2-4 second video from an initial image. You can learn more details about model, like micro-conditioning, in the [Stable Video Diffusion](../using-diffusers/svd) guide.