<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# CogVideoX

CogVideoX is an open-source version of the video generation model originating from QingYing.

## Load model checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the `from_pretrained()` method:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline

# text-to-video checkpoint
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)

# image-to-video checkpoint
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)
```
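
If a checkpoint has already been downloaded to disk, `from_pretrained()` also accepts a local directory. This is a minimal sketch; the path below is hypothetical:

```py
import torch
from diffusers import CogVideoXPipeline

# load from a local directory (hypothetical path) that contains
# model_index.json and the usual subfolders (transformer, vae, text_encoder, ...)
pipe = CogVideoXPipeline.from_pretrained(
    "./checkpoints/CogVideoX-2b",
    torch_dtype=torch.float16
)
```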

## Text-to-Video

For text-to-video, pass a text prompt. By default, CogVideoX generates a 720x480 video for the best results.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "An elderly gentleman, with a serene expression, sits at the water's edge, a steaming cup of tea by his side. He is engrossed in his artwork, brush in hand, as he renders an oil painting on a canvas that's propped up against a small, weathered table. The sea breeze whispers through his silver hair, gently billowing his loose-fitting white shirt, while the salty air adds an intangible element to his masterpiece in progress. The scene is one of tranquility and inspiration, with the artist's canvas capturing the vibrant hues of the setting sun reflecting off the tranquil sea."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

# offload model components to the CPU when idle and decode the VAE in tiles
# to reduce peak GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex justify-center">
  <img src="docs/source/en/imgs/cogvideox_out.gif" alt="generated video of an elderly artist painting at the water's edge"/>
</div>

## Image-to-Video

There are two variants of this model; you'll use the [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) checkpoint for this guide. It is trained to generate 49-frame videos, matching the `num_frames=49` setting used below.

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
image = load_image(image="cogvideox_rocket.png")
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V",
    torch_dtype=torch.bfloat16
)

# offload model components to the CPU when idle, and decode the VAE in tiles
# and slices to reduce peak GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_rocket.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">initial image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="docs/source/en/imgs/cogvideox_outrocket.gif"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">generated video</figcaption>
  </div>
</div>

## Reduce memory usage

All memory optimizations included in the diffusers library were enabled during testing. Actual memory usage has not
been verified on devices outside of **NVIDIA A100 / H100** architectures, but the scheme should generally adapt to
all devices with the **NVIDIA Ampere architecture** and above. If the optimizations are disabled, peak memory usage
is roughly 3 times higher, while speed increases by about 3-4x. You can selectively enable or disable the following
optimizations:

```py
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled (see the multi-GPU sketch after this list).
+ Using INT8 models accommodates lower-memory GPUs with minimal loss in video quality, but inference is significantly slower.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend running inference in the precision the model was trained in.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX, which allows the model to run on a free T4 Colab or on GPUs with less memory. Note that TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed (see the quantization sketch after this list). FP8 precision must be used on devices with NVIDIA H100 and above, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. CUDA 12.4 is recommended.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages should be translated into English (for example, with a large language model) before use.
+ Memory usage for model fine-tuning was tested in an `8 * H100` environment, and the program automatically uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used for fine-tuning.
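
As a concrete example of the quantization bullet above, here is a minimal sketch of weight-only INT8 quantization of the transformer with TorchAO. It assumes a recent `torchao` release that exposes the `quantize_` / `int8_weight_only` API and a diffusers version that accepts a pre-quantized transformer; treat it as one possible recipe rather than the only supported one:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import int8_weight_only, quantize_

# quantize the transformer weights to INT8 before assembling the pipeline
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

# optional: TorchAO quantization composes with torch.compile for extra speed
pipe.transformer = torch.compile(pipe.transformer)
```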
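
For the multi-GPU bullet, a hedged sketch follows: it assumes a diffusers version that supports the pipeline-level `device_map="balanced"` option, which shards components across the visible GPUs, and it deliberately does not call `enable_sequential_cpu_offload()`:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# shard pipeline components across all visible GPUs; do not combine this
# with enable_sequential_cpu_offload()
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)

video = pipe(prompt="A panda playing a guitar in a bamboo forest", num_frames=49).frames[0]
export_to_video(video, "output.mp4", fps=8)
```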