
Commit 23232f7

Reduce memory usage and update training documentation
1 parent b0d4146 commit 23232f7

File tree

2 files changed: +48 -36 lines


docs/source/en/training/cogvideox.md

Lines changed: 45 additions & 0 deletions
@@ -239,3 +239,48 @@ prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its wa
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```

## Reduce memory usage

When testing with the diffusers library, all memory optimizations available in the library were enabled. This scheme has not been tested for actual memory usage on devices other than **NVIDIA A100 / H100** GPUs, but it should generally work on all devices with the **NVIDIA Ampere architecture** and above. Disabling the optimizations roughly triples peak memory usage compared to the values in the table below, while increasing speed by about 3-4x. You can selectively disable some of the optimizations, including:

```py
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
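
Putting the pieces together, a minimal end-to-end sketch with these optimizations enabled might look like the following; the model ID and prompt are illustrative assumptions, not prescribed by this guide:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load in the precision the model was trained in (BF16 for CogVideoX-5B).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Memory optimizations: offload submodules to the CPU and decode latents in slices/tiles.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

prompt = "A panda playing a guitar in a bamboo forest"  # illustrative prompt
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```

Note that `enable_sequential_cpu_offload()` manages device placement itself, so the pipeline should not also be moved to the GPU with `pipe.to("cuda")`.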

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference; INT8 is meant to let the model run on lower-memory GPUs with minimal loss in video quality, at the cost of significantly lower inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend running inference in the precision the model was trained in.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX, which allows the model to run on a free T4 Colab or on GPUs with less memory (see the sketch after this list). TorchAO quantization is also fully compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision requires an NVIDIA H100 or newer device and a source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. CUDA 12.4 is recommended.
+ The inference speed tests also used the memory optimization scheme above; without it, inference is about 10% faster. Only the `diffusers` versions of the models support quantization.
+ The model only accepts English input; prompts in other languages should be translated into English, for example with a large language model, before use.
+ The memory usage of fine-tuning was tested in an `8 * H100` environment, where the training script automatically uses `ZeRO-2` optimization. If a specific number of GPUs is listed in the table, that many GPUs or more must be used for fine-tuning.
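
To illustrate the quantization point above, here is a sketch that uses `torchao` to quantize the transformer to INT8 before assembling the pipeline. It assumes the `THUDM/CogVideoX-5b` checkpoint and uses the `torchao.quantization` helpers `quantize_` and `int8_weight_only`:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import int8_weight_only, quantize_

model_id = "THUDM/CogVideoX-5b"

# Quantize the transformer weights to INT8 in place with torchao.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Build the pipeline around the quantized transformer and keep the other memory optimizations.
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

The text encoder and VAE can be quantized the same way, and `pipe.transformer = torch.compile(pipe.transformer)` can be applied afterwards to recover some of the speed lost to quantization.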

| **Attribute** | **CogVideoX-2B** | **CogVideoX-5B** |
| --- | --- | --- |
| **Model Name** | CogVideoX-2B | CogVideoX-5B |
| **Inference Precision** | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| **Single-GPU Inference VRAM** | FP16: 12.5 GB* (diffusers), INT8: 7.8 GB* (diffusers with torchao) | BF16: 20.7 GB* (diffusers), INT8: 11.4 GB* (diffusers with torchao) |
| **Multi-GPU Inference VRAM** | FP16: 10 GB* (diffusers) | BF16: 15 GB* (diffusers) |
| **Inference Speed** | Single A100: ~90 seconds, single H100: ~45 seconds | Single A100: ~180 seconds, single H100: ~90 seconds |
| **Fine-tuning Precision** | FP16 | BF16 |
| **Fine-tuning VRAM Consumption** | 47 GB (bs=1, LoRA), 61 GB (bs=2, LoRA), 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA), 80 GB (bs=2, LoRA), 75 GB (bs=1, SFT) |

docs/source/en/using-diffusers/cogvideox.md

Lines changed: 3 additions & 36 deletions
@@ -12,6 +12,8 @@ specific language governing permissions and limitations under the License.
# CogVideoX

CogVideoX is an open-source version of the video generation model originating from QingYing. The table below displays the list of video generation models we currently offer, along with their foundational information.

The model integrates text, time, and space into a single transformer architecture. Instead of the traditional cross-attention module it uses full attention, and an expert block is used to align the two different modalities, text and video.

## Load model checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.
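
For example, loading a checkpoint from the Hub might look like this minimal sketch, assuming the `THUDM/CogVideoX-2b` checkpoint (which was trained in FP16):

```py
import torch
from diffusers import CogVideoXPipeline

# from_pretrained pulls the subfolder components (transformer, VAE, text encoder) together.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
```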
@@ -71,9 +73,7 @@ export_to_video(video, "output.mp4", fps=8)
## Image-to-Video

- The are two variants of this model,[THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V). The CogVideoX checkpoint is trained to generate 60 frames
- You'll use the CogVideoX-5b-I2V checkpoint for this guide.
+ You'll use the [THUDM/CogVideoX-5b-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) checkpoint for this guide.

```py
import torch
@@ -114,36 +114,3 @@ export_to_video(video, "output.mp4", fps=8)
</div>
</div>


## Reduce memory usage

When testing with the diffusers library, all memory optimizations available in the library were enabled. This scheme has not been tested for actual memory usage on devices other than **NVIDIA A100 / H100** GPUs, but it should generally work on all devices with the **NVIDIA Ampere architecture** and above. Disabling the optimizations roughly triples peak memory usage compared to the values in the table, while increasing speed by about 3-4x. You can selectively disable some of the optimizations, including:

```py
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference; INT8 is meant to let the model run on lower-memory GPUs with minimal loss in video quality, at the cost of significantly lower inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision. We recommend running inference in the precision the model was trained in.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX, which allows the model to run on a free T4 Colab or on GPUs with less memory. TorchAO quantization is also fully compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision requires an NVIDIA H100 or newer device and a source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. CUDA 12.4 is recommended.
+ The inference speed tests also used the memory optimization scheme above; without it, inference is about 10% faster. Only the `diffusers` versions of the models support quantization.
+ The model only accepts English input; prompts in other languages should be translated into English, for example with a large language model, before use.
+ The memory usage of fine-tuning was tested in an `8 * H100` environment, where the training script automatically uses `ZeRO-2` optimization. If a specific number of GPUs is listed in the table, that many GPUs or more must be used for fine-tuning.

0 commit comments