
Commit cc1d2e7

Multi-GPU parallel encoding support for training videos. (#6)
* Multi-GPU parallel encoding support for training videos.
* revert
* make style
* update

Co-authored-by: Aryan <[email protected]>
1 parent 3a519f5 commit cc1d2e7

File tree

5 files changed (+236 −65 lines)


.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -3,6 +3,9 @@ __pycache__/
 *.py[cod]
 *$py.class
 
+# JetBrains
+.idea
+
 # C extensions
 *.so
 
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -402,4 +402,4 @@ With `train_batch_size = 4`:
 - [x] Make 5B lora finetuning work in under 24GB
 
 > [!IMPORTANT]
-> Since our goal is to make the scripts as memory-friendly as possible we don't guarantee multi-GPU training.
+> Since our goal is to make the scripts as memory-friendly as possible we don't guarantee multi-GPU training.
```

README_zh.md

Lines changed: 95 additions & 0 deletions
# CogVideoX Factory

## Introduction

This is a repository for fine-tuning CogVideoX.

## Dataset Preparation

Create two files: one containing prompts separated by newlines, and another containing paths to video data separated by newlines (the paths to the video files must be relative to the path you pass as `--data_root`). Let's walk through an example to understand this better!

Suppose you set `--data_root` to `/dataset`, and that this directory contains the files `prompts.txt` and `videos.txt`.

The `prompts.txt` file should contain prompts separated by newlines:

```
A black and white animated sequence featuring a rabbit named Rabbity Ribfried and an anthropomorphic goat, showcasing their evolving interaction in a musical, playful environment.
A black and white animated sequence set on a ship's deck, featuring a bulldog character named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to tense and distressed, displaying a range of emotions as it overcomes challenges. The ship's interior remains static in the background, with simple details such as a bell and an open door. The character's dynamic movements and changing expressions drive the story, with no camera movement, keeping the viewer focused on its evolving reactions and physical gestures.
...
```

The `videos.txt` file should contain paths to video files separated by newlines. Note that the paths should be relative to the `--data_root` directory.

```bash
videos/00000.mp4
videos/00001.mp4
...
```
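If the videos already live under a `videos/` directory, `videos.txt` can be generated instead of written by hand. A minimal sketch, assuming the layout above and `.mp4` files:

```shell
# From the dataset root: list every .mp4 under videos/ as a relative path,
# sorted so the ordering is stable across runs.
find videos -name '*.mp4' | sort > videos.txt
```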
Overall, your dataset should look like this if you run the `tree` command at the dataset root:

```bash
/dataset
├── prompts.txt
├── videos.txt
├── videos
    ├── videos/00000.mp4
    ├── videos/00001.mp4
    ├── ...
```
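Before launching a run, it can be worth sanity-checking the two files against each other. A minimal sketch (not a script shipped with the repo; the file names follow the example above):

```python
from pathlib import Path

def check_dataset(data_root: str) -> None:
    root = Path(data_root)
    prompts = (root / "prompts.txt").read_text().splitlines()
    videos = (root / "videos.txt").read_text().splitlines()
    # Each prompt must pair with exactly one video path.
    assert len(prompts) == len(videos), (
        f"{len(prompts)} prompts but {len(videos)} video paths"
    )
    # Every listed video must exist relative to --data_root.
    missing = [v for v in videos if not (root / v).is_file()]
    assert not missing, f"missing video files: {missing}"
```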
When using this format, `--caption_column` must be `prompts.txt` and `--video_column` must be `videos.txt`. If your data is instead stored in a CSV file, you can also set `--dataset_file` to the path of the CSV, with `--caption_column` and `--video_column` set to the actual column names in the CSV file.
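For the CSV route, a minimal sketch of what such a file could look like; the file name `dataset.csv` and the column names `caption` and `video` are hypothetical (a run would then pass `--caption_column caption --video_column video`):

```python
import csv

# Hypothetical layout for illustration; use whatever column names your CSV has.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["caption", "video"])
    writer.writeheader()
    writer.writerow({
        "caption": "A black and white animated sequence...",
        "video": "videos/00000.mp4",
    })
```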
As an example, let's use this [Disney dataset](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) for fine-tuning. To download it, you can use the 🤗 Hugging Face CLI.

```bash
huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir video-dataset-disney
```
## Training

TODO

Take a look at `training/*.sh`.

Note: not tested on MPS.
## 内存需求
58+
59+
训练支持并验证的内存优化包括:
60+
61+
- 来自 [TorchAO](https://github.com/pytorch/ao)`CPUOffloadOptimizer`
62+
- 来自 [bitsandbytes](https://huggingface.co/docs/bitsandbytes/optimizers) 的低位优化器。
63+
64+
### LoRA 微调
65+
66+
<details>
67+
<summary> AdamW </summary>
68+
69+
With `train_batch_size = 1`:
70+
71+
| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
72+
|:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
73+
| THUDM/CogVideoX-2b | 16 | False | 12.945 | 43.764 | 46.918 | 24.234 |
74+
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 12.945 | 21.121 | 24.234 |
75+
| THUDM/CogVideoX-2b | 64 | False | 13.035 | 44.314 | 47.469 | 24.469 |
76+
| THUDM/CogVideoX-2b | 64 | True | 13.036 | 13.035 | 21.564 | 24.500 |
77+
| THUDM/CogVideoX-2b | 256 | False | 13.095 | 45.826 | 48.990 | 25.543 |
78+
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 13.095 | 22.344 | 25.537 |
79+
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 19.742 | 28.746 | 38.123 |
80+
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 20.818 | 30.338 | 38.738 |
81+
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 22.119 | 31.939 | 41.537 |
82+
83+
With `train_batch_size = 4`:
84+
85+
| model | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
86+
|:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
87+
| THUDM/CogVideoX-2b | 16 | True | 12.945 | 21.803 | 21.814 | 24.322 |
88+
| THUDM/CogVideoX-2b | 64 | True | 13.035 | 22.254 | 22.254 | 24.572 |
89+
| THUDM/CogVideoX-2b | 256 | True | 13.094 | 22.020 | 22.033 | 25.574 |
90+
| THUDM/CogVideoX-5b | 16 | True | 19.742 | 46.492 | 46.492 | 38.197 |
91+
| THUDM/CogVideoX-5b | 64 | True | 20.006 | 47.805 | 47.805 | 39.365 |
92+
| THUDM/CogVideoX-5b | 256 | True | 20.771 | 47.268 | 47.332 | 41.008 |
93+
94+
> [!NOTE]
95+
>

prepare_dataset.sh

Lines changed: 18 additions & 15 deletions
```diff
@@ -2,11 +2,13 @@
 
 MODEL_ID="THUDM/CogVideoX-2b"
 
+NUM_GPUS=8
+
 # For more details on the expected data format, please refer to the README.
-DATA_ROOT="/raid/aryan/video-dataset-tom-and-jerry" # This needs to be the path to the base directory where your videos are located.
+DATA_ROOT="/path/to/my/datasets/video-dataset" # This needs to be the path to the base directory where your videos are located.
 CAPTION_COLUMN="prompts.txt"
 VIDEO_COLUMN="videos.txt"
-OUTPUT_DIR="/raid/aryan/video-dataset-tom-and-jerry-encoded"
+OUTPUT_DIR="/path/to/my/datasets/preprocessed-dataset"
 HEIGHT=480
 WIDTH=720
 MAX_NUM_FRAMES=49
@@ -17,19 +19,20 @@ DTYPE=fp32
 
 # To create a folder-style dataset structure without pre-encoding videos and captions'
 CMD_WITHOUT_PRE_ENCODING="\
-python3 training/prepare_dataset.py \
-  --model_id $MODEL_ID \
-  --data_root $DATA_ROOT \
-  --caption_column $CAPTION_COLUMN \
-  --video_column $VIDEO_COLUMN \
-  --output_dir $OUTPUT_DIR \
-  --height $HEIGHT \
-  --width $WIDTH \
-  --max_num_frames $MAX_NUM_FRAMES \
-  --max_sequence_length $MAX_SEQUENCE_LENGTH \
-  --target_fps $TARGET_FPS \
-  --batch_size $BATCH_SIZE \
-  --dtype $DTYPE
+torchrun --nproc_per_node=$NUM_GPUS \
+  training/prepare_dataset.py \
+  --model_id $MODEL_ID \
+  --data_root $DATA_ROOT \
+  --caption_column $CAPTION_COLUMN \
+  --video_column $VIDEO_COLUMN \
+  --output_dir $OUTPUT_DIR \
+  --height $HEIGHT \
+  --width $WIDTH \
+  --max_num_frames $MAX_NUM_FRAMES \
+  --max_sequence_length $MAX_SEQUENCE_LENGTH \
+  --target_fps $TARGET_FPS \
+  --batch_size $BATCH_SIZE \
+  --dtype $DTYPE
 "
 
 CMD_WITH_PRE_ENCODING="$CMD_WITHOUT_PRE_ENCODING --save_tensors"
```
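This change hard-codes `NUM_GPUS=8`. One way to derive it at runtime instead, sketched under the assumption that `nvidia-smi` is on the PATH when NVIDIA GPUs are present, with a single-process fallback otherwise:

```shell
# Count visible NVIDIA GPUs if nvidia-smi is available; otherwise fall back to 1.
if command -v nvidia-smi > /dev/null 2>&1; then
  NUM_GPUS=$(nvidia-smi -L | wc -l)
else
  NUM_GPUS=1
fi
echo "launching with --nproc_per_node=$NUM_GPUS"
```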
