Merge pull request #19 from a-r-r-o-w/zR-dev

zRzRzRzRzRzRzR · web-flow · commit e50cb9c4abae · 2024-10-10T22:15:48.000+08:00
Draft of Chinese README
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yaml b/.github/ISSUE_TEMPLATE/bug_report.yaml
@@ -0,0 +1,51 @@
+name: "\U0001F41B Bug Report"
+description: Submit a bug report to help us improve CogVideoX-Factory / 提交一个 Bug 问题报告来帮助我们改进 CogVideoX-Factory 开源框架
+body:
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info / 系统信息
+      description: Your operating environment / 您的运行环境信息
+      placeholder: Includes Cuda version, Diffusers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本，Diffusers，Python版本，操作系统，硬件信息(如果您怀疑是硬件方面的问题)...
+    validations:
+      required: true
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information / 问题信息
+      description: 'The problem arises when using: / 问题出现在'
+      options:
+        - label: "The official example scripts / 官方的示例脚本"
+        - label: "My own modified scripts / 我自己修改的脚本和任务"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction / 复现过程
+      description: |
+        Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
+        If you have code snippets, error messages, stack traces, please provide them here as well.
+        Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+        Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
+        
+        请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
+        如果您有代码片段、错误信息、堆栈跟踪，也请在此提供。
+        请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+        请勿使用截图，因为截图难以阅读，而且（更重要的是）不允许他人复制粘贴您的代码。
+      placeholder: |
+        Steps to reproduce the behavior/复现Bug的步骤:
+          
+          1.
+          2.
+          3.
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior / 期待表现
+      description: "A clear and concise description of what you would expect to happen. / 简单描述您期望发生的事情。"
diff --git a/.github/ISSUE_TEMPLATE/feature-request.yaml b/.github/ISSUE_TEMPLATE/feature-request.yaml
@@ -0,0 +1,34 @@
+name: "\U0001F680 Feature request"
+description: Submit a request for a new CogVideoX-Factory feature / 提交一个新的 CogVideoX-Factory 开源项目的功能建议
+labels: [ "feature" ]
+body:
+  - type: textarea
+    id: feature-request
+    validations:
+      required: true
+    attributes:
+      label: Feature request / 功能建议
+      description: |
+        A brief description of the functional proposal. Links to corresponding papers and code are desirable.
+        对功能建议的简述。最好提供对应的论文和代码链接。
+
+  - type: textarea
+    id: motivation
+    validations:
+      required: true
+    attributes:
+      label: Motivation / 动机
+      description: |
+        Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
+        您提出建议的动机。如果该动机与另一个 GitHub 问题有关，请在此处提供对应的链接。
+
+  - type: textarea
+    id: contribution
+    validations:
+      required: true
+    attributes:
+      label: Your contribution / 您的贡献
+      description: |
+        
+        Your PR link or any other link you can help with.
+        您的PR链接或者其他您能提供帮助的链接。
diff --git a/README.md b/README.md
@@ -1,5 +1,7 @@
 # CogVideoX Factory 🧪
 
+[中文阅读](./README_zh.md)
+
 Fine-tune Cog family of video models for custom video generation under 24GB of GPU memory ⚡️📼
 
 <table align="center">
diff --git a/README_zh.md b/README_zh.md
@@ -1,32 +1,84 @@
-# CogVideoX Factory
+# CogVideoX Factory 🧪
 
-## 简介
+[Read this in English](./README.md)
 
-这是用于 CogVideoX 微调的仓库。
+在 24GB GPU 内存下微调 Cog 系列视频模型以生成自定义视频 ⚡️📼
+
+<table align="center">
+<tr>
+  <td align="center"><video src="https://github.com/user-attachments/assets/aad07161-87cb-4784-9e6b-16d06581e3e5">Your browser does not support the video tag.</video></td>
+</tr>
+</table>
+
+
+## 快速开始
+
+克隆此仓库并确保已安装所有依赖：`pip install -r requirements.txt`。
+
+然后下载数据集：
+
+```bash
+# 安装 `huggingface_hub`
+huggingface-cli download   --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset   --local-dir video-dataset-disney
+```
+
+然后启动文本到视频的 LoRA 微调（根据您的需求修改不同的超参数、数据集根目录和其他配置选项）：
+
+```bash
+# 对 CogVideoX 文本到视频模型进行 LoRA 微调
+./train_text_to_video_lora.sh
+
+# 对 CogVideoX 文本到视频模型进行全微调
+./train_text_to_video_sft.sh
+
+# 对 CogVideoX 图像到视频模型进行 LoRA 微调
+./train_image_to_video_lora.sh
+```
+
+假设您的 LoRA 已保存并推送到 HF Hub，并命名为 `my-awesome-name/my-awesome-lora`，我们现在可以使用微调后的模型进行推理：
+
+```diff
+import torch
+from diffusers import CogVideoXPipeline
+from diffusers.utils import export_to_video
+
+pipe = CogVideoXPipeline.from_pretrained(
+    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
+).to("cuda")
++ pipe.load_lora_weights("my-awesome-name/my-awesome-lora", adapter_name="cogvideox-lora")
++ pipe.set_adapters(["cogvideox-lora"], [1.0])
+
+video = pipe("<my-awesome-prompt>").frames[0]
+export_to_video(video, "output.mp4", fps=8)
+```
+
+**注意：** 对于图像到视频的微调，您必须从 [此](https://github.com/huggingface/diffusers/pull/9482) 分支安装 diffusers（该分支添加了 CogVideoX 图像到视频的 LoRA 加载支持），直到它被合并。
+
+在下方的部分中，我们提供了在本仓库中探索的更多选项的详细信息。它们都试图通过尽可能减少内存需求，使视频模型的微调变得尽可能容易。
 
 ## 数据集准备
 
-创建两个文件，一个文件包含以换行符分隔的提示词，另一个文件包含以换行符分隔的视频数据路径（视频文件的路径必须相对于您在指定 `--data_root` 时传递的路径）。让我们通过一个例子来更好地理解这一点！
+创建两个文件，一个文件包含逐行分隔的提示，另一个文件包含逐行分隔的视频数据路径（视频文件的路径必须相对于您在指定 `--data_root` 时传递的路径）。让我们通过一个示例来更好地理解这一点！
 
-假设您将 `--data_root` 指定为 `/dataset`，并且该目录包含文件：`prompts.txt` 和 `videos.txt`。
+假设您指定的 `--data_root` 为 `/dataset`，并且该目录包含以下文件：`prompts.txt` 和 `videos.txt`。
 
-`prompts.txt` 文件应包含以换行符分隔的提示词：
+`prompts.txt` 文件应包含逐行分隔的提示：
 
 ```
-一段黑白动画序列，主角是一只名为 Rabbity Ribfried 的兔子和一只拟人化的山羊，在一个充满音乐和趣味的环境中，展示他们不断发展的互动。
-一段黑白动画序列，场景在船甲板上，主角是一只名为 Bully Bulldoger 的斗牛犬角色，展示了夸张的面部表情和肢体语言。角色从自信到专注，再到紧张和痛苦，展示了一系列情绪，随着它克服挑战。船的内部在背景中保持静止，只有简单的细节，如钟声和开着的门。角色的动态动作和变化的表情推动了故事的发展，没有镜头移动，确保观众专注于其不断变化的反应和肢体动作。
+一段黑白动画序列，主角是一只名为 Rabbity Ribfried 的兔子和一只拟人化的山羊，展示了它们在音乐与游戏环境中的互动演变。
+一段黑白动画序列，发生在船甲板上，主角是一只名为 Bully Bulldoger 的斗牛犬，展现了夸张的面部表情和肢体语言。角色从自信、专注逐渐转变为紧张与痛苦，展示了随着挑战出现的情感变化。船的内部在背景中保持静止，只有一些简单的细节，如钟声和敞开的门。角色的动态动作和不断变化的表情推动了叙事，没有摄像机运动来分散注意力。
 ...
 ```
 
-`videos.txt` 文件应包含以换行符分隔的视频文件路径。请注意，路径应相对于 `--data_root` 目录。
+`videos.txt` 文件应包含逐行分隔的视频文件路径。请注意，路径应相对于 `--data_root` 目录。
 
 ```bash
 videos/00000.mp4
 videos/00001.mp4
 ...
 ```
 
-总体而言，如果您在数据集根目录运行 `tree` 命令，您的数据集应如下所示：
+整体而言，如果在数据集根目录运行 `tree` 命令，您的数据集应如下所示：
 
 ```bash
 /dataset
@@ -38,58 +90,52 @@ videos/00001.mp4
     ├── ...
 ```
 
-使用此格式时，`--caption_column` 必须是 `prompts.txt`，`--video_column` 必须是 `videos.txt`。如果您的数据存储在 CSV 文件中，您也可以指定 `--dataset_file` 为 CSV 的路径，`--caption_column` 和 `--video_column` 为 CSV 文件中的实际列名。
+使用此格式时，`--caption_column` 必须是 `prompts.txt`，`--video_column` 必须是 `videos.txt`。如果您将数据存储在 CSV 文件中，还可以指定 `--dataset_file` 为 CSV 的路径，`--caption_column` 和 `--video_column` 为 CSV 文件中的实际列名。
 
-例如，让我们使用这个 [Disney 数据集](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) 进行微调。要下载，可以使用 🤗 Hugging Face CLI。
+例如，让我们使用[这个](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) Disney 数据集进行微调。要下载，您可以使用 🤗 Hugging Face CLI。
 
 ```bash
 huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir video-dataset-disney
 ```
 
+TODO：添加一个关于创建和使用预计算嵌入的部分。
+
 ## 训练
 
-TODO
+我们提供了与 [Cog 系列模型](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) 兼容的文本到视频和图像到视频生成的训练脚本。
 
-请查看 `training/*.sh`
+查看 `*.sh` 文件。
 
-注意：未在 MPS 上测试
+注意：本代码未在 MPS 上测试，建议在 Linux 环境下使用 CUDA 进行测试。
 
 ## 内存需求
 
-训练支持并验证的内存优化包括：
+<table align="center">
+<tr>
+  <td align="center"><a href="https://www.youtube.com/watch?v=UvRl4ansfCg"> 使用 PyTorch 消除 OOM</a></td>
+</tr>
+<tr>
+  <td align="center"><img src="assets/slaying-ooms.png" style="width: 480px; height: 480px;"></td>
+</tr>
+</table>
 
-- 来自 [TorchAO](https://github.com/pytorch/ao) 的 `CPUOffloadOptimizer`。
-- 来自 [bitsandbytes](https://huggingface.co/docs/bitsandbytes/optimizers) 的低位优化器。
+支持和验证的内存优化训练选项包括：
 
-### LoRA 微调
+- [`torchao`](https://github.com/pytorch/ao) 中的 `CPUOffloadOptimizer`。您可以阅读它的能力和限制 [此处](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload)。简而言之，它允许您使用 CPU 存储可训练的参数和梯度。这导致优化器步骤在 CPU 上进行，需要一个快速的 CPU 优化器，例如 `torch.optim.AdamW(fused=True)` 或在优化器步骤上应用 `torch.compile`。此外，建议不要将模型编译用于训练。梯度裁剪和积累尚不支持。
+- [`bitsandbytes`](https://huggingface.co/docs/bitsandbytes/optimizers) 中的低位优化器。
+  - TODO：测试并使 [`torchao`](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim) 工作
+- DeepSpeed Zero2：由于我们依赖 `accelerate`，请按照[本指南](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed) 配置 `accelerate` 以启用 DeepSpeed Zero2 优化。
+
+> [!IMPORTANT]
+> 内存需求是在运行 `training/prepare_dataset.py` 后报告的，它将视频和字幕转换为潜变量和嵌入。在训练过程中，我们直接加载潜变量和嵌入，而不需要 VAE 或 T5 文本编码器。但是，如果您执行验证/测试，则必须加载这些内容，并增加所需的内存量。不执行验证/测试可以节省大量内存，对于使用较小 VRAM 的 GPU，这可以用于专注于训练。
+>
+> 如果您选择运行验证/测试，可以通过指定 `--enable_model_cpu_offload` 在较低 VRAM 的 GPU 上节省一些内存。
 
-<details>
-<summary> AdamW </summary>
-
-With `train_batch_size = 1`:
-
-|       model        | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
-|:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
-| THUDM/CogVideoX-2b |    16     |          False         |         12.945         |          43.764          |         46.918          |       24.234         |
-| THUDM/CogVideoX-2b |    16     |          True          |         12.945         |          12.945          |         21.121          |       24.234         |
-| THUDM/CogVideoX-2b |    64     |          False         |         13.035         |          44.314          |         47.469          |       24.469         |
-| THUDM/CogVideoX-2b |    64     |          True          |         13.036         |          13.035          |         21.564          |       24.500         |
-| THUDM/CogVideoX-2b |    256    |          False         |         13.095         |          45.826          |         48.990          |       25.543         |
-| THUDM/CogVideoX-2b |    256    |          True          |         13.094         |          13.095          |         22.344          |       25.537         |
-| THUDM/CogVideoX-5b |    16     |          True          |         19.742         |          19.742          |         28.746          |       38.123         |
-| THUDM/CogVideoX-5b |    64     |          True          |         20.006         |          20.818          |         30.338          |       38.738         |
-| THUDM/CogVideoX-5b |    256    |          True          |         20.771         |          22.119          |         31.939          |       41.537         |
-
-With `train_batch_size = 4`:
-
-|       model        | lora rank | gradient_checkpointing | memory_before_training | memory_before_validation | memory_after_validation | memory_after_testing |
-|:------------------:|:---------:|:----------------------:|:----------------------:|:------------------------:|:-----------------------:|:--------------------:|
-| THUDM/CogVideoX-2b |    16     |          True          |         12.945         |          21.803          |         21.814          |       24.322         |
-| THUDM/CogVideoX-2b |    64     |          True          |         13.035         |          22.254          |         22.254          |       24.572         |
-| THUDM/CogVideoX-2b |    256    |          True          |         13.094         |          22.020          |         22.033          |       25.574         |
-| THUDM/CogVideoX-5b |    16     |          True          |         19.742         |          46.492          |         46.492          |       38.197         |
-| THUDM/CogVideoX-5b |    64     |          True          |         20.006         |          47.805          |         47.805          |       39.365         |
-| THUDM/CogVideoX-5b |    256    |          True          |         20.771         |          47.268          |         47.332          |       41.008         |
+### LoRA 微调
 
 > [!NOTE]
-> 
+> 图像到视频 LoRA 微调的内存需求与 `THUDM/CogVideoX-5b` 上的文本到视频类似，因此未明确报告。
+>
+> I2V 训练会使用视频的第一帧进行微调。要为 I2V 微调准备测试图像，您可以通过修改脚本动态生成它们，或使用以下命令从您的训练数据中提取一些帧：
+> `ffmpeg -i input.mp4 -frames:v 1 frame.png`，
+> 或提供一个有效且可访问的图像 URL。
diff --git a/assets/contribute.md b/assets/contribute.md
@@ -0,0 +1,3 @@
+# 欢迎你们的贡献
+
+本项目属于非常初级的阶段
diff --git a/requirements.txt b/requirements.txt
@@ -1,17 +1,16 @@
-accelerate
-bitsandbytes
-diffusers
-transformers
-huggingface_hub
-hf_transfer
-peft
-decord
-wandb
-pandas
-torch
-torchvision
-torchao
-sentencepiece
-imageio-ffmpeg
-imageio
-numpy==2.1.1
+accelerate>=1.0.0
+bitsandbytes>=0.44.1
+diffusers>=0.30.4
+transformers>=4.45.2
+huggingface_hub>=0.25.2
+hf_transfer>=0.1.8
+peft>=0.13.1
+decord>=0.6.0
+wandb>=0.18.3
+pandas>=2.2.3
+torch>=2.4.0
+torchvision>=0.19.0
+torchao>=0.5.0
+sentencepiece>=0.2.0
+imageio-ffmpeg>=0.5.1
+numpy>=2.1.2
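
The dataset layout described in the `README_zh.md` hunk (a `prompts.txt` and a `videos.txt` under `--data_root`, one entry per line, with video paths relative to the root) can be sanity-checked before training. The following is a minimal sketch, not part of this PR or of the repository: the function name `check_dataset` and its messages are hypothetical, and it assumes the two files are UTF-8 text whose non-empty lines pair up in order.

```python
from pathlib import Path


def check_dataset(
    data_root: str,
    caption_column: str = "prompts.txt",
    video_column: str = "videos.txt",
) -> list[str]:
    """Return a list of problems found in a prompts/videos dataset layout.

    Hypothetical helper: mirrors the layout described in the README, where
    both files live directly under the data root.
    """
    root = Path(data_root)
    problems: list[str] = []

    # Read both files, ignoring blank lines.
    prompts = [
        line
        for line in (root / caption_column).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    videos = [
        line.strip()
        for line in (root / video_column).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

    # Each prompt must correspond to exactly one video path.
    if len(prompts) != len(videos):
        problems.append(f"{len(prompts)} prompts but {len(videos)} video paths")

    # Video paths must be relative to the data root and must exist.
    for rel in videos:
        if Path(rel).is_absolute():
            problems.append(f"path is not relative to data root: {rel}")
        elif not (root / rel).is_file():
            problems.append(f"missing video file: {rel}")

    return problems
```

Calling `check_dataset("video-dataset-disney")` on the downloaded dataset should return an empty list if the layout matches the description; any returned strings name the mismatch or the missing file.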
