
Commit f2a1626

improve dataset preparation (#43)
* update * update * update * update
1 parent 8c12f34 commit f2a1626

8 files changed: +129 −75 lines changed

assets/dataset.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -18,7 +18,7 @@ The framework supports resolutions and frame counts that meet the following cond
 - Any resolution as long as it is divisible by 32. For example, `720 * 480`, `1920 * 1020`, etc.
 
 - **Supported Frame Counts (Frames)**:
-  - Must satisfy (4K + 1), i.e., multiples of 4 such as 16, 24, 32, 48, 64, 80.
+  - Must be `4 * k` or `4 * k + 1` (example: 16, 32, 49, 81)
 
 It is recommended to place all videos in a single folder.
@@ -58,4 +58,4 @@ huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-D
 
 This dataset has been prepared in the expected format and can be used directly. However, directly using the video dataset may cause Out of Memory (OOM) issues on GPUs with smaller VRAM because it requires loading the [VAE](https://huggingface.co/THUDM/CogVideoX-5b/tree/main/vae) (which encodes videos into latent space) and the large [T5-XXL](https://huggingface.co/google/t5-v1_1-xxl/) text encoder. To reduce memory usage, you can use the `training/prepare_dataset.py` script to precompute latents and embeddings.
 
-Fill or modify the parameters in `prepare_dataset.sh` and execute it to get precomputed latents and embeddings (make sure to specify `--save_tensors` to save the precomputed artifacts). When using these artifacts during training, ensure that you specify the `--load_tensors` flag, or else the videos will be used directly, requiring the text encoder and VAE to be loaded. The script also supports PyTorch DDP so that large datasets can be encoded in parallel across multiple GPUs (modify the `NUM_GPUS` parameter).
+Fill or modify the parameters in `prepare_dataset.sh` and execute it to get precomputed latents and embeddings (make sure to specify `--save_latents_and_embeddings` to save the precomputed artifacts). If preparing for image-to-video training, make sure to pass `--save_image_latents`, which encodes and saves image latents along with videos. When using these artifacts during training, ensure that you specify the `--load_tensors` flag, or else the videos will be used directly, requiring the text encoder and VAE to be loaded. The script also supports PyTorch DDP so that large datasets can be encoded in parallel across multiple GPUs (modify the `NUM_GPUS` parameter).
```
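To make the new frame-count constraint concrete, here is a minimal sketch of a validity check (not part of this commit; the helper name is hypothetical):

```python
def is_supported_frame_count(num_frames: int) -> bool:
    # Frame counts must be 4 * k or 4 * k + 1 for some integer k,
    # e.g. 16 (4 * 4), 32 (4 * 8), 49 (4 * 12 + 1), 81 (4 * 20 + 1).
    return num_frames % 4 in (0, 1)

assert is_supported_frame_count(49)      # 4 * 12 + 1 -> supported
assert not is_supported_frame_count(18)  # 4 * 4 + 2 -> not supported
```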

assets/dataset_zh.md

Lines changed: 4 additions & 3 deletions
```diff
@@ -18,7 +18,7 @@ A black and white animated sequence on a ship’s deck features a bulldog charac
 - Any resolution, as long as it is divisible by 32. For example, `720 * 480`, `1920 * 1020`, etc.
 
 - **Supported Frame Counts (Frames)**:
-  - Must satisfy (4K + 1), i.e., multiples of 4, such as 16, 24, 32, 48, 64, 80.
+  - Must be `4 * k` or `4 * k + 1` (example: 16, 32, 49, 81)
 
 It is recommended to place all videos in a single folder.
@@ -66,6 +66,7 @@ OOM (out of memory), because it requires loading the [VAE](https://huggingface.co/THUDM
 
 text encoder. To reduce memory requirements, you can use the `training/prepare_dataset.py` script to precompute latents and embeddings.
 
-Fill in or modify the parameters in `prepare_dataset.sh` and execute it to obtain the precomputed latents and embeddings (make sure to specify `--save_tensors`
-to save the precomputed artifacts). When using these artifacts during training, make sure to specify the `--load_tensors` flag; otherwise the videos will be used directly, requiring the text encoder and
+Fill in or modify the parameters in `prepare_dataset.sh` and execute it to obtain the precomputed latents and embeddings (make sure to specify `--save_latents_and_embeddings`
+to save the precomputed artifacts). If preparing for image-to-video training, make sure to pass `--save_image_latents`, which encodes image latents and saves them along with the videos.
+When using these artifacts during training, make sure to specify the `--load_tensors` flag; otherwise the videos will be used directly, requiring the text encoder and
 VAE to be loaded. The script also supports PyTorch DDP so that large datasets can be encoded in parallel across multiple GPUs (modify the `NUM_GPUS` parameter).
```

prepare_dataset.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@ CMD_WITHOUT_PRE_ENCODING="\
   --dtype $DTYPE
 "
 
-CMD_WITH_PRE_ENCODING="$CMD_WITHOUT_PRE_ENCODING --save_tensors"
+CMD_WITH_PRE_ENCODING="$CMD_WITHOUT_PRE_ENCODING --save_latents_and_embeddings"
 
 # Select which you'd like to run
 CMD=$CMD_WITH_PRE_ENCODING
```
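On the shell side the change is just the renamed flag appended to the command string. As a hedged sketch only (the actual definitions in `training/prepare_dataset.py` are not shown in this diff), flags like these are typically declared as boolean switches:

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed store_true switches; the flag names come from this commit, but
# their exact declarations in prepare_dataset.py may differ.
parser.add_argument("--save_latents_and_embeddings", action="store_true",
                    help="Precompute and save video latents and text embeddings.")
parser.add_argument("--save_image_latents", action="store_true",
                    help="Also save image latents for image-to-video training.")

args = parser.parse_args(["--save_latents_and_embeddings"])
assert args.save_latents_and_embeddings and not args.save_image_latents
```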

training/cogvideox_image_to_video_lora.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -26,7 +26,6 @@
 import diffusers
 import torch
 import transformers
-import wandb
 from accelerate import Accelerator, DistributedType
 from accelerate.logging import get_logger
 from accelerate.utils import (
@@ -53,6 +52,8 @@
 from tqdm.auto import tqdm
 from transformers import AutoTokenizer, T5EncoderModel
 
+import wandb
+
 
 from args import get_args  # isort:skip
 from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop  # isort:skip
@@ -523,7 +524,7 @@ def load_model_hook(models, input_dir):
 
     # Scheduler and math around the number of training steps.
    overrode_max_train_steps = False
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if args.max_train_steps is None:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
         overrode_max_train_steps = True
@@ -560,7 +561,7 @@ def load_model_hook(models, input_dir):
     )
 
     # We need to recalculate our total training steps as the size of the training dataloader may have changed.
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if overrode_max_train_steps:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
     # Afterwards we recalculate our number of training epochs
@@ -582,6 +583,7 @@ def load_model_hook(models, input_dir):
     accelerator.print("***** Running training *****")
     accelerator.print(f"  Num trainable parameters = {num_trainable_parameters}")
     accelerator.print(f"  Num examples = {len(train_dataset)}")
+    accelerator.print(f"  Num batches each epoch = {len(train_dataloader)}")
     accelerator.print(f"  Num epochs = {args.num_train_epochs}")
     accelerator.print(f"  Instantaneous batch size per device = {args.train_batch_size}")
     accelerator.print(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
```

training/cogvideox_text_to_video_lora.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -25,7 +25,6 @@
 import diffusers
 import torch
 import transformers
-import wandb
 from accelerate import Accelerator, DistributedType
 from accelerate.logging import get_logger
 from accelerate.utils import (
@@ -52,6 +51,8 @@
 from tqdm.auto import tqdm
 from transformers import AutoTokenizer, T5EncoderModel
 
+import wandb
+
 
 from args import get_args  # isort:skip
 from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop  # isort:skip
@@ -507,7 +508,7 @@ def collate_fn(data):
 
     # Scheduler and math around the number of training steps.
     overrode_max_train_steps = False
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if args.max_train_steps is None:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
         overrode_max_train_steps = True
@@ -544,7 +545,7 @@ def collate_fn(data):
     )
 
     # We need to recalculate our total training steps as the size of the training dataloader may have changed.
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if overrode_max_train_steps:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
     # Afterwards we recalculate our number of training epochs
@@ -566,6 +567,7 @@ def collate_fn(data):
     accelerator.print("***** Running training *****")
     accelerator.print(f"  Num trainable parameters = {num_trainable_parameters}")
     accelerator.print(f"  Num examples = {len(train_dataset)}")
+    accelerator.print(f"  Num batches each epoch = {len(train_dataloader)}")
     accelerator.print(f"  Num epochs = {args.num_train_epochs}")
     accelerator.print(f"  Instantaneous batch size per device = {args.train_batch_size}")
     accelerator.print(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
```

training/cogvideox_text_to_video_sft.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -25,7 +25,6 @@
 import diffusers
 import torch
 import transformers
-import wandb
 from accelerate import Accelerator, DistributedType
 from accelerate.logging import get_logger
 from accelerate.utils import (
@@ -51,6 +50,8 @@
 from tqdm.auto import tqdm
 from transformers import AutoTokenizer, T5EncoderModel
 
+import wandb
+
 
 from args import get_args  # isort:skip
 from dataset import BucketSampler, VideoDatasetWithResizing, VideoDatasetWithResizeAndRectangleCrop  # isort:skip
@@ -471,7 +472,7 @@ def collate_fn(data):
 
     # Scheduler and math around the number of training steps.
     overrode_max_train_steps = False
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if args.max_train_steps is None:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
         overrode_max_train_steps = True
@@ -508,7 +509,7 @@ def collate_fn(data):
     )
 
     # We need to recalculate our total training steps as the size of the training dataloader may have changed.
-    num_update_steps_per_epoch = math.ceil(len(train_dataset) / args.gradient_accumulation_steps)
+    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
     if overrode_max_train_steps:
         args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
     # Afterwards we recalculate our number of training epochs
@@ -530,6 +531,7 @@ def collate_fn(data):
     accelerator.print("***** Running training *****")
     accelerator.print(f"  Num trainable parameters = {num_trainable_parameters}")
     accelerator.print(f"  Num examples = {len(train_dataset)}")
+    accelerator.print(f"  Num batches each epoch = {len(train_dataloader)}")
     accelerator.print(f"  Num epochs = {args.num_train_epochs}")
     accelerator.print(f"  Instantaneous batch size per device = {args.train_batch_size}")
     accelerator.print(f"  Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}")
```

training/dataset.py

Lines changed: 11 additions & 1 deletion
```diff
@@ -375,7 +375,7 @@ class BucketSampler(Sampler):
         be yielded. If set to False, it is guaranteed that all data in the dataset will be processed
         and batches that do not have `batch_size` number of entries will also be yielded.
     """
-    
+
     def __init__(
         self, data_source: VideoDataset, batch_size: int = 8, shuffle: bool = True, drop_last: bool = False
     ) -> None:
@@ -386,6 +386,16 @@ def __init__(
 
         self.buckets = {resolution: [] for resolution in data_source.resolutions}
 
+        self._raised_warning_for_drop_last = False
+
+    def __len__(self):
+        if self.drop_last and not self._raised_warning_for_drop_last:
+            self._raised_warning_for_drop_last = True
+            logger.warning(
+                "Calculating the length for bucket sampler is not possible when `drop_last` is set to True. This may cause problems when setting the number of epochs used for training."
+            )
+        return (len(self.data_source) + self.batch_size - 1) // self.batch_size
+
     def __iter__(self):
         for index, data in enumerate(self.data_source):
             video_metadata = data["video_metadata"]
```
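The added `__len__` reports the ceiling of `len(data_source) / batch_size`, which is exact when `drop_last=False`; with `drop_last=True`, partially filled resolution buckets are discarded, so the true batch count can be smaller, hence the warning. A quick check of the arithmetic, with assumed sizes:

```python
# Ceiling division, as in the new BucketSampler.__len__:
num_videos, batch_size = 103, 8  # assumed sizes for illustration
reported_len = (num_videos + batch_size - 1) // batch_size
print(reported_len)  # 13: twelve full batches plus one partial batch of 7

# With drop_last=True, incomplete per-resolution buckets are dropped at the
# end of iteration, so fewer than `reported_len` batches may be yielded.
```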
