Commit b1b72c0

CogVideoX I2V; CPU offloading; Model README descriptions (#11)
* update
* update
* update readme
* update
* update
* update model desc
* update
* Update training/prepare_dataset.py
1 parent: cc1d2e7

9 files changed: +1181 −33 lines


README.md

Lines changed: 15 additions & 1 deletion
@@ -109,10 +109,17 @@ Supported and verified memory optimizations for training include:
 > [!IMPORTANT]
 > The memory requirements are reported after running the `training/prepare_dataset.py`, which converts the videos and captions to latents and embeddings. During training, we directly load the latents and embeddings, and do not require the VAE or the T5 text encoder. However, if you perform validation/testing, these must be loaded and increase the amount of required memory. Not performing validation/testing saves a significant amount of memory, which can be used to focus solely on training if you're on smaller VRAM GPUs.
 >
-> If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying `--enable_model_cpu_offloading`.
+> If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying `--enable_model_cpu_offload`.
 
 ### LoRA finetuning
 
+> [!NOTE]
+> The memory requirements for image-to-video LoRA finetuning are similar to those of text-to-video on `THUDM/CogVideoX-5b`, so they haven't been reported explicitly.
+>
+> Additionally, to prepare test images for I2V finetuning, you could generate them on-the-fly by modifying the script, extract some frames from your training data using
+> `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
+> or provide a URL to a valid and accessible image.
+
 <details>
 <summary> AdamW </summary>
 
@@ -308,6 +315,13 @@ With `train_batch_size = 4`:
 
 ### Full finetuning
 
+> [!NOTE]
+> The memory requirements for image-to-video full finetuning are similar to those of text-to-video on `THUDM/CogVideoX-5b`, so they haven't been reported explicitly.
+>
+> Additionally, to prepare test images for I2V finetuning, you could generate them on-the-fly by modifying the script, extract some frames from your training data using
+> `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
+> or provide a URL to a valid and accessible image.
+
 > [!NOTE]
 > Trying to run full finetuning without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

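For reference, the renamed `--enable_model_cpu_offload` flag corresponds to the standard diffusers pipeline call used at validation time. A minimal, illustrative sketch (the model ID and dtype below are assumptions, not taken from the training code):

```python
# Illustrative sketch: model-wise CPU offloading on a CogVideoX I2V pipeline.
# Each sub-model (text encoder, transformer, VAE) is moved to the GPU only
# while it is needed, trading speed for a much lower peak VRAM footprint.
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # instead of pipe.to("cuda")
```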
prepare_dataset.sh

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,8 @@ TARGET_FPS=8
 BATCH_SIZE=1
 DTYPE=fp32
 
-# To create a folder-style dataset structure without pre-encoding videos and captions'
+# To create a folder-style dataset structure without pre-encoding videos and captions
+# For Image-to-Video finetuning, make sure to pass `--save_image_latents`
 CMD_WITHOUT_PRE_ENCODING="\
   torchrun --nproc_per_node=$NUM_GPUS \
     training/prepare_dataset.py \

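The new `--save_image_latents` flag is specific to image-to-video preparation. A rough, hypothetical sketch of the idea (the function, names, and shapes below are illustrative, not the repository's actual `prepare_dataset.py` code): encode the first frame of each clip with the CogVideoX VAE and store that latent next to the video latents, so the I2V trainer can condition on it without loading the VAE.

```python
# Hypothetical sketch only -- not the repository's prepare_dataset.py implementation.
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")

@torch.no_grad()
def encode_first_frame(video: torch.Tensor) -> torch.Tensor:
    # video: [B, C, F, H, W] scaled to [-1, 1]; keep only the first frame
    first_frame = video[:, :, :1].to("cuda", dtype=torch.bfloat16)
    latent = vae.encode(first_frame).latent_dist.sample()
    return latent.cpu()  # stored on disk alongside the video latents
```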
train_image_to_video_lora.sh

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
+export TORCHDYNAMO_VERBOSE=1
+export WANDB_MODE="offline"
+export NCCL_P2P_DISABLE=1
+export TORCH_NCCL_ENABLE_MONITORING=0
+
+GPU_IDS="0"
+
+# Training Configurations
+# Experiment with as many hyperparameters as you want!
+LEARNING_RATES=("1e-4" "1e-3")
+LR_SCHEDULES=("cosine_with_restarts")
+OPTIMIZERS=("adamw" "adam")
+MAX_TRAIN_STEPS=("3000")
+
+# Single GPU uncompiled training
+ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"
+
+# Absolute path to where the data is located. Make sure to have read the README for how to prepare data.
+# This example assumes you downloaded an already prepared dataset from HF CLI as follows:
+# huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir /path/to/my/datasets/disney-dataset
+DATA_ROOT="/path/to/my/datasets/disney-dataset"
+CAPTION_COLUMN="prompt.txt"
+VIDEO_COLUMN="videos.txt"
+
+# Launch experiments with different hyperparameters
+for learning_rate in "${LEARNING_RATES[@]}"; do
+  for lr_schedule in "${LR_SCHEDULES[@]}"; do
+    for optimizer in "${OPTIMIZERS[@]}"; do
+      for steps in "${MAX_TRAIN_STEPS[@]}"; do
+        output_dir="/path/to/my/models/cogvideox-lora__optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"
+
+        cmd="accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_image_to_video_lora.py \
+          --pretrained_model_name_or_path THUDM/CogVideoX-5b-I2V \
+          --data_root $DATA_ROOT \
+          --caption_column $CAPTION_COLUMN \
+          --video_column $VIDEO_COLUMN \
+          --id_token BW_STYLE \
+          --height_buckets 480 \
+          --width_buckets 720 \
+          --frame_buckets 49 \
+          --dataloader_num_workers 8 \
+          --pin_memory \
+          --validation_prompt \"BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance\" \
+          --validation_images \"/path/to/image1.png:::/path/to/image2.png\" \
+          --validation_prompt_separator ::: \
+          --num_validation_videos 1 \
+          --validation_epochs 10 \
+          --seed 42 \
+          --rank 128 \
+          --lora_alpha 128 \
+          --mixed_precision bf16 \
+          --output_dir $output_dir \
+          --max_num_frames 49 \
+          --train_batch_size 1 \
+          --max_train_steps $steps \
+          --checkpointing_steps 1000 \
+          --gradient_accumulation_steps 1 \
+          --gradient_checkpointing \
+          --learning_rate $learning_rate \
+          --lr_scheduler $lr_schedule \
+          --lr_warmup_steps 400 \
+          --lr_num_cycles 1 \
+          --enable_slicing \
+          --enable_tiling \
+          --noised_image_dropout 0.05 \
+          --optimizer $optimizer \
+          --beta1 0.9 \
+          --beta2 0.95 \
+          --weight_decay 0.001 \
+          --max_grad_norm 1.0 \
+          --allow_tf32 \
+          --report_to wandb \
+          --nccl_timeout 1800"
+
+        echo "Running command: $cmd"
+        eval $cmd
+        echo -ne "-------------------- Finished executing script --------------------\n\n"
+      done
+    done
+  done
+done

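The script above passes `--validation_prompt`, `--validation_images`, and `--validation_prompt_separator` as single `:::`-joined strings. A hedged sketch of how such arguments are presumably paired on the Python side (illustrative values, not the training script's exact code):

```python
# Illustrative: split the separator-joined strings and pair prompts with images by position.
separator = ":::"
validation_prompt = "BW_STYLE first scene:::BW_STYLE second scene"
validation_images = "/path/to/image1.png:::/path/to/image2.png"

prompts = [p.strip() for p in validation_prompt.split(separator)]
images = [i.strip() for i in validation_images.split(separator)]
assert len(prompts) == len(images), "each validation prompt needs a matching image"

for prompt, image_path in zip(prompts, images):
    print(f"validation pair: {image_path} <- {prompt[:40]}...")
```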
training/args.py

Lines changed: 14 additions & 2 deletions
@@ -110,6 +110,12 @@ def _get_validation_args(parser: argparse.ArgumentParser) -> None:
         default=None,
         help="One or more prompt(s) that is used during validation to verify that the model is learning. Multiple validation prompts should be separated by the '--validation_prompt_seperator' string.",
     )
+    parser.add_argument(
+        "--validation_images",
+        type=str,
+        default=None,
+        help="One or more image path(s)/URLs that are used during validation to verify that the model is learning. Multiple validation paths should be separated by the '--validation_prompt_separator' string. These should correspond to the order of the validation prompts.",
+    )
     parser.add_argument(
         "--validation_prompt_separator",
         type=str,
@@ -141,10 +147,10 @@ def _get_validation_args(parser: argparse.ArgumentParser) -> None:
         help="Whether or not to use the default cosine dynamic guidance schedule when sampling validation videos.",
     )
     parser.add_argument(
-        "--enable_model_cpu_offloading",
+        "--enable_model_cpu_offload",
         action="store_true",
         default=False,
-        help="Whether or not to enable model-wise CPU offloading when performing validation/testing to save memory."
+        help="Whether or not to enable model-wise CPU offloading when performing validation/testing to save memory.",
     )
 
 
@@ -305,6 +311,12 @@ def _get_training_args(parser: argparse.ArgumentParser) -> None:
         default=False,
         help="Whether or not to use VAE tiling for saving memory.",
     )
+    parser.add_argument(
+        "--noised_image_dropout",
+        type=float,
+        default=0.05,
+        help="Image condition dropout probability when finetuning image-to-video.",
+    )
 
 
 def _get_optimizer_args(parser: argparse.ArgumentParser) -> None:

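The new `--noised_image_dropout` argument is the image-condition dropout probability for I2V training. A hedged sketch of the usual mechanism (an assumption about the behavior, not the repository's exact code): with probability `p`, the image-conditioning latent for a sample is zeroed so the model also learns to denoise without the image.

```python
# Illustrative sketch of image-condition dropout, assuming the common
# "zero the conditioning latent with probability p" formulation.
import torch

def drop_image_condition(image_latents: torch.Tensor, p: float = 0.05) -> torch.Tensor:
    # image_latents: [B, ...]; sample one keep/drop decision per example
    keep = torch.rand(image_latents.shape[0], device=image_latents.device) >= p
    keep = keep.view(-1, *([1] * (image_latents.ndim - 1))).to(image_latents.dtype)
    return image_latents * keep
```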