Commit b1b72c0

CogVideoX I2V; CPU offloading; Model README descriptions (#11)
* update
* update
* update readme
* update
* update
* update model desc
* update
* Update training/prepare_dataset.py
1 parent: cc1d2e7

9 files changed: +1181 −33 lines


README.md

Lines changed: 15 additions & 1 deletion
@@ -109,10 +109,17 @@ Supported and verified memory optimizations for training include:
 > [!IMPORTANT]
 > The memory requirements are reported after running the `training/prepare_dataset.py`, which converts the videos and captions to latents and embeddings. During training, we directly load the latents and embeddings, and do not require the VAE or the T5 text encoder. However, if you perform validation/testing, these must be loaded and increase the amount of required memory. Not performing validation/testing saves a significant amount of memory, which can be used to focus solely on training if you're on smaller VRAM GPUs.
 >
-> If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying `--enable_model_cpu_offloading`.
+> If you choose to run validation/testing, you can save some memory on lower VRAM GPUs by specifying `--enable_model_cpu_offload`.
 
 ### LoRA finetuning
 
+> [!NOTE]
+> The memory requirements for image-to-video LoRA finetuning are similar to those of text-to-video on `THUDM/CogVideoX-5b`, so they haven't been reported explicitly.
+>
+> Additionally, to prepare test images for I2V finetuning, you could generate them on-the-fly by modifying the script, extract some frames from your training data using
+> `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
+> or provide a URL to a valid and accessible image.
+
 <details>
 <summary> AdamW </summary>
 
@@ -308,6 +315,13 @@ With `train_batch_size = 4`:
 
 ### Full finetuning
 
+> [!NOTE]
+> The memory requirements for image-to-video full finetuning are similar to those of text-to-video on `THUDM/CogVideoX-5b`, so they haven't been reported explicitly.
+>
+> Additionally, to prepare test images for I2V finetuning, you could generate them on-the-fly by modifying the script, extract some frames from your training data using
+> `ffmpeg -i input.mp4 -frames:v 1 frame.png`,
+> or provide a URL to a valid and accessible image.
+
 > [!NOTE]
 > Trying to run full finetuning without gradient checkpointing OOMs even on an A100 (80 GB), so the memory measurements have not been specified.

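For reference, the renamed `--enable_model_cpu_offload` flag corresponds to the standard diffusers pipeline call used at validation time. A minimal, illustrative sketch (the model ID and dtype below are assumptions, not taken from the training code):

```python
# Illustrative sketch: model-wise CPU offloading on a CogVideoX I2V pipeline.
# Each sub-model (text encoder, transformer, VAE) is moved to the GPU only
# while it is needed, trading speed for a much lower peak VRAM footprint.
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # instead of pipe.to("cuda")
```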
prepare_dataset.sh

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,8 @@ TARGET_FPS=8
 BATCH_SIZE=1
 DTYPE=fp32
 
-# To create a folder-style dataset structure without pre-encoding videos and captions'
+# To create a folder-style dataset structure without pre-encoding videos and captions
+# For Image-to-Video finetuning, make sure to pass `--save_image_latents`
 CMD_WITHOUT_PRE_ENCODING="\
   torchrun --nproc_per_node=$NUM_GPUS \
     training/prepare_dataset.py \

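The new `--save_image_latents` flag is specific to image-to-video preparation. A rough, hypothetical sketch of the idea (the function, names, and shapes below are illustrative, not the repository's actual `prepare_dataset.py` code): encode the first frame of each clip with the CogVideoX VAE and store that latent next to the video latents, so the I2V trainer can condition on it without loading the VAE.

```python
# Hypothetical sketch only -- not the repository's prepare_dataset.py implementation.
import torch
from diffusers import AutoencoderKLCogVideoX

vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")

@torch.no_grad()
def encode_first_frame(video: torch.Tensor) -> torch.Tensor:
    # video: [B, C, F, H, W] scaled to [-1, 1]; keep only the first frame
    first_frame = video[:, :, :1].to("cuda", dtype=torch.bfloat16)
    latent = vae.encode(first_frame).latent_dist.sample()
    return latent.cpu()  # stored on disk alongside the video latents
```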
train_image_to_video_lora.sh

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
+export TORCHDYNAMO_VERBOSE=1
+export WANDB_MODE="offline"
+export NCCL_P2P_DISABLE=1
+export TORCH_NCCL_ENABLE_MONITORING=0
+
+GPU_IDS="0"
+
+# Training Configurations
+# Experiment with as many hyperparameters as you want!
+LEARNING_RATES=("1e-4" "1e-3")
+LR_SCHEDULES=("cosine_with_restarts")
+OPTIMIZERS=("adamw" "adam")
+MAX_TRAIN_STEPS=("3000")
+
+# Single GPU uncompiled training
+ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml"
+
+# Absolute path to where the data is located. Make sure to have read the README for how to prepare data.
+# This example assumes you downloaded an already prepared dataset from HF CLI as follows:
+# huggingface-cli download --repo-type dataset Wild-Heart/Disney-VideoGeneration-Dataset --local-dir /path/to/my/datasets/disney-dataset
+DATA_ROOT="/path/to/my/datasets/disney-dataset"
+CAPTION_COLUMN="prompt.txt"
+VIDEO_COLUMN="videos.txt"
+
+# Launch experiments with different hyperparameters
+for learning_rate in "${LEARNING_RATES[@]}"; do
+  for lr_schedule in "${LR_SCHEDULES[@]}"; do
+    for optimizer in "${OPTIMIZERS[@]}"; do
+      for steps in "${MAX_TRAIN_STEPS[@]}"; do
+        output_dir="/path/to/my/models/cogvideox-lora__optimizer_${optimizer}__steps_${steps}__lr-schedule_${lr_schedule}__learning-rate_${learning_rate}/"
+
+        cmd="accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_image_to_video_lora.py \
+          --pretrained_model_name_or_path THUDM/CogVideoX-5b-I2V \
+          --data_root $DATA_ROOT \
+          --caption_column $CAPTION_COLUMN \
+          --video_column $VIDEO_COLUMN \
+          --id_token BW_STYLE \
+          --height_buckets 480 \
+          --width_buckets 720 \
+          --frame_buckets 49 \
+          --dataloader_num_workers 8 \
+          --pin_memory \
+          --validation_prompt \"BW_STYLE A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions:::BW_STYLE A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance\" \
+          --validation_images \"/path/to/image1.png:::/path/to/image2.png\" \
+          --validation_prompt_separator ::: \
+          --num_validation_videos 1 \
+          --validation_epochs 10 \
+          --seed 42 \
+          --rank 128 \
+          --lora_alpha 128 \
+          --mixed_precision bf16 \
+          --output_dir $output_dir \
+          --max_num_frames 49 \
+          --train_batch_size 1 \
+          --max_train_steps $steps \
+          --checkpointing_steps 1000 \
+          --gradient_accumulation_steps 1 \
+          --gradient_checkpointing \
+          --learning_rate $learning_rate \
+          --lr_scheduler $lr_schedule \
+          --lr_warmup_steps 400 \
+          --lr_num_cycles 1 \
+          --enable_slicing \
+          --enable_tiling \
+          --noised_image_dropout 0.05 \
+          --optimizer $optimizer \
+          --beta1 0.9 \
+          --beta2 0.95 \
+          --weight_decay 0.001 \
+          --max_grad_norm 1.0 \
+          --allow_tf32 \
+          --report_to wandb \
+          --nccl_timeout 1800"
+
+        echo "Running command: $cmd"
+        eval $cmd
+        echo -ne "-------------------- Finished executing script --------------------\n\n"
+      done
+    done
+  done
+done

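The script above passes `--validation_prompt`, `--validation_images`, and `--validation_prompt_separator` as single `:::`-joined strings. A hedged sketch of how such arguments are presumably paired on the Python side (illustrative values, not the training script's exact code):

```python
# Illustrative: split the separator-joined strings and pair prompts with images by position.
separator = ":::"
validation_prompt = "BW_STYLE first scene:::BW_STYLE second scene"
validation_images = "/path/to/image1.png:::/path/to/image2.png"

prompts = [p.strip() for p in validation_prompt.split(separator)]
images = [i.strip() for i in validation_images.split(separator)]
assert len(prompts) == len(images), "each validation prompt needs a matching image"

for prompt, image_path in zip(prompts, images):
    print(f"validation pair: {image_path} <- {prompt[:40]}...")
```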
training/args.py

Lines changed: 14 additions & 2 deletions
@@ -110,6 +110,12 @@ def _get_validation_args(parser: argparse.ArgumentParser) -> None:
         default=None,
         help="One or more prompt(s) that is used during validation to verify that the model is learning. Multiple validation prompts should be separated by the '--validation_prompt_seperator' string.",
     )
+    parser.add_argument(
+        "--validation_images",
+        type=str,
+        default=None,
+        help="One or more image path(s)/URLs that are used during validation to verify that the model is learning. Multiple validation paths should be separated by the '--validation_prompt_separator' string. These should correspond to the order of the validation prompts.",
+    )
     parser.add_argument(
         "--validation_prompt_separator",
         type=str,
@@ -141,10 +147,10 @@ def _get_validation_args(parser: argparse.ArgumentParser) -> None:
         help="Whether or not to use the default cosine dynamic guidance schedule when sampling validation videos.",
     )
     parser.add_argument(
-        "--enable_model_cpu_offloading",
+        "--enable_model_cpu_offload",
         action="store_true",
         default=False,
-        help="Whether or not to enable model-wise CPU offloading when performing validation/testing to save memory."
+        help="Whether or not to enable model-wise CPU offloading when performing validation/testing to save memory.",
     )
 
 
@@ -305,6 +311,12 @@ def _get_training_args(parser: argparse.ArgumentParser) -> None:
         default=False,
         help="Whether or not to use VAE tiling for saving memory.",
     )
+    parser.add_argument(
+        "--noised_image_dropout",
+        type=float,
+        default=0.05,
+        help="Image condition dropout probability when finetuning image-to-video.",
+    )
 
 
 def _get_optimizer_args(parser: argparse.ArgumentParser) -> None:

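The new `--noised_image_dropout` argument is the image-condition dropout probability for I2V training. A hedged sketch of the usual mechanism (an assumption about the behavior, not the repository's exact code): with probability `p`, the image-conditioning latent for a sample is zeroed so the model also learns to denoise without the image.

```python
# Illustrative sketch of image-condition dropout, assuming the common
# "zero the conditioning latent with probability p" formulation.
import torch

def drop_image_condition(image_latents: torch.Tensor, p: float = 0.05) -> torch.Tensor:
    # image_latents: [B, ...]; sample one keep/drop decision per example
    keep = torch.rand(image_latents.shape[0], device=image_latents.device) >= p
    keep = keep.view(-1, *([1] * (image_latents.ndim - 1))).to(image_latents.dtype)
    return image_latents * keep
```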