
Commit e5df80c

ArEnSc, sayakpaul, and a-r-r-o-w authored

Full Finetuning for LTX, possibly extended to other models. (#192)

* Full Finetuning for LTX, possibly extended to other models.
* Change name of the flag
* Used disable grad for component when LoRA fine-tuning is enabled
* Suggestions addressed: renamed to SFT; added 2 other models. Testing required.
* Switching to Full FineTuning
* Run linter.
* Parse subfolder when needed.
* Tackle saving and loading hooks.
* Tackle validation.
* Fix subfolder bug.
* Remove __class__.
* Refactor.
* Remove unnecessary changes.
* Handle saving of final model weights correctly.
* LTX uses a default frame rate of 24 FPS, so the validation output frame rate must match that value: add frame-rate args and update video output and inference frame rate.
* There was a result_args mapping that needed to be modified.
* Update README.md.
* Update docs.
* Add training configuration in cogvideox.

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Aryan <[email protected]>
1 parent a389807 commit e5df80c

File tree

17 files changed, +365 -131 lines changed

README.md

Lines changed: 9 additions & 9 deletions

````diff
@@ -12,7 +12,8 @@ FineTrainers is a work-in-progress library to support (accessible) training of v
 
 ## News
 
-- 🔥 **2024-12-20**: Support for T2V LoRA finetuning of [CogVideoX](https://huggingface.co/docs/diffusers/main/api/pipelines/cogvideox) added!
+- 🔥 **2024-01-13**: Support for T2V full-finetuning added! Thanks to @ArEnSc for taking up the initiative!
+- 🔥 **2024-01-03**: Support for T2V LoRA finetuning of [CogVideoX](https://huggingface.co/docs/diffusers/main/api/pipelines/cogvideox) added!
 - 🔥 **2024-12-20**: Support for T2V LoRA finetuning of [Hunyuan Video](https://huggingface.co/docs/diffusers/main/api/pipelines/hunyuan_video) added! We would like to thank @SHYuanBest for his work on a training script [here](https://github.com/huggingface/diffusers/pull/10254).
 - 🔥 **2024-12-18**: Support for T2V LoRA finetuning of [LTX Video](https://huggingface.co/docs/diffusers/main/api/pipelines/ltx_video) added!
@@ -137,17 +138,16 @@ For inference, refer [here](./docs/training/ltx_video.md#inference). For docs re
 
 <div align="center">
 
-| **Model Name** | **Tasks** | **Min. GPU VRAM** |
-|:---:|:---:|:---:|
-| [LTX-Video](./docs/training/ltx_video.md) | Text-to-Video | 11 GB |
-| [HunyuanVideo](./docs/training/hunyuan_video.md) | Text-to-Video | 42 GB |
-| [CogVideoX](./docs/training/cogvideox.md) | Text-to-Video | 12GB<sup>*</sup> |
+| **Model Name** | **Tasks** | **Min. LoRA VRAM<sup>*</sup>** | **Min. Full Finetuning VRAM<sup>^</sup>** |
+|:------------------------------------------------:|:-------------:|:----------------------------------:|:---------------------------------------------:|
+| [LTX-Video](./docs/training/ltx_video.md) | Text-to-Video | 11 GB | 21 GB |
+| [HunyuanVideo](./docs/training/hunyuan_video.md) | Text-to-Video | 42 GB | OOM |
+| [CogVideoX-5b](./docs/training/cogvideox.md) | Text-to-Video | 21 GB | 53 GB |
 
 </div>
 
-<sub><sup>*</sup>Noted for the 5B variant.</sub>
-
-Note that the memory consumption in the table is reported with most of the options, discussed in [docs/training/optimizations](./docs/training/optimization.md), enabled.
+<sub><sup>*</sup>Noted for training-only, no validation, at resolution `49x512x768`, rank 128, with pre-computation, using fp8 weights & gradient checkpointing. Pre-computation of conditions and latents may require higher limits (but typically under 16 GB).</sub><br/>
+<sub><sup>^</sup>Noted for training-only, no validation, at resolution `49x512x768`, with pre-computation, using bf16 weights & gradient checkpointing.</sub>
 
 If you would like to use a custom dataset, refer to the dataset preparation guide [here](./docs/dataset/README.md).
````
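The full-finetuning column can be sanity-checked with simple parameter arithmetic. A back-of-the-envelope sketch for LTX-Video follows (assumptions not stated in this commit: bf16 weights and gradients, fp32 AdamW moment buffers, activation memory ignored under gradient checkpointing; the parameter count comes from the LTX training configuration later in this diff):

```python
# Back-of-the-envelope estimate for LTX-Video full finetuning (illustrative).
GiB = 1024**3
params = 1_923_385_472             # trainable parameters, from the LTX docs below

weights_gib = params * 2 / GiB     # bf16 weights   -> ~3.58 GiB (matches "before training start" in the LTX table)
grads_gib = params * 2 / GiB       # bf16 gradients -> ~3.58 GiB
adamw_gib = params * 2 * 4 / GiB   # two fp32 AdamW moment buffers -> ~14.33 GiB

print(f"{weights_gib + grads_gib + adamw_gib:.1f} GiB")  # ~21.5 GiB, in line with the 21 GB table entry
```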

docs/training/README.md

Lines changed: 14 additions & 8 deletions

````diff
@@ -1,8 +1,9 @@
-This directory contains the training-related specifications for all the models we support in `finetrainers`. Each model page has:
+# FineTrainers training documentation
 
-* an example training command
-* inference example
-* numbers on memory consumption
+This directory contains the training-related specifications for all the models we support in `finetrainers`. Each model page has:
+- an example training command
+- inference example
+- numbers on memory consumption
 
 By default, we don't include any validation-related arguments in the example training commands. To enable validation inference, one can pass:
 
@@ -12,8 +13,13 @@ By default, we don't include any validation-related arguments in the example tra
 + --validation_steps 100
 ```
 
-## Model-specific docs
+Supported models:
+- [CogVideoX](./cogvideox.md)
+- [LTX-Video](./ltx_video.md)
+- [HunyuanVideo](./hunyuan_video.md)
+
+Supported training types:
+- LoRA (`--training_type lora`)
+- Full finetuning (`--training_type full-finetune`)
 
-* [CogVideoX](./cogvideox.md)
-* [LTX-Video](./ltx_video.md)
-* [HunyuanVideo](./hunyuan_video.md)
+Arguments for training are well-documented in the code. For more information, please run `python train.py --help`.
````
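To make the new training-type flag concrete, here is a minimal, self-contained argparse sketch mirroring the `--training_type` change made in `finetrainers/args.py` later in this commit (illustrative, not the library's parser):

```python
import argparse

# Minimal stand-in for finetrainers' parser, mirroring the diff below.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--training_type",
    type=str,
    choices=["lora", "full-finetune"],  # new in this commit
    required=True,
    help="Type of training to perform. Choose between ['lora', 'full-finetune']",
)

args = parser.parse_args(["--training_type", "full-finetune"])
print(args.training_type)  # -> full-finetune
# Any other value, e.g. "sft", exits with an "invalid choice" error thanks to `choices`.
```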

docs/training/cogvideox.md

Lines changed: 29 additions & 0 deletions

````diff
@@ -2,6 +2,8 @@
 
 ## Training
 
+For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+
 ```bash
 #!/bin/bash
 export WANDB_MODE="offline"
@@ -84,6 +86,8 @@ echo -ne "-------------------- Finished executing script --------------------\n\
 
 ## Memory Usage
 
+### LoRA
+
 LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x480x720` resolutions, **with precomputation**:
 
 ```
@@ -109,6 +113,31 @@ Training configuration: {
 | after validation end | 11.145 | 28.324 |
 | after training end | 11.144 | 11.592 |
 
+### Full finetuning
+
+```
+Training configuration: {
+    "trainable parameters": 5570283072,
+    "total samples": 1,
+    "train epochs": 2,
+    "train steps": 2,
+    "batches per device": 1,
+    "total batches observed per epoch": 1,
+    "train batch size": 1,
+    "gradient accumulation steps": 1
+}
+```
+
+| stage | memory_allocated | max_memory_reserved |
+|:-----------------------------:|:-----------------:|:-------------------:|
+| after precomputing conditions | 8.880 | 8.941 |
+| after precomputing latents | 9.300 | 12.441 |
+| before training start | 10.376 | 10.387 |
+| after epoch 1 | 31.160 | 52.939 |
+| before validation start | 31.161 | 52.939 |
+| after validation end | 31.161 | 52.939 |
+| after training end | 31.160 | 34.295 |
+
 ## Supported checkpoints
 
 CogVideoX has multiple checkpoints as one can note [here](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce). The following checkpoints were tested with `finetrainers` and are known to be working:
````

docs/training/hunyuan_video.md

Lines changed: 8 additions & 0 deletions

````diff
@@ -2,6 +2,8 @@
 
 ## Training
 
+For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
+
 ```bash
 #!/bin/bash
 
@@ -87,6 +89,8 @@ echo -ne "-------------------- Finished executing script --------------------\n\
 
 ## Memory Usage
 
+### LoRA
+
 LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolutions, **without precomputation**:
 
 ```
@@ -139,6 +143,10 @@ Training configuration: {
 
 Note: requires about `47` GB of VRAM with validation. If validation is not performed, the memory usage is reduced to about `42` GB.
 
+### Full finetuning
+
+Currently, full finetuning is not supported for HunyuanVideo. It goes out of memory (OOM) for `49x512x768` resolutions.
+
 ## Inference
 
 Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
````

docs/training/ltx_video.md

Lines changed: 28 additions & 1 deletion

````diff
@@ -2,7 +2,7 @@
 
 ## Training
 
-Provided you have a dataset:
+For LoRA training, specify `--training_type lora`. For full finetuning, specify `--training_type full-finetune`.
 
 ```bash
 #!/bin/bash
@@ -88,6 +88,8 @@ echo -ne "-------------------- Finished executing script --------------------\n\
 
 ## Memory Usage
 
+### LoRA
+
 LoRA with rank 128, batch size 1, gradient checkpointing, optimizer adamw, `49x512x768` resolution, **without precomputation**:
 
 ```
@@ -140,6 +142,31 @@ Training configuration: {
 
 Note: requires about `17.5` GB of VRAM with precomputation. If validation is not performed, the memory usage is reduced to `11` GB.
 
+### Full Finetuning
+
+```
+Training configuration: {
+    "trainable parameters": 1923385472,
+    "total samples": 1,
+    "train epochs": 10,
+    "train steps": 10,
+    "batches per device": 1,
+    "total batches observed per epoch": 1,
+    "train batch size": 1,
+    "gradient accumulation steps": 1
+}
+```
+
+| stage | memory_allocated | max_memory_reserved |
+|:-----------------------------:|:----------------:|:-------------------:|
+| after precomputing conditions | 8.89 | 8.937 |
+| after precomputing latents | 9.701 | 11.615 |
+| before training start | 3.583 | 4.025 |
+| after epoch 1 | 10.769 | 20.357 |
+| before validation start | 10.769 | 20.357 |
+| after validation end | 10.769 | 28.332 |
+| after training end | 10.769 | 12.904 |
+
 ## Inference
 
 Assuming your LoRA is saved and pushed to the HF Hub, and named `my-awesome-name/my-awesome-lora`, we can now use the finetuned model for inference:
````
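For reference, a minimal LoRA inference sketch for LTX-Video (assumptions: the `Lightricks/LTX-Video` diffusers checkpoint, a CUDA device, the placeholder LoRA name above, and a made-up prompt; the docs' own snippet may differ):

```python
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("my-awesome-name/my-awesome-lora")
pipe.to("cuda")

video = pipe(
    prompt="A cat walking on wet grass at sunset",
    num_frames=49,
    height=512,
    width=768,
).frames[0]

# LTX-Video generates at 24 FPS by default, which is why this commit adds
# --validation_frame_rate: the exported video should match that rate.
export_to_video(video, "output.mp4", fps=24)
```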

finetrainers/args.py

Lines changed: 28 additions & 2 deletions

````diff
@@ -207,6 +207,8 @@ class Args:
         Perform validation every `n` training steps.
     enable_model_cpu_offload (`bool`, defaults to `False`):
         Whether or not to offload different modeling components to CPU during validation.
+    validation_frame_rate (`int`, defaults to `25`):
+        Frame rate to use for the validation videos. This value defaults to 25, as used in the LTX Video pipeline.
 
     MISCELLANEOUS ARGUMENTS
     -----------------------
@@ -319,6 +321,7 @@ class Args:
     validation_every_n_epochs: Optional[int] = None
     validation_every_n_steps: Optional[int] = None
     enable_model_cpu_offload: bool = False
+    validation_frame_rate: int = 25
 
     # Miscellaneous arguments
     tracker_name: str = "finetrainers"
@@ -417,6 +420,7 @@ def to_dict(self) -> Dict[str, Any]:
             "validation_every_n_epochs": self.validation_every_n_epochs,
             "validation_every_n_steps": self.validation_every_n_steps,
             "enable_model_cpu_offload": self.enable_model_cpu_offload,
+            "validation_frame_rate": self.validation_frame_rate,
         },
         "miscellaneous_arguments": {
             "tracker_name": self.tracker_name,
@@ -460,6 +464,7 @@ def parse_arguments() -> Args:
 
 
 def validate_args(args: Args):
+    _validate_training_args(args)
     _validate_validation_args(args)
 
 
@@ -678,8 +683,9 @@ def _add_training_arguments(parser: argparse.ArgumentParser) -> None:
     parser.add_argument(
         "--training_type",
         type=str,
+        choices=["lora", "full-finetune"],
         required=True,
-        help="Type of training to perform. Choose between ['lora']",
+        help="Type of training to perform. Choose between ['lora', 'full-finetune']",
     )
     parser.add_argument("--seed", type=int, default=None, help="A seed for reproducible training.")
     parser.add_argument(
@@ -713,7 +719,11 @@ def _add_training_arguments(parser: argparse.ArgumentParser) -> None:
         help="The lora_alpha to compute scaling factor (lora_alpha / rank) for LoRA matrices.",
     )
     parser.add_argument(
-        "--target_modules", type=str, default="to_k to_q to_v to_out.0", nargs="+", help="The target modules for LoRA."
+        "--target_modules",
+        type=str,
+        default=["to_k", "to_q", "to_v", "to_out.0"],
+        nargs="+",
+        help="The target modules for LoRA.",
     )
     parser.add_argument(
         "--gradient_accumulation_steps",
@@ -890,6 +900,12 @@ def _add_validation_arguments(parser: argparse.ArgumentParser) -> None:
         default=None,
         help="Run validation every X training steps. Validation consists of running the validation prompt `args.num_validation_videos` times.",
     )
+    parser.add_argument(
+        "--validation_frame_rate",
+        type=int,
+        default=25,
+        help="Frame rate to use for the validation videos.",
+    )
     parser.add_argument(
         "--enable_model_cpu_offload",
         action="store_true",
@@ -1085,6 +1101,7 @@ def _map_to_args_type(args: Dict[str, Any]) -> Args:
     result_args.validation_every_n_epochs = args.validation_epochs
     result_args.validation_every_n_steps = args.validation_steps
     result_args.enable_model_cpu_offload = args.enable_model_cpu_offload
+    result_args.validation_frame_rate = args.validation_frame_rate
 
     # Miscellaneous arguments
     result_args.tracker_name = args.tracker_name
@@ -1100,6 +1117,15 @@ def _map_to_args_type(args: Dict[str, Any]) -> Args:
     return result_args
 
 
+def _validate_training_args(args: Args):
+    if args.training_type == "lora":
+        assert args.rank is not None, "Rank is required for LoRA training"
+        assert args.lora_alpha is not None, "LoRA alpha is required for LoRA training"
+        assert (
+            args.target_modules is not None and len(args.target_modules) > 0
+        ), "Target modules are required for LoRA training"
+
+
 def _validate_validation_args(args: Args):
     assert args.validation_prompts is not None, "Validation prompts are required for validation"
     if args.validation_images is not None:
````
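The behavior of the new `_validate_training_args` hook, restated as a self-contained sketch that runs without the library (illustrative, not the library code itself):

```python
# Self-contained restatement of the new LoRA-argument check.
def check_lora_args(training_type, rank, lora_alpha, target_modules):
    if training_type == "lora":
        assert rank is not None, "Rank is required for LoRA training"
        assert lora_alpha is not None, "LoRA alpha is required for LoRA training"
        assert (
            target_modules is not None and len(target_modules) > 0
        ), "Target modules are required for LoRA training"

check_lora_args("lora", 128, 64, ["to_k", "to_q", "to_v", "to_out.0"])  # passes
check_lora_args("full-finetune", None, None, None)  # passes: LoRA args not required
```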

finetrainers/cogvideox/__init__.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -1 +1,2 @@
 from .cogvideox_lora import COGVIDEOX_T2V_LORA_CONFIG
+from .full_finetune import COGVIDEOX_T2V_FULL_FINETUNE_CONFIG
````

finetrainers/cogvideox/cogvideox_lora.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -311,6 +311,7 @@ def _pad_frames(latents: torch.Tensor, patch_size_t: int):
     return latents
 
 
+# TODO(aryan): refactor into model specs for better re-use
 COGVIDEOX_T2V_LORA_CONFIG = {
     "pipeline_cls": CogVideoXPipeline,
     "load_condition_models": load_condition_models,
````
finetrainers/cogvideox/full_finetune.py

Lines changed: 32 additions & 0 deletions

````diff
@@ -0,0 +1,32 @@
+from diffusers import CogVideoXPipeline
+
+from .cogvideox_lora import (
+    calculate_noisy_latents,
+    collate_fn_t2v,
+    forward_pass,
+    initialize_pipeline,
+    load_condition_models,
+    load_diffusion_models,
+    load_latent_models,
+    post_latent_preparation,
+    prepare_conditions,
+    prepare_latents,
+    validation,
+)
+
+
+# TODO(aryan): refactor into model specs for better re-use
+COGVIDEOX_T2V_FULL_FINETUNE_CONFIG = {
+    "pipeline_cls": CogVideoXPipeline,
+    "load_condition_models": load_condition_models,
+    "load_latent_models": load_latent_models,
+    "load_diffusion_models": load_diffusion_models,
+    "initialize_pipeline": initialize_pipeline,
+    "prepare_conditions": prepare_conditions,
+    "prepare_latents": prepare_latents,
+    "post_latent_preparation": post_latent_preparation,
+    "collate_fn": collate_fn_t2v,
+    "calculate_noisy_latents": calculate_noisy_latents,
+    "forward_pass": forward_pass,
+    "validation": validation,
+}
````
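These per-model dicts act as lightweight model specs: the full-finetune config reuses every callable from the LoRA module, so a trainer can select a spec by training type and call its entries without model-specific branching. A hypothetical dispatch sketch (the trainer-side wiring is not part of this diff):

```python
# Hypothetical selection of a model spec by training type (illustrative only).
from finetrainers.cogvideox import (
    COGVIDEOX_T2V_FULL_FINETUNE_CONFIG,
    COGVIDEOX_T2V_LORA_CONFIG,
)

CONFIGS = {
    "lora": COGVIDEOX_T2V_LORA_CONFIG,
    "full-finetune": COGVIDEOX_T2V_FULL_FINETUNE_CONFIG,
}

spec = CONFIGS["full-finetune"]
pipeline_cls = spec["pipeline_cls"]   # CogVideoXPipeline
validation_fn = spec["validation"]    # shared with the LoRA config
```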
finetrainers/hunyuan_video/__init__.py

Lines changed: 1 addition & 0 deletions

````diff
@@ -1 +1,2 @@
+from .full_finetune import HUNYUAN_VIDEO_T2V_FULL_FINETUNE_CONFIG
 from .hunyuan_video_lora import HUNYUAN_VIDEO_T2V_LORA_CONFIG
````
