
Commit 98ec6d8

Author: Darshil Jariwala (committed)
Commit message: merging diffusers
2 parents 7fc3daa + 8cdcdd9 commit 98ec6d8

File tree: 78 files changed (+4459, -398 lines)


.github/workflows/push_tests.yml

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 name: Fast GPU Tests on main
 
 on:
+  workflow_dispatch:
   push:
     branches:
       - main

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 7 additions & 1 deletion
@@ -98,6 +98,12 @@ It is also worth noting that torchao quantization is fully compatible with [torc
   - all
   - __call__
 
+## CogVideoXVideoToVideoPipeline
+
+[[autodoc]] CogVideoXVideoToVideoPipeline
+  - all
+  - __call__
+
 ## CogVideoXPipelineOutput
 
-[[autodoc]] pipelines.cogvideo.pipeline_cogvideox.CogVideoXPipelineOutput
+[[autodoc]] pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput
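For orientation on the newly documented pipeline, here is a minimal usage sketch. The checkpoint name, prompt, strength, and step count are illustrative assumptions, and `load_video` may not exist in older diffusers versions (a plain list of PIL frames works in its place).

```python
import torch
from diffusers import CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Assumed checkpoint; any CogVideoX checkpoint compatible with the pipeline should work.
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# `load_video` returns a list of PIL frames; passing your own list of frames also works.
input_video = load_video("input.mp4")
output = pipe(
    video=input_video,
    prompt="a golden retriever surfing a wave, cinematic",
    strength=0.8,            # how far the result may deviate from the input video
    guidance_scale=6.0,
    num_inference_steps=50,
)
export_to_video(output.frames[0], "output.mp4", fps=8)
```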

docs/source/en/api/pipelines/flux.md

Lines changed: 12 additions & 0 deletions
@@ -163,3 +163,15 @@ image.save("flux-fp8-dev.png")
 [[autodoc]] FluxPipeline
   - all
   - __call__
+
+## FluxImg2ImgPipeline
+
+[[autodoc]] FluxImg2ImgPipeline
+  - all
+  - __call__
+
+## FluxInpaintPipeline
+
+[[autodoc]] FluxInpaintPipeline
+  - all
+  - __call__
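For context on the two pipelines being documented, a minimal image-to-image sketch follows; the checkpoint, image URL, prompt, and strength are assumptions, and `FluxInpaintPipeline` follows the same pattern with an additional `mask_image` argument.

```python
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained("black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Placeholder image URL; any RGB image resized to the target resolution works.
init_image = load_image("https://example.com/cat.png").resize((1024, 1024))
image = pipe(
    prompt="a cat wizard, highly detailed, fantasy art",
    image=init_image,
    strength=0.95,
    num_inference_steps=4,   # schnell is distilled for few steps
    guidance_scale=0.0,
).images[0]
image.save("flux-img2img.png")
```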

examples/dreambooth/README_flux.md

Lines changed: 41 additions & 8 deletions
@@ -8,8 +8,10 @@ The `train_dreambooth_flux.py` script shows how to implement the training proced
 >
 > Flux can be quite expensive to run on consumer hardware devices and as a result finetuning it comes with high memory requirements -
 > a LoRA with a rank of 16 (w/ all components trained) can exceed 40GB of VRAM for training.
-> For more tips & guidance on training on a resource-constrained device please visit [`@bghira`'s guide](https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md)
 
+> For more tips & guidance on training on a resource-constrained device and general good practices please check out these great guides and trainers for FLUX:
+> 1) [`@bghira`'s guide](https://github.com/bghira/SimpleTuner/blob/main/documentation/quickstart/FLUX.md)
+> 2) [`ostris`'s guide](https://github.com/ostris/ai-toolkit?tab=readme-ov-file#flux1-training)
 
 > [!NOTE]
 > **Gated model**
@@ -100,8 +102,10 @@ accelerate launch train_dreambooth_flux.py \
   --instance_prompt="a photo of sks dog" \
   --resolution=1024 \
   --train_batch_size=1 \
+  --guidance_scale=1 \
   --gradient_accumulation_steps=4 \
-  --learning_rate=1e-4 \
+  --optimizer="prodigy" \
+  --learning_rate=1. \
   --report_to="wandb" \
   --lr_scheduler="constant" \
   --lr_warmup_steps=0 \
@@ -120,15 +124,23 @@ To better track our training experiments, we're using the following flags in the
 > [!NOTE]
 > If you want to train using long prompts with the T5 text encoder, you can use `--max_sequence_length` to set the token limit. The default is 77, but it can be increased to as high as 512. Note that this will use more resources and may slow down the training in some cases.
 
-> [!TIP]
-> You can pass `--use_8bit_adam` to reduce the memory requirements of training. Make sure to install `bitsandbytes` if you want to do so.
-
 ## LoRA + DreamBooth
 
 [LoRA](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) is a popular parameter-efficient fine-tuning technique that allows you to achieve full-finetuning like performance but with a fraction of learnable parameters.
 
 Note also that we use PEFT library as backend for LoRA training, make sure to have `peft>=0.6.0` installed in your environment.
 
+### Prodigy Optimizer
+Prodigy is an adaptive optimizer that dynamically adjusts the learning rate of the learned parameters based on past gradients, allowing for more efficient convergence.
+By using Prodigy we can "eliminate" the need for manual learning rate tuning. Read more [here](https://huggingface.co/blog/sdxl_lora_advanced_script#adaptive-optimizers).
+
+To use Prodigy, specify
+```bash
+--optimizer="prodigy"
+```
+> [!TIP]
+> When using Prodigy it's generally good practice to set `--learning_rate=1.0`.
+
 To perform DreamBooth with LoRA, run:
 
 ```bash
@@ -144,8 +156,10 @@ accelerate launch train_dreambooth_lora_flux.py \
   --instance_prompt="a photo of sks dog" \
   --resolution=512 \
   --train_batch_size=1 \
+  --guidance_scale=1 \
   --gradient_accumulation_steps=4 \
-  --learning_rate=1e-5 \
+  --optimizer="prodigy" \
+  --learning_rate=1. \
   --report_to="wandb" \
   --lr_scheduler="constant" \
   --lr_warmup_steps=0 \
@@ -162,6 +176,7 @@ Alongside the transformer, fine-tuning of the CLIP text encoder is also supporte
 To do so, just specify `--train_text_encoder` while launching training. Please keep the following points in mind:
 
 > [!NOTE]
+> This is still an experimental feature.
 > FLUX.1 has 2 text encoders (CLIP L/14 and T5-v1.1-XXL).
 By enabling `--train_text_encoder`, fine-tuning of the **CLIP encoder** is performed.
 > At the moment, T5 fine-tuning is not supported and weights remain frozen when text encoder training is enabled.
@@ -180,8 +195,10 @@ accelerate launch train_dreambooth_lora_flux.py \
   --instance_prompt="a photo of sks dog" \
   --resolution=512 \
   --train_batch_size=1 \
+  --guidance_scale=1 \
   --gradient_accumulation_steps=4 \
-  --learning_rate=1e-5 \
+  --optimizer="prodigy" \
+  --learning_rate=1. \
   --report_to="wandb" \
   --lr_scheduler="constant" \
   --lr_warmup_steps=0 \
@@ -191,5 +208,21 @@ accelerate launch train_dreambooth_lora_flux.py \
   --push_to_hub
 ```
 
+## Memory Optimizations
+As mentioned, Flux DreamBooth LoRA training is very memory intensive. Here are some options (some still experimental) for more memory-efficient training.
+### Image Resolution
+An easy way to mitigate some of the memory requirements is through `--resolution`. `--resolution` refers to the resolution of the input images; all the images in the train/validation dataset are resized to it.
+Note that by default, images are resized to a resolution of 512, which is good to keep in mind in case you're accustomed to training on higher resolutions.
+### Gradient Checkpointing and Accumulation
+* `--gradient_accumulation_steps` refers to the number of update steps to accumulate before performing a backward/update pass.
+By passing a value > 1 you can reduce the number of backward/update passes and hence also the memory requirements.
+* With `--gradient_checkpointing` we can save memory by not storing all intermediate activations during the forward pass.
+Instead, only a subset of these activations (the checkpoints) is stored and the rest is recomputed as needed during the backward pass. Note that this comes at the expense of a slower backward pass.
+### 8-bit-Adam Optimizer
+When training with `AdamW` (doesn't apply to `prodigy`) you can pass `--use_8bit_adam` to reduce the memory requirements of training.
+Make sure to install `bitsandbytes` if you want to do so.
+### Latent Caching
+When training without validation runs, we can pre-encode the training images with the VAE and then delete it to free up some memory.
+To enable `latent_caching`, first use the version in [this PR](https://github.com/huggingface/diffusers/blob/1b195933d04e4c8281a2634128c0d2d380893f73/examples/dreambooth/train_dreambooth_lora_flux.py), and then pass `--cache_latents`.
 ## Other notes
-Thanks to `bghira` for their help with reviewing & insight sharing ♥️
+Thanks to `bghira` and `ostris` for their help with reviewing & insight sharing ♥️
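The Prodigy section added to the README above boils down to swapping the optimizer class in the training loop. Below is a minimal sketch of what `--optimizer="prodigy"` amounts to, assuming the `prodigyopt` package (`pip install prodigyopt`) used by the diffusers training scripts; the toy model and hyperparameters are illustrative only.

```python
import torch
from prodigyopt import Prodigy  # assumed dependency: pip install prodigyopt

model = torch.nn.Linear(64, 64)

# Prodigy estimates its own step size from past gradients, so the nominal
# learning rate acts mostly as a multiplier -- hence the recommended --learning_rate=1.0.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.01)

for _ in range(10):
    loss = model(torch.randn(8, 64)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```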

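And a rough sketch of the latent-caching idea from the same README section: encode the training images once with the VAE, keep the latents, and free the VAE afterwards. The VAE checkpoint, dummy batches, and device handling here are assumptions, not the script's exact code; the linked PR implements the real `--cache_latents` path.

```python
import torch
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
# Any AutoencoderKL demonstrates the pattern; the Flux VAE lives in a gated repo.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype).to(device)

# Stand-in for the training dataloader: batches of normalized image tensors in [-1, 1].
batches = [torch.rand(1, 3, 512, 512) * 2 - 1 for _ in range(4)]

latents_cache = []
with torch.no_grad():
    for pixel_values in batches:
        pixel_values = pixel_values.to(device, dtype=vae.dtype)
        latents_cache.append(vae.encode(pixel_values).latent_dist)

# With no validation images to decode, the VAE is no longer needed during training.
del vae
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
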
examples/dreambooth/train_dreambooth_lora_sd3.py

Lines changed: 9 additions & 15 deletions
@@ -15,7 +15,6 @@
 
 import argparse
 import copy
-import gc
 import itertools
 import logging
 import math
@@ -56,6 +55,7 @@
 from diffusers.training_utils import (
     _set_state_dict_into_text_encoder,
     cast_training_params,
+    clear_objs_and_retain_memory,
     compute_density_for_timestep_sampling,
     compute_loss_weighting_for_sd3,
 )
@@ -210,9 +210,7 @@ def log_validation(
         }
     )
 
-    del pipeline
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
+    clear_objs_and_retain_memory(objs=[pipeline])
 
     return images

@@ -1107,9 +1105,7 @@ def main(args):
                 image_filename = class_images_dir / f"{example['index'][i] + cur_class_images}-{hash_image}.jpg"
                 image.save(image_filename)
 
-        del pipeline
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
+        clear_objs_and_retain_memory(objs=[pipeline])
 
     # Handle the repository creation
     if accelerator.is_main_process:
@@ -1455,12 +1451,10 @@ def compute_text_embeddings(prompt, text_encoders, tokenizers):
     # Clear the memory here
     if not args.train_text_encoder and not train_dataset.custom_instance_prompts:
-        del tokenizers, text_encoders
         # Explicitly delete the objects as well, otherwise only the lists are deleted and the original references remain, preventing garbage collection
-        del text_encoder_one, text_encoder_two, text_encoder_three
-        gc.collect()
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
+        clear_objs_and_retain_memory(
+            objs=[tokenizers, text_encoders, text_encoder_one, text_encoder_two, text_encoder_three]
+        )
 
     # If custom instance prompts are NOT provided (i.e. the instance prompt is used for all images),
     # pack the statically computed variables appropriately here. This is so that we don't
@@ -1795,11 +1789,11 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
                     pipeline_args=pipeline_args,
                     epoch=epoch,
                 )
+                objs = []
                 if not args.train_text_encoder:
-                    del text_encoder_one, text_encoder_two, text_encoder_three
+                    objs.extend([text_encoder_one, text_encoder_two, text_encoder_three])
 
-                torch.cuda.empty_cache()
-                gc.collect()
+                clear_objs_and_retain_memory(objs=objs)
 
     # Save the lora layers
     accelerator.wait_for_everyone()
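The script changes above replace the repeated `del` / `gc.collect()` / `torch.cuda.empty_cache()` boilerplate with a single helper imported from `diffusers.training_utils`. Its exact implementation is not shown in this diff; the stand-in below only mirrors the pattern visible in the removed lines, so treat the name and details as assumptions.

```python
import gc
from typing import Any, List

import torch


def clear_objs_and_retain_memory_sketch(objs: List[Any]) -> None:
    """Hypothetical equivalent of diffusers.training_utils.clear_objs_and_retain_memory."""
    # Drop the references handed to us (callers also let their own bindings go out
    # of scope), then force a collection and release cached CUDA blocks.
    for obj in objs:
        del obj
    del objs
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```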

src/diffusers/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -255,8 +255,11 @@
         "BlipDiffusionPipeline",
         "CLIPImageProjection",
         "CogVideoXPipeline",
+        "CogVideoXVideoToVideoPipeline",
         "CycleDiffusionPipeline",
         "FluxControlNetPipeline",
+        "FluxImg2ImgPipeline",
+        "FluxInpaintPipeline",
         "FluxPipeline",
         "HunyuanDiTControlNetPipeline",
         "HunyuanDiTPAGPipeline",
@@ -700,8 +703,11 @@
         AuraFlowPipeline,
         CLIPImageProjection,
         CogVideoXPipeline,
+        CogVideoXVideoToVideoPipeline,
         CycleDiffusionPipeline,
         FluxControlNetPipeline,
+        FluxImg2ImgPipeline,
+        FluxInpaintPipeline,
         FluxPipeline,
         HunyuanDiTControlNetPipeline,
         HunyuanDiTPAGPipeline,

src/diffusers/image_processor.py

Lines changed: 1 addition & 1 deletion
@@ -569,7 +569,7 @@ def preprocess(
             channel = image.shape[1]
             # don't need any preprocess if the image is latents
-            if channel == self.vae_latent_channels:
+            if channel == self.config.vae_latent_channels:
                 return image
 
             height, width = self.get_default_height_width(image, height, width)

src/diffusers/loaders/lora_conversion_utils.py

Lines changed: 4 additions & 2 deletions
@@ -562,7 +562,8 @@ def handle_qkv(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
             new_key += ".attn.to_out.0"
         elif "processor.proj_lora2" in old_key:
             new_key += ".attn.to_add_out"
-        elif "processor.qkv_lora1" in old_key and "up" not in old_key:
+        # Handle text latents.
+        elif "processor.qkv_lora2" in old_key and "up" not in old_key:
             handle_qkv(
                 old_state_dict,
                 new_state_dict,
@@ -574,7 +575,8 @@ def handle_qkv(sds_sd, ait_sd, sds_key, ait_keys, dims=None):
                 ],
             )
             # continue
-        elif "processor.qkv_lora2" in old_key and "up" not in old_key:
+        # Handle image latents.
+        elif "processor.qkv_lora1" in old_key and "up" not in old_key:
             handle_qkv(
                 old_state_dict,
                 new_state_dict,

src/diffusers/models/attention.py

Lines changed: 20 additions & 2 deletions
@@ -1104,8 +1104,26 @@ def forward(
             accumulated_values[:, frame_start:frame_end] += hidden_states_chunk * weights
             num_times_accumulated[:, frame_start:frame_end] += weights
 
-        hidden_states = torch.where(
-            num_times_accumulated > 0, accumulated_values / num_times_accumulated, accumulated_values
+        # TODO(aryan): Maybe this could be done in a better way.
+        #
+        # Previously, this was:
+        # hidden_states = torch.where(
+        #     num_times_accumulated > 0, accumulated_values / num_times_accumulated, accumulated_values
+        # )
+        #
+        # The reasoning for the change here is `torch.where` became a bottleneck at some point when golfing memory
+        # spikes. It is particularly noticeable when the number of frames is high. My understanding is that this comes
+        # from tensors being copied - which is why we resort to splitting and concatenating here. I've not particularly
+        # looked into this deeply because other memory optimizations led to more pronounced reductions.
+        hidden_states = torch.cat(
+            [
+                torch.where(num_times_split > 0, accumulated_split / num_times_split, accumulated_split)
+                for accumulated_split, num_times_split in zip(
+                    accumulated_values.split(self.context_length, dim=1),
+                    num_times_accumulated.split(self.context_length, dim=1),
+                )
+            ],
+            dim=1,
         ).to(dtype)
 
         # 3. Feed-forward
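The new inline comment frames this change as a memory trade-off rather than a numerical one. The standalone sketch below reproduces the pattern on dummy tensors to show that the chunked version matches the one-shot `torch.where` while only materializing `context_length`-sized temporaries; the shapes and chunk size are arbitrary assumptions.

```python
import torch

# Dummy stand-ins for the accumulators in the free-noise attention block.
batch, frames, channels, context_length = 1, 64, 8, 16
accumulated_values = torch.randn(batch, frames, channels)
num_times_accumulated = torch.randint(0, 3, (batch, frames, 1)).float()

# One-shot version: torch.where over the full tensors allocates full-size
# temporaries (the mask and the divided tensor), which spikes memory for long videos.
reference = torch.where(
    num_times_accumulated > 0, accumulated_values / num_times_accumulated, accumulated_values
)

# Chunked version: split along the frame dimension, normalize each chunk, and
# concatenate, so every intermediate stays context_length frames wide.
chunked = torch.cat(
    [
        torch.where(counts > 0, values / counts, values)
        for values, counts in zip(
            accumulated_values.split(context_length, dim=1),
            num_times_accumulated.split(context_length, dim=1),
        )
    ],
    dim=1,
)

assert torch.allclose(reference, chunked)
```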
