DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject.
The train_dreambooth.py script shows how to implement the training procedure and adapt it for Stable Diffusion.
Before running the scripts, make sure to install the library's training dependencies:
Important
To make sure you can successfully run the latest versions of the example scripts, we highly recommend installing from source and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:
git clone https://github.com/mindspore-lab/mindone
cd mindone
pip install -e ".[training]"
Now let's get our dataset. For this example we will use some dog images: https://huggingface.co/datasets/diffusers/dog-example.
Let's first download it locally:
from huggingface_hub import snapshot_download
local_dir = "./dog"
snapshot_download(
"diffusers/dog-example",
local_dir=local_dir, repo_type="dataset",
ignore_patterns=".gitattributes",
)
And launch the training using:
Note: Change the resolution to 768 if you are using the stable-diffusion-2 768x768 model.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="path-to-save-model"
python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=400
Prior-preservation is used to avoid overfitting and language-drift. Refer to the paper to learn more about it. For prior-preservation we first generate images using the model with a class prompt and then use those during training along with our data.
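The way the two loss terms are combined can be sketched as follows. This is only an illustration of the weighting controlled by --prior_loss_weight, not the training script's actual code; the function names mse and dreambooth_loss and the toy numbers are assumptions made for this example.

```python
def mse(pred, target):
    """Mean squared error between two equally sized lists of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target, class_pred, class_target,
                    prior_loss_weight=1.0):
    """Instance loss plus the weighted prior-preservation loss (hypothetical helper)."""
    instance_loss = mse(instance_pred, instance_target)
    prior_loss = mse(class_pred, class_target)
    return instance_loss + prior_loss_weight * prior_loss


# Toy values standing in for predicted and target noise.
loss = dreambooth_loss([1.0, 2.0], [1.0, 0.0], [0.5, 0.5], [0.0, 0.0],
                       prior_loss_weight=1.0)
print(loss)
```

Setting prior_loss_weight to 0.0 in this sketch reduces the result to the instance loss alone, which corresponds to training without prior preservation.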
According to the paper, it's recommended to generate num_epochs * num_samples images for prior-preservation. 200-300 works well for most cases. The num_class_images flag sets the number of images to generate with the class prompt. You can place existing images in class_data_dir, and the training script will generate any additional images so that num_class_images are present in class_data_dir during training time.
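As a rough sketch of the top-up behaviour described above, the number of images still to be generated is the difference between num_class_images and what is already in class_data_dir. The helper name images_to_generate is hypothetical, and counting every regular file as an image is a simplifying assumption; the training script may count differently.

```python
from pathlib import Path


def images_to_generate(class_data_dir, num_class_images):
    """Return how many class images are still missing from class_data_dir."""
    class_dir = Path(class_data_dir)
    if class_dir.is_dir():
        # Assumption: every regular file in the directory is a class image.
        existing = sum(1 for path in class_dir.iterdir() if path.is_file())
    else:
        existing = 0
    return max(num_class_images - existing, 0)
```

For example, images_to_generate("path-to-class-images", 200) returns 200 when the directory does not exist yet and 0 once it already holds 200 or more files.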
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks dog" \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
The script also allows you to fine-tune the text_encoder along with the unet. It has been observed experimentally that fine-tuning the text_encoder gives much better results, especially on faces.
Pass the --train_text_encoder argument to the script to enable training text_encoder.
Note: Training text encoder requires more memory.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="path-to-save-model"
python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks dog" \
--class_prompt="a photo of dog" \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=2e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800 \
--train_text_encoder
Once you have trained a model using the above command, you can run inference using the StableDiffusionPipeline. Make sure to include the identifier (e.g. sks in the above example) in your prompt.
import mindspore as ms
from mindone.diffusers import StableDiffusionPipeline
model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, mindspore_dtype=ms.float16)
prompt = "A photo of sks dog in a bucket"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5)[0][0]
image.save("dog-bucket.png")
You can also perform inference from one of the checkpoints saved during the training process, if you used the --checkpointing_steps argument. Please refer to the documentation to see how to do it.
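To pick a saved checkpoint programmatically, a small helper like the one below can help. It is a sketch that assumes checkpoints are written as subdirectories named checkpoint-<step> inside the output directory; that naming and the helper name latest_checkpoint are assumptions, so check your output directory before relying on it.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")  # assumed naming scheme
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Comparing the step numbers as integers avoids the pitfall of lexicographic sorting, where checkpoint-90 would sort after checkpoint-500.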
Low-Rank Adaptation of Large Language Models (LoRA) was first introduced by Microsoft in LoRA: Low-Rank Adaptation of Large Language Models by Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen.
In a nutshell, LoRA allows adapting pretrained models by adding pairs of rank-decomposition matrices to existing weights and training only those newly added weights. This has a couple of advantages:
- Previous pretrained weights are kept frozen so that the model is not prone to catastrophic forgetting
- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable.
- LoRA attention layers allow controlling the extent to which the model is adapted towards new training images via a scale parameter.
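The points above can be illustrated with a small, framework-free sketch. It assumes the adapted weight is formed as W + scale * (B @ A) with B of shape (d_out, rank) and A of shape (rank, d_in); the helper names, the 768x768 layer size, and the rank of 4 are chosen only for illustration and are not taken from the training script.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying weight."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


full, lora = lora_param_counts(d_out=768, d_in=768, rank=4)
print(full, lora)  # 589824 6144

weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (2x2)
lora_b = [[1.0], [2.0]]            # trainable, shape (2, rank=1)
lora_a = [[0.5, 0.5]]              # trainable, shape (rank=1, 2)
print(apply_lora(weight, lora_b, lora_a, scale=0.0))  # scale 0 keeps the original weight
print(apply_lora(weight, lora_b, lora_a, scale=1.0))  # scale 1 applies the full update
```

In this toy setting the low-rank pair holds roughly 1% of the parameters of the full matrix, and the scale value interpolates between the original and the fully adapted weight.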
cloneofsimo was the first to try out LoRA training for Stable Diffusion in the popular lora GitHub repository.
Let's get started with a simple example. We will re-use the dog example of the previous section.
First, you need to set up your DreamBooth training example as explained in the installation section.
Next, let's download the dog dataset. Download images from here and save them in a directory. Make sure to set INSTANCE_DIR to the name of your directory further below. This will be our training data.
Now, you can launch the training. Here we will use Stable Diffusion 1-5.
Note: Change the resolution to 768 if you are using the stable-diffusion-2 768x768 model.
Note: It is quite useful to monitor the training progress by regularly generating sample images during training. wandb is a nice solution for viewing generated images during training. All you need to do is run pip install wandb before training and pass --report_to="wandb" to automatically log images.
Now we can start training!
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="path-to-save-model"
python train_dreambooth_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--checkpointing_steps=100 \
--learning_rate=1e-4 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=50 \
--seed="0"
Note: When using LoRA we can use a much higher learning rate compared to vanilla DreamBooth. Here we use 1e-4 instead of the usual 2e-6.
The final LoRA embedding weights have been uploaded to patrickvonplaten/lora_dreambooth_dog_example. Note: The final weights are only 3 MB in size, which is orders of magnitude smaller than the original model.
The training results are summarized here.
You can use the Step slider to see how the model learned the features of our subject while the model trained.
Optionally, we can also train additional LoRA layers for the text encoder. Specify the --train_text_encoder argument above for that. If you're interested to know more about how we enable this support, check out this PR.
With the default hyperparameters from the above, the training seems to go in a positive direction. Check out this panel. The trained LoRA layers are available here.
After training, LoRA weights can be loaded very easily into the original pipeline. First, you need to load the original pipeline:
from mindone.diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("base-model-name")
Next, we can load the adapter layers into the pipeline with the load_lora_weights function.
pipe.load_lora_weights("path-to-the-lora-checkpoint")
Finally, we can run the model in inference.
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=50)[0][0]
If you are loading the LoRA parameters from the Hub and if the Hub repository has a base_model tag (such as this), then you can do:
import mindspore as ms
from huggingface_hub.repocard import RepoCard
from mindone.diffusers import StableDiffusionPipeline
lora_model_id = "patrickvonplaten/lora_dreambooth_dog_example"
# Read the base model name from the repository's model card metadata
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16)
...
If you used --train_text_encoder during training, then use pipe.load_lora_weights() to load the LoRA weights. For example:
from huggingface_hub.repocard import RepoCard
from mindone.diffusers import StableDiffusionPipeline
import mindspore as ms
lora_model_id = "sayakpaul/dreambooth-text-encoder-test"
card = RepoCard.load(lora_model_id)
base_model_id = card.data.to_dict()["base_model"]
pipe = StableDiffusionPipeline.from_pretrained(base_model_id, mindspore_dtype=ms.float16)
pipe.load_lora_weights(lora_model_id)
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25)[0][0]
Note that the use of LoraLoaderMixin.load_lora_weights is preferred to UNet2DConditionLoadersMixin.load_attn_procs for loading LoRA parameters. This is because LoraLoaderMixin.load_lora_weights can handle the following situations:
- LoRA parameters that don't have separate identifiers for the UNet and the text encoder (such as "patrickvonplaten/lora_dreambooth_dog_example"). So, you can just do:
pipe.load_lora_weights(lora_model_path)
- LoRA parameters that have separate identifiers for the UNet and the text encoder, such as "sayakpaul/dreambooth".
You can use the LoRA and full DreamBooth scripts to train the text-to-image IF model and the stage II upscaler IF model.
Note that IF has a predicted variance, and our finetuning scripts only train the model's predicted error, so for finetuned IF models we switch to a fixed variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you must also update the pipeline's scheduler config.
from mindone.diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe.load_lora_weights("<lora weights path>")
# Update scheduler config to fixed variance schedule
pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small")
Additionally, a few alternative CLI flags are needed for IF.
--resolution=64: IF is a pixel space diffusion model. In order to operate on un-compressed pixels, the input images are of a much smaller resolution.
--pre_compute_text_embeddings: IF uses T5 for its text encoder. In order to save memory, we pre-compute all text embeddings and then de-allocate
T5.
--tokenizer_max_length=77: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number.
--text_encoder_use_attention_mask: T5 passes the attention mask to the text encoder.
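To make the role of --tokenizer_max_length and the attention mask more concrete, here is a plain-Python sketch of fixed-length padding. It does not call the real T5 tokenizer; the helper name pad_and_mask, the token ids, and the pad id of 0 are assumptions used only to show the idea.

```python
def pad_and_mask(token_ids, max_length=77, pad_id=0):
    """Truncate or pad token_ids to max_length and build the attention mask."""
    kept = list(token_ids[:max_length])
    num_padding = max_length - len(kept)
    padded = kept + [pad_id] * num_padding
    # 1 marks a real token, 0 marks padding that the encoder should ignore.
    mask = [1] * len(kept) + [0] * num_padding
    return padded, mask


ids, mask = pad_and_mask([101, 42, 7], max_length=8)
print(ids)   # [101, 42, 7, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

With a mask like this, the text encoder can tell which positions hold prompt tokens and which are padding added to reach the fixed length.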
We find LoRA to be sufficient for finetuning the stage I model as the low resolution of the model makes representing finegrained detail hard regardless.
For common and/or visually simple object concepts, you can get away with not finetuning the upscaler. Just be sure to adjust the prompt passed to the upscaler to remove the new token from the instance prompt, i.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt.
For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than LoRA finetuning stage II.
For finegrained detail like faces, we find that lower learning rates along with larger batch sizes work best.
For stage II, we find that lower learning rates are also needed.
We found experimentally that the DDPM scheduler with the default larger number of denoising steps sometimes works better than the DPM Solver scheduler used in the training scripts.
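The prompt adjustment mentioned above for an upscaler that was not finetuned (dropping the new token before stage II) can be sketched as a small helper. The function name stage_two_prompt is hypothetical, and the sketch assumes the identifier appears as a standalone whitespace-separated word.

```python
def stage_two_prompt(stage_one_prompt, identifier="sks"):
    """Remove the identifier token from a prompt and normalize whitespace."""
    kept = [word for word in stage_one_prompt.split() if word != identifier]
    return " ".join(kept)


print(stage_two_prompt("a sks dog"))  # a dog
```

Splitting on whitespace rather than using a plain string replace avoids leaving a double space behind and avoids touching words that merely contain the identifier.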
Stage II validation requires images to upscale, so we can download a downsized version of the training set:
from huggingface_hub import snapshot_download
local_dir = "./dog_downsized"
snapshot_download(
"diffusers/dog-example-downsized",
local_dir=local_dir,
repo_type="dataset",
ignore_patterns=".gitattributes",
)
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_lora"
python train_dreambooth_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=64 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--scale_lr \
--max_train_steps=1200 \
--validation_prompt="a sks dog" \
--validation_epochs=25 \
--checkpointing_steps=100 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask
--validation_images: These images are upscaled during validation steps.
--class_labels_conditioning=timesteps: Pass additional conditioning to the UNet needed for stage II.
--learning_rate=1e-6: Lower learning rate than stage I.
--resolution=256: The upscaler expects higher resolution inputs
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.jpg dog_downsized/image_2.jpg dog_downsized/image_3.jpg dog_downsized/image_4.jpg"
python train_dreambooth_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=256 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-6 \
--max_train_steps=2000 \
--validation_prompt="a sks dog" \
--validation_epochs=100 \
--checkpointing_steps=500 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask \
--validation_images $VALIDATION_IMAGES \
--class_labels_conditioning=timesteps
--skip_save_text_encoder: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline with a T5 loaded from the original model.
use_8bit_adam: Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam.
--learning_rate=1e-7: For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade. Note that it is
likely the learning rate can be increased with larger batch sizes.
--validation_scheduler: Set a particular scheduler via a string. We found that it is better to use the DDPMScheduler for validation when training DeepFloyd IF.
export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_if"
python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of sks dog" \
--resolution=64 \
--train_batch_size=4 \
--gradient_accumulation_steps=1 \
--learning_rate=1e-7 \
--max_train_steps=150 \
--validation_prompt "a photo of sks dog" \
--validation_steps 25 \
--text_encoder_use_attention_mask \
--tokenizer_max_length 77 \
--pre_compute_text_embeddings \
--validation_scheduler DDPMScheduler
--learning_rate=5e-6: With a smaller effective batch size of 4, we found that we required learning rates as low as 1e-8.
--resolution=256: The upscaler expects higher resolution inputs
--train_batch_size=2 and --gradient_accumulation_steps=6: We found that full training of stage II particularly with
faces required large effective batch sizes.
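The "effective batch size" referred to in these notes is the number of samples that contribute to each optimizer step. A minimal sketch of the arithmetic follows; the helper name effective_batch_size and the num_devices parameter are assumptions added for illustration.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_devices


print(effective_batch_size(2, 6))  # 12, as with --train_batch_size=2 --gradient_accumulation_steps=6
print(effective_batch_size(4, 1))  # 4, as with --train_batch_size=4 --gradient_accumulation_steps=1
```

Gradient accumulation increases the effective batch size without increasing the per-step memory footprint, at the cost of more forward and backward passes per optimizer update.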
export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth_dog_upscale"
export VALIDATION_IMAGES="dog_downsized/image_1.jpg dog_downsized/image_2.jpg dog_downsized/image_3.jpg dog_downsized/image_4.jpg"
python train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a sks dog" \
--resolution=256 \
--train_batch_size=2 \
--gradient_accumulation_steps=6 \
--learning_rate=5e-6 \
--max_train_steps=2000 \
--validation_prompt="a sks dog" \
--validation_steps=150 \
--checkpointing_steps=500 \
--pre_compute_text_embeddings \
--tokenizer_max_length=77 \
--text_encoder_use_attention_mask \
--validation_images $VALIDATION_IMAGES \
--class_labels_conditioning timesteps \
--validation_scheduler DDPMScheduler
We support fine-tuning of the UNet shipped in Stable Diffusion XL with DreamBooth and LoRA via the train_dreambooth_lora_sdxl.py script. Please refer to the docs here.