
Conversation

@sayakpaul (Member) commented Aug 4, 2025

What does this PR do?

Still testing. Needs a custom token to test (refer to Slack). We support quantization through the bnb_quantization_config_path CLI argument.
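
For illustration, a minimal sketch of what such a quantization config file might contain; the file name and keys here are hypothetical, assuming the script maps the JSON keys onto BitsAndBytesConfig arguments:

import json

# Hypothetical config for --bnb_quantization_config_path; keys are assumed to
# map onto BitsAndBytesConfig arguments (not confirmed against the script here).
bnb_config = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}
with open("bnb_config.json", "w") as f:
    json.dump(bnb_config, f, indent=2)
# then pass --bnb_quantization_config_path=bnb_config.json to the launch command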

Test command:
export MODEL_NAME="Qwen/Qwen-Image"
export INSTANCE_DIR="linoyts/3d_icon"
export OUTPUT_DIR="trained-qwen-image-lora"

accelerate launch train_dreambooth_lora_qwen_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --dataset_name=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --instance_prompt="3dicon" \
  --caption_column="prompt" \
  --validation_prompt="a 3dicon, a llama eating ramen" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --use_8bit_adam \
  --rank=8 \
  --learning_rate=2e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --max_train_steps=1000 \
  --cache_latents \
  --gradient_checkpointing \
  --validation_epochs=25 \
  --seed="0"

TODOs:

  • Button up the README.
  • Add tests for the pipeline, trainer, and LoRA (for tests, we need to be able to work with small model sizes).

I prefer to tackle the tests in a separate PR. Some tests are already in the tests/qwen-image branch, I think.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +865 to +867
# Qwen expects a `num_frames` dimension too.
if pixel_values.ndim == 4:
    pixel_values = pixel_values.unsqueeze(2)
Member Author:
🧠

Collaborator:
nice

Contributor:
Let's refactor the AutoencoderKLQwenImage methods to not use the frame dimension. I think that code was copy-pasted from Wan, but we don't need the frame dimension here. cc @naykun

Member Author:
I can wait for your refactor PR to come through. Or do you prefer this PR? 👀

Contributor:
Yeah, for now the frame dimension is not needed.

Contributor:
Feel free to take it up in this PR, as I am logging off for a few hours.

Member Author:
Ah, I was taking a stab at this, but we also need to consider the design of QwenImageCausalConv3d, which inherits from nn.Conv3d. So it's a more involved PR than I had originally thought, and I would prefer to do that in a separate PR to not block this one.

Contributor:
Okay, sounds good, let's do it in a separate PR.
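
As an aside, a minimal runnable sketch of what the unsqueeze quoted above does to the batch shape (sizes illustrative):

import torch

pixel_values = torch.randn(1, 3, 1024, 1024)  # (B, C, H, W) image batch
if pixel_values.ndim == 4:
    pixel_values = pixel_values.unsqueeze(2)  # insert a singleton num_frames axis
print(pixel_values.shape)  # torch.Size([1, 3, 1, 1024, 1024]) -> (B, C, F=1, H, W)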

(1, args.resolution // vae_scale_factor // 2, args.resolution // vae_scale_factor // 2)
] * bsz
# swap dims 1 and 2 (the channel and frame axes)
noisy_model_input = noisy_model_input.permute(0, 2, 1, 3, 4)
Member Author:
🧠
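
A small runnable sketch of the permute above, with illustrative latent sizes (16 channels, one frame):

import torch

noisy_model_input = torch.randn(1, 16, 1, 64, 64)  # (B, C, F, H, W), sizes illustrative
noisy_model_input = noisy_model_input.permute(0, 2, 1, 3, 4)  # swap dims 1 and 2
print(noisy_model_input.shape)  # torch.Size([1, 1, 16, 64, 64])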

Comment on lines 472 to 476
parser.add_argument(
    "--guidance_scale",
    type=float,
    default=0.0,
    help="Qwen image is a guidance distilled model",
Member Author:
Took this value from the official doc example. Correct?

Contributor:
It's not a guidance-distilled model; the guidance is actually None no matter what guidance_scale is set to. Only true_cfg_scale works.

Member Author:
Should we supply any guidance value at all during training? My reference is:

Collaborator:
it seems this:

if self.transformer.config.guidance_embeds:

is false by default, so guidance is never actually used, as @haofanwang mentioned. So we can probably remove it altogether?
also, this seems relevant: https://github.com/huggingface/diffusers/pull/12057/files#r2250725231
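
For context, a runnable sketch of the conditional pattern under discussion; guidance_embeds, guidance_scale, and bsz are stand-ins for the real transformer.config and args values:

import torch

guidance_embeds = False  # what transformer.config.guidance_embeds reports for Qwen-Image
guidance_scale, bsz = 1.0, 4
if guidance_embeds:
    guidance = torch.full((bsz,), guidance_scale, dtype=torch.float32)
else:
    guidance = None  # this branch is taken, so the flag never has an effect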

Member Author:
guidance is gone in the latest commit:
cb1b6b4

Contributor:
@haofanwang @naykun Let us know if we should remove the guidance embed config from the transformer implementation if it's not used: #12057 (comment)

Also, instead of calling it true_cfg_scale, we should just remove it and use guidance_scale to mean the actual CFG scale. For guidance-distilled models like Flux, guidance_scale means the embedded guidance scale and true_cfg_scale the true CFG scale. But for most normally released models, we default to naming the CFG parameter guidance_scale, not true_cfg_scale.

Member Author:
+1 to this. Totally agree.

Contributor:
This time, we are releasing the raw model without guidance distillation. However, we hope a distilled version will become available soon, either from the community or from us. To ensure future compatibility, we may want to keep this unchanged?

Contributor:
Oh, if a guidance-distilled release is planned (and I think it will be highly expected in the community too, so someone might take the initiative), then I think it's okay to keep as-is. Thanks for letting us know!

Member Author:
For the purposes of this PR, I have just removed the option of configuring guidance_scale from the training script. I think that should do the trick?

Comment on lines 1276 to 1279
with offload_models(text_encoding_pipeline, device=accelerator.device, offload=args.offload):
    instance_prompt_embeds, instance_prompt_embeds_mask, _ = compute_text_embeddings(
        args.instance_prompt, text_encoding_pipeline
    )
Member Author:
This uses the offload_models() utility to easily offload and onload modules that we don't always want present on the accelerator device.
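
For readers unfamiliar with it, a minimal sketch of what such a context manager does; the actual offload_models() utility may differ in signature and details:

import contextlib
import torch

@contextlib.contextmanager
def offload_models(*modules, device, offload=True):
    # Sketch only: move modules onto the accelerator for the duration of the
    # block, then park them back on CPU to free memory.
    if offload:
        for m in modules:
            m.to(device)
    try:
        yield
    finally:
        if offload:
            for m in modules:
                m.to("cpu")
            torch.cuda.empty_cache()  # no-op if CUDA was never initialized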

pixel_values = batch["pixel_values"].to(dtype=vae.dtype)
model_input = vae.encode(pixel_values).latent_dist.sample()

model_input = (model_input - latents_mean) * latents_std
Member Author (@sayakpaul, Aug 4, 2025):

Reversal of

latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
    latents.device, latents.dtype
)
latents = latents / latents_std + latents_mean
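
A quick runnable check that the two transforms are inverses (shapes and values illustrative):

import torch

latents_mean = torch.randn(1, 16, 1, 1, 1)
latents_std = 1.0 / (torch.rand(1, 16, 1, 1, 1) + 0.5)  # reciprocal, as in both snippets

x = torch.randn(2, 16, 1, 64, 64)
z = (x - latents_mean) * latents_std  # training script: normalize
x_rec = z / latents_std + latents_mean  # pipeline: denormalize
assert torch.allclose(x, x_rec, atol=1e-5)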

Comment on lines +1012 to +1017
vae = AutoencoderKLQwenImage.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="vae",
    revision=args.revision,
    variant=args.variant,
)
Member Author:
Keeping it in FP32 for numerical stability. Haven't yet verified if using BF16 is alright.
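
The pattern this implies, sketched with a Linear standing in for the VAE (runnable, purely illustrative):

import torch

vae_stub = torch.nn.Linear(8, 8).to(torch.float32)  # stands in for the FP32 VAE
pixels = torch.randn(2, 8).to(vae_stub.weight.dtype)  # cast inputs to the VAE dtype
latents = vae_stub(pixels)  # encode in FP32 for numerical stability
latents = latents.to(torch.bfloat16)  # then drop back to the bf16 training dtype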

Comment on lines +511 to +517
parser.add_argument(
    "--weighting_scheme",
    type=str,
    default="none",
    choices=["sigma_sqrt", "logit_normal", "mode", "cosmap", "none"],
    help=('We default to the "none" weighting scheme for uniform sampling and uniform loss'),
)
Member Author:
This is a reasonable default. However, we know that this can impact training significantly. For example, SD3 and LTX use logit_normal, whereas for Flux and SANA, "none" works.

I think we should check this with the Qwen Image authors.
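
For reference, a sketch of how this flag is typically wired in the sibling SD3/Flux scripts via diffusers.training_utils; with "none", timestep sampling is uniform and the loss weight is 1:

import torch
from diffusers.training_utils import (
    compute_density_for_timestep_sampling,
    compute_loss_weighting_for_sd3,
)

# Pick timestep-sampling density according to the scheme; "none" is uniform.
u = compute_density_for_timestep_sampling(
    weighting_scheme="none", batch_size=4, logit_mean=0.0, logit_std=1.0, mode_scale=1.29
)
sigmas = u.view(-1, 1, 1, 1)  # broadcast shape is illustrative
weighting = compute_loss_weighting_for_sd3(weighting_scheme="none", sigmas=sigmas)  # all ones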

@sayakpaul (Member Author) commented Aug 4, 2025

Tested with the following:

export MODEL_NAME="Qwen/Qwen-Image"
export INSTANCE_DIR="linoyts/3d_icon"
export OUTPUT_DIR="trained-qwen-image-lora"

accelerate launch train_dreambooth_lora_qwen_image.py \
  --pretrained_model_name_or_path $MODEL_NAME \
  --dataset_name             $INSTANCE_DIR \
  --output_dir               $OUTPUT_DIR \
  --mixed_precision          bf16 \
  --instance_prompt          "3dicon" \
  --caption_column           prompt \
  --resolution               1024 \
  --train_batch_size         1 \
  --gradient_accumulation_steps 4 \
  --use_8bit_adam \
  --rank                     8 \
  --learning_rate            2e-4 \
  --guidance_scale           1.0 \
  --report_to                wandb \
  --lr_scheduler             constant \
  --lr_warmup_steps          100 \
  --max_train_steps          1000 \
  --cache_latents \
  --gradient_checkpointing \
  --validation_epochs        25 \
  --seed                     0

Inference:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("trained-qwen-image-lora")

image = pipe(
    "a 3dicon, a llama with a signboard saying 'Qwen is awesome'", guidance_scale=1.0, num_inference_steps=50
).images[0]
image.save("llama.png")

@linoytsaban (Collaborator) left a comment:

thanks @sayakpaul 🙌🏻
Left one comment re: guidance; other than that, looking good!

@sayakpaul sayakpaul marked this pull request as ready for review August 4, 2025 11:12
@a-r-r-o-w (Contributor) left a comment:

Thanks, LGTM!



if is_wandb_available():
    import wandb
Contributor:
import trackio as wandb 😛 We should do this sometime soon :)
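
The idea, sketched; this assumes trackio is installed and that its wandb-compatible surface covers init/log/finish (the project name is illustrative):

import trackio as wandb  # drop-in alias; the rest of the script stays unchanged

wandb.init(project="qwen-image-lora")
wandb.log({"train/loss": 0.123})
wandb.finish()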

Member Author:
Full support!


@sayakpaul sayakpaul merged commit 9c1d4e3 into main Aug 5, 2025
31 of 32 checks passed
@sayakpaul sayakpaul deleted the qwen-image-training branch August 5, 2025 01:36
Beinsezii pushed a commit to Beinsezii/diffusers that referenced this pull request Aug 7, 2025
…ce#12056)

* feat: support lora in qwen image and training script

* up

* up

* up

* up

* up

* up

* add lora tests

* fix

* add tests

* fix

* reviewer feedback

* up[

* Apply suggestions from code review

Co-authored-by: Aryan <[email protected]>

---------

Co-authored-by: Aryan <[email protected]>
Beinsezii pushed a commit to Beinsezii/diffusers that referenced this pull request Aug 7, 2025
…ce#12056)

* feat: support lora in qwen image and training script

* up

* up

* up

* up

* up

* up

* add lora tests

* fix

* add tests

* fix

* reviewer feedback

* up[

* Apply suggestions from code review

Co-authored-by: Aryan <[email protected]>

---------

Co-authored-by: Aryan <[email protected]>