[wip] feat: support lora in qwen image and training script #12056
Conversation
```python
# Qwen expects a `num_frames` dimension too.
if pixel_values.ndim == 4:
    pixel_values = pixel_values.unsqueeze(2)
```
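The snippet above can be sketched end-to-end: a 4D image batch `(B, C, H, W)` gets a singleton `num_frames` axis inserted at dim 2, giving the 5D `(B, C, F, H, W)` layout the Qwen VAE expects. Shapes here are illustrative, not taken from the script.

```python
import torch

# Illustrative batch of images (B, C, H, W).
pixel_values = torch.randn(2, 3, 64, 64)

# Insert a singleton num_frames axis at dim 2 -> (B, C, 1, H, W).
if pixel_values.ndim == 4:
    pixel_values = pixel_values.unsqueeze(2)
```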
🧠
nice
Let's refactor the AutoencoderKLQwenImage methods to not use frame dimension. I think that code was copy-pasted from Wan, but we don't need frame dimension. cc @naykun
I can wait for your refactor PR to come through. Or do you prefer this PR? 👀
Yeah, for now the frame dimension is not needed.
Feel free to take it up in this PR, as I am logging off for a few hours
Ah, I was taking a stab at this, but we also need to consider the design of QwenImageCausalConv3d, which inherits from nn.Conv3d. So it's a bit more involved than I had originally thought, and I would prefer to do it in a separate PR to not block this one.
Okay, sounds good, let's do it in a separate PR.
```python
    (1, args.resolution // vae_scale_factor // 2, args.resolution // vae_scale_factor // 2)
] * bsz
# transpose the dimensions
noisy_model_input = noisy_model_input.permute(0, 2, 1, 3, 4)
```
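The permute above can be sketched with illustrative shapes: latents come out of the VAE as `(B, C, F, H, W)`, and swapping dims 1 and 2 yields `(B, F, C, H, W)`. The concrete sizes below are made up for the example.

```python
import torch

# Hypothetical latents: (batch, channels, frames, height, width).
noisy_model_input = torch.randn(1, 16, 1, 32, 32)

# Swap the channel and frame axes -> (batch, frames, channels, height, width).
noisy_model_input = noisy_model_input.permute(0, 2, 1, 3, 4)
```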
🧠
```python
parser.add_argument(
    "--guidance_scale",
    type=float,
    default=0.0,
    help="Qwen image is a guidance distilled model",
)
```
Took this value from the official doc example. Correct?
It's not a guidance-distilled model; the guidance is actually None no matter what guidance_scale is set to. Only true_cfg_scale works.
Should we supply any guidance value at all during training? My reference is:
```python
guidance=guidance,
```
it seems this:

```python
if self.transformer.config.guidance_embeds:
```

is false by default, so guidance is never actually used, as @haofanwang mentioned; we can probably remove it altogether?
also this seems relevant: https://github.com/huggingface/diffusers/pull/12057/files#r2250725231
guidance gone in the latest commit.
cb1b6b4
@haofanwang @naykun Let us know if we should remove the guidance embed config from the transformer implementation if it's not used: #12057 (comment)
Also, instead of calling it true_cfg_scale, we should just remove it and use guidance_scale to mean the actual CFG scale. For guidance-distilled models like Flux, guidance_scale means the embedded scale and true_cfg_scale the true scale. But for most normally released models, we default to naming the CFG parameter guidance_scale, not true_cfg_scale.
+1 to this. Totally agree.
This time, we are releasing the raw model without guidance distillation. However, we hope a distilled version will become available soon, either from the community or from us. To ensure future compatibility, we may want to keep this unchanged?
Oh, if it's planned to release guidance-distilled (and I think it will be highly expected in community too, so someone might take the initiative), then I think it's okay to keep as-is. Thanks for letting us know!
For the purpose of this PR, I have just removed the option of configuring guidance_scale from the training script. I think that should cover it?
```python
with offload_models(text_encoding_pipeline, device=accelerator.device, offload=args.offload):
    instance_prompt_embeds, instance_prompt_embeds_mask, _ = compute_text_embeddings(
        args.instance_prompt, text_encoding_pipeline
    )
```
Use of the offload_models() utility to easily offload and onload modules we don't always want to be present on the accelerator device.
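A hypothetical sketch of what such an `offload_models()` helper can look like (the actual utility lives in the training script; names and signature here are assumptions): move the module to the accelerator device on entry and back to CPU on exit when offloading is enabled.

```python
import contextlib

import torch


@contextlib.contextmanager
def offload_models(module, device, offload=True):
    # Onload: move the module to the accelerator device for the duration
    # of the `with` block.
    module.to(device)
    try:
        yield module
    finally:
        # Offload: send it back to CPU so it doesn't occupy device memory.
        if offload:
            module.to("cpu")


# Usage sketch with a tiny stand-in module.
model = torch.nn.Linear(4, 4)
with offload_models(model, torch.device("cpu")):
    pass
```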
```python
pixel_values = batch["pixel_values"].to(dtype=vae.dtype)
model_input = vae.encode(pixel_values).latent_dist.sample()
model_input = (model_input - latents_mean) * latents_std
```
Reversal of `src/diffusers/pipelines/qwenimage/pipeline_qwenimage.py`, lines 779 to 782 at 8e53cd9:

```python
latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
    latents.device, latents.dtype
)
latents = latents / latents_std + latents_mean
```
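The two sides form an exact round trip, which can be checked with made-up per-channel stats standing in for the VAE config's `latents_mean` / `latents_std` (`z_dim` and all values are illustrative; note `latents_std` is already the reciprocal, matching the pipeline code).

```python
import torch

# Made-up per-channel statistics shaped like the VAE config's buffers.
z_dim = 4
latents_mean = torch.randn(1, z_dim, 1, 1, 1)
latents_std = 1.0 / (torch.rand(1, z_dim, 1, 1, 1) + 0.5)  # reciprocal form

latents = torch.randn(2, z_dim, 1, 8, 8)
normalized = (latents - latents_mean) * latents_std   # training-script side
recovered = normalized / latents_std + latents_mean   # pipeline side
```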
```python
vae = AutoencoderKLQwenImage.from_pretrained(
    args.pretrained_model_name_or_path,
    subfolder="vae",
    revision=args.revision,
    variant=args.variant,
)
```
Keeping it in FP32 for numerical stability. Haven't yet verified if using BF16 is alright.
```python
parser.add_argument(
    "--weighting_scheme",
    type=str,
    default="none",
    choices=["sigma_sqrt", "logit_normal", "mode", "cosmap", "none"],
    help='We default to the "none" weighting scheme for uniform sampling and uniform loss',
)
```
This is a reasonable default. However, we know that this can impact training significantly. For example, SD3 and LTX use logit_normal, whereas for Flux and SANA, none works.
I think we should check this with the Qwen Image authors.
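For intuition, the difference between the two schemes mentioned can be sketched with a hypothetical helper (diffusers ships its own `compute_density_for_timestep_sampling`; this is not that function): "logit_normal" concentrates sampled timestep fractions around the middle of [0, 1], while "none" samples uniformly.

```python
import torch


def sample_u(weighting_scheme, batch_size, logit_mean=0.0, logit_std=1.0):
    # Hypothetical sketch: draw timestep fractions u in [0, 1].
    if weighting_scheme == "logit_normal":
        # Normal draw in logit space, squashed back through sigmoid.
        u = torch.normal(logit_mean, logit_std, size=(batch_size,))
        return torch.sigmoid(u)
    # "none": plain uniform sampling.
    return torch.rand(batch_size)


u = sample_u("logit_normal", 8)
```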
thanks @sayakpaul 🙌🏻
left one comment re: guidance; other than that, looking good!
Thanks, LGTM!
```python
if is_wandb_available():
    import wandb
```
import trackio as wandb 😛 We should do this sometime soon :)
Full support!
…ce#12056)

* feat: support lora in qwen image and training script
* up
* up
* up
* up
* up
* up
* add lora tests
* fix
* add tests
* fix
* reviewer feedback
* up[
* Apply suggestions from code review

Co-authored-by: Aryan <[email protected]>

---------

Co-authored-by: Aryan <[email protected]>

What does this PR do?
Still testing. Needs a custom token to test (refer to Slack). We support quantization through the `bnb_quantization_config_path` CLI argument.

test command
TODOs:
I prefer to tackle the tests in a separate PR. Some tests are already in `tests/` on the `qwen-image` branch, I think.