[modular diffusers] Wan I2V/FLF2V #11997
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
looking great! thanks @a-r-r-o-w
I left some comments, let me know what you think! From here I think it's very easy to support Wan 2.2!
```python
    return components, state


class WanVaeEncoderStep(PipelineBlock):
```
I think the VAE encoder step should just encode the image and return the image_latents;
the rest of the logic should go into prepare_latents.
This way it's more "modular", both for developing and using.
e.g. at runtime, if you only want to change the first or last frame, you only need to encode one of them and reuse the image_latents directly; same if you want to change num_frames.
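e.g. a rough sketch of what the encode-only step could boil down to (the helper name `encode_vae_image` and the exact signature are just placeholders for illustration, not the PR's actual API):

```python
import torch

# just a sketch: encode one preprocessed conditioning image and return its latents;
# the Wan VAE works on 5D video tensors, so a singleton frame dimension is added
# (the latents_mean / latents_std normalization done in the existing pipeline is left out here)
def encode_vae_image(vae, image: torch.Tensor, sample_mode: str = "argmax") -> torch.Tensor:
    video = image.unsqueeze(2)  # (batch, channels, height, width) -> (batch, channels, 1, height, width)
    posterior = vae.encode(video).latent_dist
    return posterior.mode() if sample_mode == "argmax" else posterior.sample()
```

then first_image / last_image can be encoded independently and the resulting image_latents reused by whatever block needs them.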
Ohh, I understand now. I missed the PrepareLatents nodes specific to the img2img example in SDXL. Will implement it correctly soon.
@yiyixuxu I was taking a look at this today. I'm a little confused about what "rest of the logic" entails here. Is it just the following part, or more?
```python
mask_lat_size = torch.ones(batch_size, 1, num_frames, latent_height, latent_width)
if last_image is None:
    mask_lat_size[:, :, list(range(1, num_frames))] = 0
else:
    mask_lat_size[:, :, list(range(1, num_frames - 1))] = 0
first_frame_mask = mask_lat_size[:, :, 0:1]
first_frame_mask = torch.repeat_interleave(
    first_frame_mask, dim=2, repeats=components.vae_scale_factor_temporal
)
mask_lat_size = torch.concat([first_frame_mask, mask_lat_size[:, :, 1:, :]], dim=2)
mask_lat_size = mask_lat_size.view(
    batch_size, -1, components.vae_scale_factor_temporal, latent_height, latent_width
)
mask_lat_size = mask_lat_size.transpose(1, 2)
mask_lat_size = mask_lat_size.to(latent_condition.device)
latent_condition = torch.concat([mask_lat_size, latent_condition], dim=1)
```
I see, it was indeed confusing, I think I didn't think it through before.

How about:

- have an `encode_vae_video` method that just encodes videos (after the image condition is pre-processed and converted into a 5D tensor) that can be imported from diffusers and used by different blocks, including custom ones (I started to refactor SDXL and I made a similar one here for images, `def encode_vae_image`)
- in `WanVaeEncoderStep`, it should include the logic to create the `video_condition` and put it through `encode_vae_video` to encode into `latent_condition`; the logic to create the mask should stay in here too since it is closely related to how the `video_condition` is created; we should make a different `Wan*VaeEncoderStep` for 5B TI2V
- a separate `prepare_latents` should only include:
  - generating the `randn_tensor`,
  - adjusting the `latent_condition` and `mask` based on `batch_size` (or if you just want to keep `prepare_latents` for 1, we can handle this logic somewhere else)

basically (see the sketch below):

- if the user wants to increase `num_videos_per_prompt`, they should not need to encode the images again
- if you want to use a different initial noise, you should not need to encode again

let me know what you think

another comment w.r.t. auto_blocks: I think we only need to pack workflows that can use the same checkpoint into the same package, e.g. for 5B the i2v and t2v can be packaged into an autoblocks; but for 14B we do not need to combine t2v and i2v because you need to load a different checkpoint anyway, and we should be able to map the checkpoint to the corresponding blocks.
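a rough sketch of that split (only `encode_vae_video`, `randn_tensor`, `latent_condition`, and `mask` come from the suggestion above; every other name is a placeholder for illustration, not the PR's implementation):

```python
import torch
from diffusers.utils.torch_utils import randn_tensor

def encode_vae_video(vae, video: torch.Tensor, sample_mode: str = "argmax") -> torch.Tensor:
    # video: preprocessed 5D tensor (batch, channels, frames, height, width)
    posterior = vae.encode(video).latent_dist
    return posterior.mode() if sample_mode == "argmax" else posterior.sample()

def prepare_latents(shape, latent_condition, mask, batch_size, generator=None, device=None, dtype=None):
    # only draws the initial noise and expands the already-encoded condition/mask
    # to the requested batch size, so changing num_videos_per_prompt or the seed
    # never forces a re-encode of the conditioning frames
    latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
    latent_condition = latent_condition.repeat(batch_size // latent_condition.shape[0], 1, 1, 1, 1)
    mask = mask.repeat(batch_size // mask.shape[0], 1, 1, 1, 1)
    return latents, latent_condition, mask
```

the `WanVaeEncoderStep` would then only build the `video_condition` plus the mask and call `encode_vae_video`, as described above.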
repos:
code (i2v):
results (i2v):
admittedly, not what i expected from the default prompt modification 🫠
output_guider_classifierfreeguidance.mp4
output_guider_adaptiveprojectedguidance.mp4
output_guider_tangentialclassifierfreeguidance.mp4
output_guider_classifierfreezerostarguidance.mp4
output_guider_perturbedattentionguidance.mp4
output_guider_autoguidance.mp4
code (flf2v):
results (flf2v):
TODO: code is currently running
code for saving pipelines: