[modular diffusers] Wan I2V/FLF2V #11997
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
looking great! thanks @a-r-r-o-w
I left some comments, let me know what you think! From here I think it's very easy to support Wan 2.2!
```python
    return components, state


class WanVaeEncoderStep(PipelineBlock):
```
I think the VAE encoder step should just encode the image and return the `image_latents`; the rest of the logic should go into `prepare_latents`.
This way it's more "modular", both for developing and for using.
E.g. at runtime, if you only want to change the first or last frame, you only need to re-encode that one and use the other `image_latents` directly; same if you want to change `num_frames`.
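A minimal sketch of what an encode-only step could look like (the helper name and signature here are illustrative assumptions, not the block implementation in this PR):

```python
import torch

def encode_vae_image(vae, image: torch.Tensor, generator=None) -> torch.Tensor:
    # Encoder step does one thing: preprocessed condition tensor in, latents out.
    # Everything that depends on num_frames / batch_size would live in prepare_latents.
    return vae.encode(image).latent_dist.sample(generator=generator)

# At runtime, swapping only the last frame means re-encoding just that frame;
# the first-frame latents stay cached, and changing num_frames touches nothing here.
# first_latents = encode_vae_image(vae, first_frame)
# last_latents = encode_vae_image(vae, new_last_frame)
```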
Ohh, I understand now. I missed the PrepareLatents nodes specific to the img2img example in SDXL. Will implement it correctly soon.
@yiyixuxu I was taking a look at this today. A little confused by what is entailed by "rest of the logic" here. Is it just the following part, or more?
```python
mask_lat_size = torch.ones(batch_size, 1, num_frames, latent_height, latent_width)
if last_image is None:
    # only the first frame is a known condition
    mask_lat_size[:, :, list(range(1, num_frames))] = 0
else:
    # first and last frames are known conditions
    mask_lat_size[:, :, list(range(1, num_frames - 1))] = 0
first_frame_mask = mask_lat_size[:, :, 0:1]
first_frame_mask = torch.repeat_interleave(
    first_frame_mask, dim=2, repeats=components.vae_scale_factor_temporal
)
mask_lat_size = torch.concat([first_frame_mask, mask_lat_size[:, :, 1:, :]], dim=2)
mask_lat_size = mask_lat_size.view(
    batch_size, -1, components.vae_scale_factor_temporal, latent_height, latent_width
)
mask_lat_size = mask_lat_size.transpose(1, 2)
mask_lat_size = mask_lat_size.to(latent_condition.device)
latent_condition = torch.concat([mask_lat_size, latent_condition], dim=1)
```
I see, it was indeed confusing, I think I didn't think it through before.

How about:

- have an `encode_vae_video` method that just encodes videos (after the image condition is pre-processed and converted into a 5D tensor), which can be imported from diffusers and used by different blocks, including custom ones (I started to refactor SDXL and made a similar `encode_vae_image` helper there for images)
- in `WanVaeEncoderStep`, include the logic to create the `video_condition` and put it through `encode_vae_video` to encode it into `latent_condition`; the logic to create the mask should stay here too since it is closely tied to how the `video_condition` is created; we should make a different `Wan*VaeEncoderStep` for the 5B TI2V model
- a separate `prepare_latents` should only include:
  1. generating `randn_tensor`
  2. adjusting the `latent_condition` and `mask` based on `batch_size` (or, if you just want to keep `prepare_latents` for 1. only, we can handle this logic somewhere else)

basically:

- if a user wants to increase `num_videos_per_prompt`, they should not need to encode the images again
- if you want to use a different initial noise, you should not need to encode again

let me know what you think
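A rough sketch of how that split could read (function names and signatures are illustrative, not the final API; the condition layout follows the existing Wan i2v pipeline):

```python
import torch

def encode_vae_video(vae, video: torch.Tensor, generator=None) -> torch.Tensor:
    # Shared, importable helper: preprocessed 5D [B, C, F, H, W] video in, latents out.
    return vae.encode(video).latent_dist.sample(generator=generator)

def build_video_condition(first_frame: torch.Tensor, num_frames: int, last_frame=None) -> torch.Tensor:
    # Encoder-step logic: pad the known frame(s) with zeros to a full-length video
    # before encoding; the mask creation would live next to this for the same reason.
    b, c, h, w = first_frame.shape
    if last_frame is None:
        pad = first_frame.new_zeros(b, c, num_frames - 1, h, w)
        return torch.cat([first_frame.unsqueeze(2), pad], dim=2)
    pad = first_frame.new_zeros(b, c, num_frames - 2, h, w)
    return torch.cat([first_frame.unsqueeze(2), pad, last_frame.unsqueeze(2)], dim=2)

def prepare_latents(latent_condition, mask, noise_shape, num_videos_per_prompt=1, generator=None):
    # Prepare-latents logic: draw the initial noise and expand the already-encoded
    # condition/mask to the requested batch size, so changing num_videos_per_prompt
    # or the noise never forces a re-encode.
    latents = torch.randn(noise_shape, generator=generator, dtype=latent_condition.dtype)
    latent_condition = latent_condition.repeat_interleave(num_videos_per_prompt, dim=0)
    mask = mask.repeat_interleave(num_videos_per_prompt, dim=0)
    return latents, latent_condition, mask
```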
Another comment w.r.t. auto_blocks: I think we only need to pack workflows that can use the same checkpoint into the same package. E.g. for 5B, i2v and t2v can be packaged into one auto-blocks; but for 14B, t2v and i2v do not need to be combined because you need to load a different checkpoint anyway, and we should be able to map the checkpoint to the corresponding blocks.
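A toy illustration of that packaging rule (checkpoint ids are examples, block names are made up, and a plain dict stands in for whatever mapping mechanism the modular pipeline actually uses):

```python
# Workflows that can run off the same checkpoint share one auto-blocks package;
# workflows that need different checkpoints get their own, and the checkpoint id
# is what selects the blocks. Purely illustrative names below.
CHECKPOINT_TO_BLOCKS = {
    # 5B: t2v and i2v use the same weights -> one combined auto-blocks package
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers": "WanTI2VAutoBlocks",
    # 14B: t2v and i2v need different checkpoints -> separate block packages
    "Wan-AI/Wan2.1-T2V-14B-Diffusers": "WanT2VBlocks",
    "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers": "WanI2VBlocks",
}

def blocks_for(checkpoint_id: str) -> str:
    # Map a checkpoint to the block package that can consume it.
    return CHECKPOINT_TO_BLOCKS[checkpoint_id]
```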
Uh, has this one gone out of sight, out of mind?
repos:
code (i2v):
results (i2v):
Admittedly, not what I expected from the default prompt modification 🫠
output_guider_classifierfreeguidance.mp4
output_guider_adaptiveprojectedguidance.mp4
output_guider_tangentialclassifierfreeguidance.mp4
output_guider_classifierfreezerostarguidance.mp4
output_guider_perturbedattentionguidance.mp4
output_guider_autoguidance.mp4
code (flf2v):
results (flf2v):
TODO: code is currently running
code for saving pipelines: