New HunyuanVideo-I2V #11066
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    temb: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
    *args,
```
What are these used for?
In the main transformer, we pass token_replace_emb and first_frame_num_tokens. The original single and double transformer blocks don't use those arguments, so the *args/**kwargs are there to absorb and discard them; only the new token-replace single and double blocks actually consume them.
Otherwise we would have to do a lot of if-else in the main transformer's forward:
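A rough sketch of what that would look like (argument names are from this PR; the exact call sites are an assumption):

```python
# Hypothetical: without *args/**kwargs, every block-calling loop must branch
# on the image conditioning mode.
for block in self.transformer_blocks:
    if self.config.image_condition_type == "token_replace":
        hidden_states, encoder_hidden_states = block(
            hidden_states, encoder_hidden_states, temb, attention_mask,
            image_rotary_emb, token_replace_emb, first_frame_num_tokens,
        )
    else:
        hidden_states, encoder_hidden_states = block(
            hidden_states, encoder_hidden_states, temb, attention_mask,
            image_rotary_emb,
        )
```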
Do you prefer something like this?
```python
block_args = [hidden_states, encoder_hidden_states, temb, attention_mask, image_rotary_emb]
if self.config.image_condition_type == "token_replace":
    block_args.extend([token_replace_emb, first_frame_num_tokens])
for block in self.transformer_blocks:
    block_args[0], block_args[1] = block(*block_args)
```

Then we can remove *args and **kwargs.
ok, sounds good! (no need to change)
```python
    callback_on_step_end_tensor_inputs: List[str] = ["latents"],
    prompt_template: Dict[str, Any] = DEFAULT_PROMPT_TEMPLATE,
    max_sequence_length: int = 256,
    image_embed_interleave: Optional[int] = None,
```
what is this argument?
I believe it controls how many image embedding tokens the transformer sees. Previously we defaulted this to 2 (in encode_prompt). For the new I2V model it is user-configurable, and the value differs between the 33-channel and the 16-channel model.
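As a minimal illustration of the idea (the slicing below is an assumption about how encode_prompt subsamples the sequence, not the exact diffusers code):

```python
# Hypothetical: keep every `image_embed_interleave`-th token of the image
# embedding sequence before it is joined with the text embeddings.
image_embeds = image_embeds[:, ::image_embed_interleave, :]
```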
Failing tests seem unrelated.
thanks!
Unfortunately, both of these checkpoints crash on an RTX 4090 when using the demo code: chan_crash.mp4
Multiple memory-saving optimizations need to be enabled for this one to run in 24 GB :( cc @asomoza as he'll be able to help you in this regard with examples. The latest diffusers main provides group offloading (plus an option to lower CPU memory usage) and layerwise casting, which may help you run the model without too much additional time overhead: https://huggingface.co/docs/diffusers/main/en/optimization/memory
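For reference, a minimal sketch combining those two features, assuming the HunyuanVideoImageToVideoPipeline class from this PR and an illustrative checkpoint id (the enable_* arguments follow the memory docs linked above):

```python
import torch
from diffusers import HunyuanVideoImageToVideoPipeline

# Illustrative checkpoint id; substitute the one you are using.
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.bfloat16
)

# Layerwise casting: store transformer weights in fp8, upcast per layer for compute.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

# Group offloading: keep weights on the CPU and stream groups of layers onto the GPU.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,  # overlap transfers with compute to limit the time overhead
)

pipe.vae.enable_tiling()  # also trims peak VRAM during VAE decoding
```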

Checkpoints:
Code:
Examples:
- 16-channel: output-hunyuani2v-3.mp4
- 33-channel: output-i2v-33ch.mp4