
Conversation

@a-r-r-o-w
Contributor

@a-r-r-o-w a-r-r-o-w commented Mar 15, 2025

Checkpoints:

- hunyuanvideo-community/HunyuanVideo-I2V (16-channel)
- hunyuanvideo-community/HunyuanVideo-I2V-33ch (33-channel)

Code:

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

# Available checkpoints: "hunyuanvideo-community/HunyuanVideo-I2V" and "hunyuanvideo-community/HunyuanVideo-I2V-33ch"
model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
# Tiled VAE decoding lowers peak memory when decoding the video latents
pipe.vae.enable_tiling()
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)

Examples:

16-channel

output-hunyuani2v-3.mp4

33-channel

output-i2v-33ch.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review March 19, 2025 16:18
temb: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
*args,
Collaborator

what are these used for?

Contributor Author

In the main transformer, we pass token_replace_emb and first_frame_num_tokens. The original single and double transformer blocks don't use those arguments, so they need to be discarded. Since *args/**kwargs is never read, the extra arguments are effectively dropped.

The extra arguments are only used by the new token-replace single and double blocks.

Otherwise, we would have to do a lot of if-else in the main transformer model's forward:

(screenshot: the alternative if-else version of the main transformer forward)
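
For illustration, a minimal sketch of the pattern (a hypothetical block class, not the actual diffusers implementation) showing how the extra token-replace arguments are absorbed and ignored:

import torch
import torch.nn as nn

class PlainTransformerBlock(nn.Module):
    # Hypothetical stand-in for the original single/double blocks: the trailing
    # *args/**kwargs absorb token_replace_emb and first_frame_num_tokens, which
    # this block never reads, so the main forward can call every block the same way.
    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        temb: torch.Tensor,
        attention_mask=None,
        image_rotary_emb=None,
        *args,
        **kwargs,
    ):
        # ... attention and feed-forward as usual; the extra arguments are simply ignored ...
        return hidden_states, encoder_hidden_states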

Contributor Author

@a-r-r-o-w a-r-r-o-w Mar 20, 2025


Do you prefer something like this?

block_args = [hidden_states, encoder_hidden_states, temb, attention_mask, image_rotary_emb]

if self.config.image_condition_type == "token_replace":
    block_args.extend([token_replace_emb, first_frame_num_tokens])

for block in self.transformer_blocks:
    block_args[0], block_args[1] = block(*block_args)

Then we can remove *args and **kwargs

Collaborator

ok, sounds good! (no need to change)

callback_on_step_end_tensor_inputs: List[str] = ["latents"],
prompt_template: Dict[str, Any] = DEFAULT_PROMPT_TEMPLATE,
max_sequence_length: int = 256,
image_embed_interleave: Optional[int] = None,
Collaborator

what is this argument?

Contributor Author

I believe it controls how many image embedding tokens the transformer sees. Previously this was hard-coded to 2 (in encode_prompt). For the new I2V model it is user-configurable, and the value differs between the 33-channel and the 16-channel model.
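
For illustration, a hypothetical call that sets it explicitly (the value 2 here just mirrors the old encode_prompt default; the right value depends on the checkpoint):

# Hypothetical usage: pass image_embed_interleave explicitly instead of relying
# on the checkpoint-dependent default (2 mirrors the old hard-coded value).
output = pipe(image=image, prompt=prompt, image_embed_interleave=2).frames[0]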

@a-r-r-o-w
Contributor Author

Failing tests seem unrelated

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu March 21, 2025 04:06
Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!

@a-r-r-o-w a-r-r-o-w merged commit 8907a70 into main Mar 24, 2025
14 of 15 checks passed
@a-r-r-o-w a-r-r-o-w deleted the integrations/hunyuan-video-i2v-new branch March 24, 2025 15:48
@tin2tin

tin2tin commented Mar 24, 2025

Unfortunately, both of these checkpoints crash on an RTX 4090 when run with the demo code:

chan_crash.mp4

@a-r-r-o-w
Contributor Author

Multiple memory-saving optimizations need to be enabled for this one to run in 24 GB :(

cc @asomoza, as he'll be able to help you with examples here. The latest diffusers main provides group offloading (plus an option to lower CPU memory usage) and layerwise casting, which may help you run the model without too much additional time overhead: https://huggingface.co/docs/diffusers/main/en/optimization/memory
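
As a rough sketch of what that could look like on top of the demo code above (assuming a recent diffusers main; apply_group_offloading and enable_layerwise_casting are the documented entry points on the memory page, but the component names, dtypes, and grouping choices below are untested assumptions, not a verified 24 GB recipe):

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.hooks import apply_group_offloading

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
# Layerwise casting: store transformer weights in FP8 and upcast per layer for compute.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
# Group offloading: keep only a small group of layers on the GPU at a time and
# stream the rest back from CPU RAM as they are needed.
apply_group_offloading(
    pipe.transformer,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4,
)
pipe.text_encoder_2.to(onload_device)
pipe.vae.to(onload_device)
# Don't call pipe.to("cuda") here: group-offloaded components manage their own device placement.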

@eppaneamd

#12273
