
Conversation

@a-r-r-o-w
Contributor

@a-r-r-o-w a-r-r-o-w commented Mar 15, 2025

Checkpoints:

- hunyuanvideo-community/HunyuanVideo-I2V (16-channel)
- hunyuanvideo-community/HunyuanVideo-I2V-33ch (33-channel)

Code:

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

# Available checkpoints: "hunyuanvideo-community/HunyuanVideo-I2V" and "hunyuanvideo-community/HunyuanVideo-I2V-33ch"
model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
# Tiled VAE decoding lowers peak memory when decoding the video latents
pipe.vae.enable_tiling()
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)

Examples:

16-channel

output-hunyuani2v-3.mp4

33-channel

output-i2v-33ch.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review March 19, 2025 16:18
temb: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
*args,
Collaborator

what are these used for?

Contributor Author

In the main transformer, we pass token_replace_emb and first_frame_num_tokens. The original single and double transformer blocks don't use those arguments, so they need to be discarded. Since *args/**kwargs is never read, the extra arguments are effectively dropped.

The extra arguments are only used by the new token-replace single and double blocks.

Otherwise, we would have to do a lot of if-else in the main transformer model's forward:

(screenshot: the alternative if-else version of the main transformer forward)
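
For illustration, a minimal sketch of the pattern (a hypothetical block class, not the actual diffusers implementation) showing how the extra token-replace arguments are absorbed and ignored:

import torch
import torch.nn as nn

class PlainTransformerBlock(nn.Module):
    # Hypothetical stand-in for the original single/double blocks: the trailing
    # *args/**kwargs absorb token_replace_emb and first_frame_num_tokens, which
    # this block never reads, so the main forward can call every block the same way.
    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        temb: torch.Tensor,
        attention_mask=None,
        image_rotary_emb=None,
        *args,
        **kwargs,
    ):
        # ... attention and feed-forward as usual; the extra arguments are simply ignored ...
        return hidden_states, encoder_hidden_states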

Contributor Author

@a-r-r-o-w a-r-r-o-w Mar 20, 2025


Do you prefer something like this?

block_args = [hidden_states, encoder_hidden_states, temb, attention_mask, image_rotary_emb]

if self.config.image_condition_type == "token_replace":
    block_args.extend([token_replace_emb, first_frame_num_tokens])

for block in self.transformer_blocks:
    block_args[0], block_args[1] = block(*block_args)

Then we can remove *args and **kwargs

Collaborator

ok, sounds good! (no need to change)

callback_on_step_end_tensor_inputs: List[str] = ["latents"],
prompt_template: Dict[str, Any] = DEFAULT_PROMPT_TEMPLATE,
max_sequence_length: int = 256,
image_embed_interleave: Optional[int] = None,
Collaborator

what is this argument?

Contributor Author

I believe it controls how many image embedding tokens the transformer sees. Previously this was hard-coded to 2 (in encode_prompt). For the new I2V model it is user-configurable, and the value differs between the 33-channel and the 16-channel model.
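
For illustration, a hypothetical call that sets it explicitly (the value 2 here just mirrors the old encode_prompt default; the right value depends on the checkpoint):

# Hypothetical usage: pass image_embed_interleave explicitly instead of relying
# on the checkpoint-dependent default (2 mirrors the old hard-coded value).
output = pipe(image=image, prompt=prompt, image_embed_interleave=2).frames[0]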

@a-r-r-o-w
Contributor Author

Failing tests seem unrelated

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu March 21, 2025 04:06
Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!

@a-r-r-o-w a-r-r-o-w merged commit 8907a70 into main Mar 24, 2025
14 of 15 checks passed
@a-r-r-o-w a-r-r-o-w deleted the integrations/hunyuan-video-i2v-new branch March 24, 2025 15:48
@tin2tin

tin2tin commented Mar 24, 2025

Unfortunately, both of these checkpoints crash on an RTX 4090 when run with the demo code:

chan_crash.mp4

@a-r-r-o-w
Contributor Author

Multiple memory-saving optimizations need to be enabled for this one to run in 24 GB :(

cc @asomoza, as he'll be able to help you with examples here. The latest diffusers main provides group offloading (plus an option to lower CPU memory usage) and layerwise casting, which may help you run the model without too much additional time overhead: https://huggingface.co/docs/diffusers/main/en/optimization/memory
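
As a rough sketch of what that could look like on top of the demo code above (assuming a recent diffusers main; apply_group_offloading and enable_layerwise_casting are the documented entry points on the memory page, but the component names, dtypes, and grouping choices below are untested assumptions, not a verified 24 GB recipe):

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.hooks import apply_group_offloading

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
# Layerwise casting: store transformer weights in FP8 and upcast per layer for compute.
transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16
)

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
# Group offloading: keep only a small group of layers on the GPU at a time and
# stream the rest back from CPU RAM as they are needed.
apply_group_offloading(
    pipe.transformer,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True,
)
apply_group_offloading(
    pipe.text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4,
)
pipe.text_encoder_2.to(onload_device)
pipe.vae.to(onload_device)
# Don't call pipe.to("cuda") here: group-offloaded components manage their own device placement.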

@eppaneamd

#12273
