Hunyuan I2V #10983

a-r-r-o-w · 2025-03-06T11:29:21Z

Thanks to the Tencent Hunyuan team for the amazing release!

Checkpoint: https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V

Example:

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)

HuggingFaceDocBuilderDev · 2025-03-06T11:35:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Kaisa-Supergene · 2025-03-07T02:28:49Z

@a-r-r-o-w Hi, I'm Kaisa Lim. I use and study image AI with diffusers.
While testing this PR, I got an error when text_encoder is called in the _get_llama_prompt_embeds function: the number of tokens in image_embeds differs from the image_emb_len value in DEFAULT_PROMPT_TEMPLATE. Has anyone experienced a similar issue?

yiyixuxu

thanks!

yiyixuxu · 2025-03-07T04:39:03Z

src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py

+        self.vae_scale_factor_spatial = self.vae.spatial_compression_ratio if getattr(self, "vae", None) else 8
+        self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)
+
+    def _get_llama_prompt_embeds(


it's not copied from the other pipeline?

It has extra logic to deal with image embeddings

a-r-r-o-w · 2025-03-07T06:51:42Z

@Kaisa-Supergene I'll take a look into that asap. I believe these values are from the official code and so, for the integration, we're going to use these anyway (even if they're incorrect). We can update on our end if it is indeed different.

https://github.com/Tencent/HunyuanVideo-I2V/blob/f1aa9a499fd06b418966bdcc7235c156c2d567d0/hyvideo/constants.py#L97

a-r-r-o-w · 2025-03-07T07:22:39Z

Failing tests are unrelated

chengzeyi · 2025-03-07T12:25:34Z

  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 751, in __call__
    prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 404, in encode_prompt
    prompt_embeds, prompt_attention_mask = self._get_llama_prompt_embeds(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 277, in _get_llama_prompt_embeds
    prompt_embeds = self.text_encoder(
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 427, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 1, features 576

I got this🧐

Kaisa-Supergene · 2025-03-07T12:29:46Z

I got the same issue too.
I searched for it in the HunyuanVideo GitHub repository and found a solution.
If you use the HunyuanVideo model with diffusers, check your transformers version.
transformers versions greater than 4.47.1 will raise that error.
Try transformers==4.47.1

chengzeyi · 2025-03-07T12:38:21Z

This version gives another different error🤣

Kaisa-Supergene · 2025-03-07T12:41:11Z

That is bad news, lol.
I don't know how to solve this, but this issue may help:
Tencent-Hunyuan/HunyuanVideo-I2V#7

a-r-r-o-w · 2025-03-07T14:31:59Z

I'm on the v4.48.0-dev branch of transformers during the integration. Here's my environment where it does not error out:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB

I think we might have to version guard Hunyuan-I2V if it is causing problems
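
A version guard of the kind suggested here could look roughly like the sketch below. It is only an illustration under assumptions: the helper names (release_tuple, is_supported, installed_transformers_version) and the MIN_TRANSFORMERS threshold are hypothetical, the threshold is taken solely from the working environment listed above, and development suffixes such as .dev0 are ignored when comparing. It is not how diffusers implements its own checks.

```python
from __future__ import annotations

import re
from importlib import metadata

# Hypothetical threshold: the only version reported as working in this thread.
MIN_TRANSFORMERS = "4.48.0"


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release parts, padded to three components.

    Suffixes such as '.dev0' are ignored, so '4.48.0.dev0' -> (4, 48, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions like '4.48' so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def is_supported(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_transformers_version() -> str | None:
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

A caller could run is_supported(installed_transformers_version() or "0") before building the pipeline and raise a clear error message when it returns False. For full PEP 440 ordering, including pre-releases, packaging.version.parse would be the more robust choice.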

ychenZHANG · 2025-03-11T06:10:06Z

Nice work!!

It looks like the inference scripts and model checkpoint are from Tencent's March 6 release. They released another version on March 7 to fix the ID consistency bug, with 16 input channels to the transformer instead of 33. Any plans to adapt that as well?

Thank you!

a-r-r-o-w added 2 commits March 6, 2025 11:49

update

ab2476b

update

77abad3

a-r-r-o-w added 8 commits March 6, 2025 13:43

update

655dcda

add tests

1e6ada6

update

e978876

add model tests

e13231c

update docs

a879a22

update

0a5a820

update example

f6a07e5

fix defaults

39a1ce8

a-r-r-o-w requested a review from yiyixuxu March 6, 2025 21:58

yiyixuxu approved these changes Mar 7, 2025

update

ab6c463

a-r-r-o-w merged commit 2e5203b into main Mar 7, 2025
14 of 15 checks passed

a-r-r-o-w deleted the integrations/hunyuan-i2v branch March 7, 2025 07:22

tolgacangoz mentioned this pull request Mar 12, 2025

withdrawn huggingface/finetrainers#319

Closed

vladmandic mentioned this pull request Mar 19, 2025

HunyuanVideoImageToVideoPipeline failures #11118

Closed
