@a-r-r-o-w a-r-r-o-w commented Mar 6, 2025

Thanks to the Tencent Hunyuan team for the amazing release!

Checkpoint: https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V

Example:

import torch
from diffusers import HunyuanVideoImageToVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import load_image, export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo-I2V"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=15)
output2.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu March 6, 2025 21:58
@Kaisa-Supergene

@a-r-r-o-w Hi, I'm Kaisa Lim. I use diffusers to study image AI.
While testing this PR, I got an error when the _get_llama_prompt_embeds function calls text_encoder: the number of image tokens in image_embeds does not match the image_emb_len value in DEFAULT_PROMPT_TEMPLATE. Has anyone experienced a similar issue?

Collaborator

@yiyixuxu yiyixuxu left a comment

thanks!

self.vae_scale_factor_spatial = self.vae.spatial_compression_ratio if getattr(self, "vae", None) else 8
self.video_processor = VideoProcessor(vae_scale_factor=self.vae_scale_factor_spatial)

def _get_llama_prompt_embeds(
Collaborator

it's not copied from the other pipeline?

Contributor Author

It has extra logic to deal with image embeddings

@a-r-r-o-w
Contributor Author

@Kaisa-Supergene I'll take a look into that asap. I believe these values are from the official code and so, for the integration, we're going to use these anyway (even if they're incorrect). We can update on our end if it is indeed different.

https://github.com/Tencent/HunyuanVideo-I2V/blob/f1aa9a499fd06b418966bdcc7235c156c2d567d0/hyvideo/constants.py#L97

@a-r-r-o-w
Contributor Author

Failing tests are unrelated

@a-r-r-o-w a-r-r-o-w merged commit 2e5203b into main Mar 7, 2025
14 of 15 checks passed
@a-r-r-o-w a-r-r-o-w deleted the integrations/hunyuan-i2v branch March 7, 2025 07:22
@chengzeyi
Contributor

  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 751, in __call__
    prompt_embeds, pooled_prompt_embeds, prompt_attention_mask = self.encode_prompt(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 404, in encode_prompt
    prompt_embeds, prompt_attention_mask = self._get_llama_prompt_embeds(
  File "/home/zeyi/repos/diffusers/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video.py", line 277, in _get_llama_prompt_embeds
    prompt_embeds = self.text_encoder(
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/zeyi/pyvenv/default/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 427, in forward
    raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 1, features 576

I got this🧐
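
For readers following the traceback, here is a minimal sketch of the kind of check that produces "tokens: 1, features 576". It is not the transformers implementation. It assumes a vision tower with 336x336 input and 14x14 patches, which gives 24 * 24 = 576 features per image, and it assumes the prompt must hold one image placeholder per feature. The names IMAGE_TOKEN, check_match and expand_image_tokens are hypothetical and exist only for this illustration.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder string for this sketch
IMAGE_SIZE = 336         # assumption: vision tower input resolution
PATCH_SIZE = 14          # assumption: vision tower patch size


def num_image_features(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch features for one square image."""
    return (image_size // patch_size) ** 2


def check_match(tokens: list, n_features: int) -> None:
    """Raise if the number of image placeholders differs from the feature count."""
    n_tokens = sum(1 for token in tokens if token == IMAGE_TOKEN)
    if n_tokens != n_features:
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"tokens: {n_tokens}, features {n_features}"
        )


def expand_image_tokens(tokens: list, n_features: int) -> list:
    """Replace each single image placeholder with n_features placeholders."""
    expanded = []
    for token in tokens:
        if token == IMAGE_TOKEN:
            expanded.extend([IMAGE_TOKEN] * n_features)
        else:
            expanded.append(token)
    return expanded


prompt = ["<s>", IMAGE_TOKEN, "Describe", "the", "image"]
n_features = num_image_features()

try:
    check_match(prompt, n_features)  # one placeholder versus 576 features
except ValueError as err:
    print(err)

# After expansion the counts agree and the check passes silently.
check_match(expand_image_tokens(prompt, n_features), n_features)
```

Under these assumptions, a prompt that still carries a single unexpanded placeholder fails the check, while an expanded prompt passes. Whether that is the actual cause here is not confirmed in this thread.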

@Kaisa-Supergene

I got the same issue too.
I searched the HunyuanVideo GitHub repository for similar issues and found a solution.
If you use the HunyuanVideo model with diffusers, check your transformers version.
transformers versions greater than 4.47.1 will raise that error.
Try transformers==4.47.1

@chengzeyi
Contributor

This version gives another different error🤣

@Kaisa-Supergene

That is bad news lol.
I don't know how to solve this, but this issue might help:
Tencent-Hunyuan/HunyuanVideo-I2V#7

@a-r-r-o-w
Contributor Author

a-r-r-o-w commented Mar 7, 2025

I'm on the v4.48.0-dev branch of transformers during the integration. Here's my environment where it does not error out:

- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): 0.8.5 (cpu)
- Jax version: 0.4.31
- JaxLib version: 0.4.31
- Huggingface_hub version: 0.28.1
- Transformers version: 4.48.0.dev0
- Accelerate version: 1.1.0.dev0
- PEFT version: 0.14.1.dev0
- Bitsandbytes version: 0.43.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB
NVIDIA DGX Display, 4096 MiB
NVIDIA A100-SXM4-80GB, 81920 MiB

I think we might have to version guard Hunyuan-I2V if it is causing problems
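
A version guard along these lines could be sketched as follows. This is only an illustration using the standard library, not code from this PR. The helper names release_tuple, is_supported and guard_package are hypothetical, and the upper bound has to be supplied by the caller because the thread does not settle which transformers versions work: 4.47.1 was suggested above, while 4.48.0.dev0 is reported as working in the environment listed here.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Numeric release part of a version string, e.g. '4.48.0.dev0' -> (4, 48, 0).

    Simplification: dev/pre-release/local suffixes are ignored, so
    '4.48.0.dev0' compares equal to '4.48.0'. Full PEP 440 ordering would
    need a proper version parser.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_supported(installed: str, max_supported: str) -> bool:
    """True if the installed release is not newer than max_supported."""
    return release_tuple(installed) <= release_tuple(max_supported)


def guard_package(package: str, max_supported: str) -> None:
    """Raise ImportError if `package` is missing or newer than max_supported."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError as err:
        raise ImportError(f"{package} is not installed") from err
    if not is_supported(installed, max_supported):
        raise ImportError(
            f"{package}=={installed} is newer than the last version this "
            f"guard was told to accept ({max_supported})"
        )


# Pure string comparison, no packages required:
print(is_supported("4.47.1", "4.47.1"))
print(is_supported("4.49.0", "4.47.1"))
```

A pipeline could call guard_package("transformers", some_bound) before loading the text encoder, with some_bound chosen once the working range is confirmed.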

@ychenZHANG

Nice work!!

Looks like the inference scripts and model checkpoint are from Tencent's March 6 release. They released another version on March 7 to fix the ID consistency bug, with 16 input channels to the transformer instead of 33. Any plans to adapt that as well?

Thank you!
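
The thread does not say how the 33 input channels are composed. One plausible reading, stated here purely as an assumption, is that the earlier release concatenates 16 noisy latent channels, 16 image latent channels and a 1-channel mask, while the later release feeds only the 16 latent channels and conditions on the image in some other way. The sketch below only does that arithmetic; the mode labels "latent_concat" and "token_replace" are used for illustration and should be checked against the official code.

```python
LATENT_CHANNELS = 16  # assumption: channels produced by the VAE encoder
MASK_CHANNELS = 1     # assumption: single-channel conditioning mask


def transformer_in_channels(condition_type: str) -> int:
    """Input channel count of the transformer under the assumed layouts."""
    if condition_type == "latent_concat":
        # noisy latents + image latents + mask, concatenated on the channel axis
        return LATENT_CHANNELS + LATENT_CHANNELS + MASK_CHANNELS
    if condition_type == "token_replace":
        # only the latents are passed in; no extra channels are concatenated
        return LATENT_CHANNELS
    raise ValueError(f"Unknown condition type: {condition_type!r}")


for mode in ("latent_concat", "token_replace"):
    print(mode, transformer_in_channels(mode))
```

If this reading holds, supporting the newer checkpoint means changing the conditioning path as well as the channel count, which is why it would be more than a weight swap.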
