Description
Describe the bug
The HunyuanVideoImageToVideoPipeline pipeline fails with the latest combination of the diffusers and transformers libraries.
First, a minor issue with offloading: this snippet updates pipeline_hunyuan_video_image2video.py to pass an explicit device when creating the new tensors, so that the two torch.cat operations do not fail.
    if last_double_return_token_indices.shape[0] == 3:
        # in case the prompt is too long
        last_double_return_token_indices = torch.cat(
            (last_double_return_token_indices, torch.tensor([text_input_ids.shape[-1]], device=last_double_return_token_indices.device))
        )
        batch_indices = torch.cat((batch_indices, torch.tensor([0], device=batch_indices.device)))

The bigger issue is that transformers changed how image embeds work in LlavaForConditionalGeneration,
so the function _get_llama_prompt_embeds in HunyuanVideoImageToVideoPipeline needs an update
(the last version of transformers that works is transformers==4.47.1).
Specifically, it returns prompt_embeds and prompt_attention_mask that do not have the same length because of the way cropping is implemented, so they cannot later be combined in HunyuanVideoTokenRefiner (see the traceback under Logs).
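To illustrate why mismatched lengths break the pooling step, here is a minimal pure-Python sketch of the broadcasting rule that PyTorch applies to an elementwise product. The helper `broadcast_shapes` is hypothetical and written only for this illustration. The sequence lengths 177 and 429 are taken from the error message in the log, assuming tensor a is hidden_states and tensor b is mask_float; the batch size 1 and hidden size 4096 are assumptions.

```python
from itertools import zip_longest


def broadcast_shapes(a: tuple[int, ...], b: tuple[int, ...]) -> tuple[int, ...]:
    """Broadcast two shapes, aligning them from the trailing dimension.

    Two sizes are compatible when they are equal or one of them is 1.
    (Hypothetical helper for illustration, not part of diffusers or torch.)
    """
    result = []
    # Walk both shapes from the last dimension; missing leading dims count as 1.
    pairs = zip_longest(reversed(a), reversed(b), fillvalue=1)
    for i, (x, y) in enumerate(pairs):
        if x != y and x != 1 and y != 1:
            dim = max(len(a), len(b)) - 1 - i
            raise ValueError(f"size {x} does not match size {y} at dimension {dim}")
        result.append(y if x == 1 else x)
    return tuple(reversed(result))


# Sequence lengths 177 and 429 come from the log; 1 and 4096 are assumed.
hidden_states = (1, 177, 4096)  # shape of prompt_embeds after cropping
mask_float = (1, 429, 1)        # shape of attention_mask.float().unsqueeze(-1)

try:
    broadcast_shapes(hidden_states, mask_float)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)

# Once both tensors share the same sequence length, the product is well defined.
aligned = broadcast_shapes((1, 177, 4096), (1, 177, 1))
print(aligned)
```

The sketch only models shapes. It suggests that any fix in _get_llama_prompt_embeds has to crop prompt_embeds and prompt_attention_mask with the same indices so that dimension 1 agrees before the tensors reach HunyuanVideoTokenRefiner.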
Reproduction
See #10983 for a simple example.
Logs
/home/vlado/dev/sdnext/venv/lib/python3.12/site-packages/diffusers/models/transformers/transformer_hunyuan_video.py:312 in forward

  311     mask_float = attention_mask.float().unsqueeze(-1)
❱ 312     pooled_projections = (hidden_states * mask_float).sum(dim=1) / mask_float.sum(dim=1)
  313     pooled_projections = pooled_projections.to(original_dtype)

RuntimeError: The size of tensor a (177) must match the size of tensor b (429) at non-singleton dimension 1
System Info
diffusers==main
transformers==4.49.0
Who can help?