
Conversation

@linoytsaban
Collaborator

@linoytsaban linoytsaban commented Sep 12, 2025

https://huggingface.co/alibaba-pai/Wan2.2-VACE-Fun-A14B

diffusers format: https://huggingface.co/linoyts/Wan2.2-VACE-Fun-14B-diffusers

Example with Reference(s)-to-Video:
Notes:

  1. the boundary_ratio is set to 0.875 by default; I didn't experiment with other values (a sketch of overriding it follows this list)
  2. the attached videos were generated with Wan2.2 VACE using the lightx2v LoRA for 8-step inference
  3. all other VACE use cases should also be applicable (see Wan VACE #11582 for more examples)
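
If you want to experiment with boundary_ratio, here is a minimal sketch of overriding it at load time. It assumes from_pretrained forwards boundary_ratio to the pipeline's __init__; 0.9 is just an illustrative value:

import torch
from diffusers import WanVACEPipeline

# boundary_ratio sets the timestep boundary between the high-noise and low-noise experts (0.875 by default here)
pipe = WanVACEPipeline.from_pretrained(
    "linoyts/Wan2.2-VACE-Fun-14B-diffusers", boundary_ratio=0.9, torch_dtype=torch.bfloat16
)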
import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

model_id = "linoyts/Wan2.2-VACE-Fun-14B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
# model CPU offloading keeps VRAM usage down; it also handles device placement,
# so there is no need to call pipe.to("cuda") afterwards
pipe.enable_model_cpu_offload()


import torch
import PIL.Image
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image


def prepare_video_and_mask(height: int, width: int, num_frames: int, img: PIL.Image.Image = None):
    # builds the conditioning frames (gray placeholders) and per-frame masks;
    # if img is given, it is used as a fixed first frame (mask 0 = keep, 255 = generate)
    if img is not None:
        img = img.resize((width, height))
        frames = [img]
        # Ideally, this should be 127.5 to match original code, but they perform computation on numpy arrays
        # whereas we are passing PIL images. If you choose to pass numpy arrays, you can set it to 127.5 to
        # match the original code.
        frames.extend([PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames - 1))
        mask_black = PIL.Image.new("L", (width, height), 0)
        mask_white = PIL.Image.new("L", (width, height), 255)
        mask = [mask_black, *[mask_white] * (num_frames - 1)]
    else:
        frames = []
        # Ideally, this should be 127.5 to match original code, but they perform computation on numpy arrays
        # whereas we are passing PIL images. If you choose to pass numpy arrays, you can set it to 127.5 to
        # match the original code.
        frames.extend([PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames))
        mask_white = PIL.Image.new("L", (width, height), 255)
        mask = [mask_white] * (num_frames)
    return frames, mask

prompt = "the robot is wearing the sunglasses and the hat that reads 'GPU poor' and playfully moves around"  
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, typos, style, works, paintings, spelling mistakes, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

height = 480
width = 832
num_frames = 45
video, mask = prepare_video_and_mask(height, width, num_frames)
reference_images = [load_image("reachy.jpg"), load_image("sunglasses.jpg"), load_image("gpu_hat.png")]

output = pipe(
    video=video,
    mask=mask,
    prompt=prompt,
    reference_images=reference_images,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    num_inference_steps=30,
    guidance_scale=5.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(output, "output_VACE_ref.mp4", fps=16)

To use it with the fast-inference (lightx2v) LoRA:

# load the distilled lightx2v LoRA into both transformers (Wan2.2 uses a high-noise and a low-noise expert)
pipe.load_lora_weights(
    "Kijai/WanVideo_comfy",
    weight_name="Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank128_bf16.safetensors",
    adapter_name="lightx2v",
)
kwargs_lora = {"load_into_transformer_2": True}
pipe.load_lora_weights(
    "Kijai/WanVideo_comfy",
    weight_name="Lightx2v/lightx2v_I2V_14B_480p_cfg_step_distill_rank128_bf16.safetensors",
    adapter_name="lightx2v_2", **kwargs_lora
)
pipe.set_adapters(["lightx2v", "lightx2v_2"], adapter_weights=[1., 1.])
# fuse the LoRA into the transformer weights, then drop the adapter parameters
pipe.fuse_lora(adapter_names=["lightx2v"], lora_scale=3., components=["transformer"])
pipe.fuse_lora(adapter_names=["lightx2v_2"], lora_scale=1., components=["transformer_2"])
pipe.unload_lora_weights()

output = pipe(
    video=video,
    mask=mask,
    prompt=prompt,
    reference_images=reference_images,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    num_inference_steps=8, # 6-10 is probably a good range
    guidance_scale=1.0, # advised to use 1.0
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(output, "output_VACE_ref.mp4", fps=16)
Attached results: output_video-6.mp4, output_video-8.mp4

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@linoytsaban linoytsaban marked this pull request as ready for review September 12, 2025 17:07
Collaborator

@yiyixuxu yiyixuxu left a comment

thanks @linoytsaban !

@linoytsaban
Collaborator Author

@bot /style

@github-actions
Contributor

github-actions bot commented Sep 13, 2025

Style bot fixed some files and pushed the changes.

@sayakpaul
Member

Could we check that the failing test is not being introduced by this PR?

@J4BEZ
Contributor

J4BEZ commented Sep 15, 2025

Very Awesome!
I really appreciate your hard work🙇‍♂️

@linoytsaban
Collaborator Author

linoytsaban commented Sep 15, 2025

@sayakpaul @yiyixuxu I think the current failing test is not related.

@sayakpaul
Member

Indeed. The failure I pointed out has now gone 👍 Thanks for the work, Linoy!

@sayakpaul sayakpaul merged commit b500140 into huggingface:main Sep 15, 2025
9 of 10 checks passed
@bhack

bhack commented Sep 16, 2025

@linoytsaban Does this support Masked V2V?

@luke14free

@linoytsaban I noticed that using the lightx2v LoRA causes a lot of warnings about mismatched layers in the console and also produces much worse results than yours. Maybe it's the wrong LoRA link?

@linoytsaban linoytsaban deleted the vace_22 branch September 16, 2025 12:35
@00Neil

00Neil commented Sep 17, 2025

Thank you for your hard work on this! I'm wondering if this model supports multi-GPU inference. The reason I ask is that I currently have 8 RTX 4090 graphics cards available, and using a single 4090 leads to an out-of-memory (OOM) error.

@sayakpaul
Member

@00Neil we don't yet support exotic forms of parallelism within the library. #11941 is in the works.

We have some guidance on how to reduce memory consumption and other speed-up techniques we support in the library:
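
For reference, a minimal sketch of two memory-reduction options, assuming the WanVACEPipeline setup from the example above and that AutoencoderKLWan exposes enable_tiling() (illustrative only, not the full guide):

import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline

model_id = "linoyts/Wan2.2-VACE-Fun-14B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# keep submodules on CPU and move them to the GPU only while they run
pipe.enable_model_cpu_offload()
# or, for even lower VRAM at the cost of speed:
# pipe.enable_sequential_cpu_offload()

# decode latents in tiles to reduce VAE memory peaks
pipe.vae.enable_tiling()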

@bhack

bhack commented Sep 18, 2025

@sayakpaul @linoytsaban MV2V was just committed upstream:
aigc-apps/VideoX-Fun#328
