
Conversation

@yiyixuxu yiyixuxu commented Feb 27, 2025

for #10921

yiyixuxu commented Feb 28, 2025

You can use this for now:

import torch
from transformers import AutoTokenizer, UMT5EncoderModel
from diffusers import AutoencoderKLWan, UniPCMultistepScheduler, WanPipeline, WanTransformer3DModel
from diffusers.utils import export_to_video

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seed = 0

# local path to a diffusers-format conversion of the Wan 2.1 VAE
vae_repo = "/raid/yiyi/wan2.1_vae_diffusers"
vae = AutoencoderKLWan.from_pretrained(vae_repo)
vae = vae.to(device)

# TODO: impl FlowDPMSolverMultistepScheduler
scheduler = UniPCMultistepScheduler(prediction_type='flow_prediction', use_flow_sigmas=True, num_train_timesteps=1000, flow_shift=1.0)

text_encoder = UMT5EncoderModel.from_pretrained("google/umt5-xxl", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")

# 14B
# transformer = WanTransformer3DModel.from_pretrained('StevenZhang/Wan2.1-T2V-14B-Diff', torch_dtype=torch.bfloat16)
transformer = WanTransformer3DModel.from_pretrained('StevenZhang/Wan2.1-T2V-1.3B-Diff', torch_dtype=torch.bfloat16)

components = {
    "transformer": transformer,
    "vae": vae,
    "scheduler": scheduler,
    "text_encoder": text_encoder,
    "tokenizer": tokenizer,
}
pipe = WanPipeline(**components)

pipe.to(device)

# roughly: "vivid colors, overexposed, static, blurred details, subtitles, style, artwork, painting, frame, still,
# overall gray, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn
# hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered
# background, three legs, crowded background, walking backwards"
negative_prompt = '色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走'

generator = torch.Generator(device=device).manual_seed(seed)
inputs = {
    "prompt": "两只拟人化的猫咪身穿舒适的拳击装备,戴着鲜艳的手套,在聚光灯照射的舞台上激烈对战",
    "negative_prompt": negative_prompt, # TODO
    "generator": generator,
    "num_inference_steps": 50,
    "flow_shift": 3.0,
    "guidance_scale": 5.0,
    "height": 480,
    "width": 832,
    "num_frames": 81,
    "max_sequence_length": 512,
    "output_type": "np"
}

video = pipe(**inputs).frames[0]

print(video.shape)

export_to_video(video, "output.mp4", fps=16)

yiyixuxu and others added 4 commits February 28, 2025 09:59
* update

* update

* refactor rope

* refactor pipeline

* make fix-copies

* add transformer test

* update

* update
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yiyixuxu commented Mar 2, 2025

@bot /style

@a-r-r-o-w a-r-r-o-w added the roadmap Add to current release roadmap label Mar 2, 2025
@a-r-r-o-w a-r-r-o-w merged commit 2d8a41c into main Mar 2, 2025
26 of 30 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Diffusers Roadmap 0.36 Mar 2, 2025
@a-r-r-o-w a-r-r-o-w deleted the yiyi-refactor-wan-vae branch March 2, 2025 11:54
X-niper commented Mar 5, 2025

In the WanVAE encoder, the code offsets and scales the log variance with the same factors used for mu. However, it is the Gaussian standard deviation that should be scaled by that factor, so the log variance should only be shifted by twice the log of the scale.

Is the implementation on the main branch correct? @a-r-r-o-w
mu = (mu - scale[0].view(1, self.z_dim, 1, 1, 1)) * scale[1].view(1, self.z_dim, 1, 1, 1)
# the original logvar = (logvar - scale[0].view(1, self.z_dim, 1, 1, 1)) * scale[1].view(1, self.z_dim, 1, 1, 1)
logvar = logvar + 2 * torch.log(scale[1].view(1, self.z_dim, 1, 1, 1)) # the proposal
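
For reference, a minimal numerical sketch of the argument (assuming the encoder parameterizes the latent as N(mu, exp(logvar)) and the normalization is meant to act on the sampled latent): if z ~ N(mu, sigma^2) and z' = (z - shift) * s, then z' ~ N((mu - shift) * s, (s * sigma)^2), i.e. logvar' = logvar + 2 * log(s).

import torch

torch.manual_seed(0)
mu, logvar = torch.tensor(0.7), torch.tensor(-1.2)
shift, s = torch.tensor(0.3), torch.tensor(2.5)

# Sample z ~ N(mu, exp(logvar)), then normalize the samples.
z = mu + torch.exp(0.5 * logvar) * torch.randn(1_000_000)
z_scaled = (z - shift) * s

# Empirical mean matches (mu - shift) * s; empirical std matches exp(0.5 * (logvar + 2 * log(s))),
# i.e. the proposed logvar shift, not the offset-and-scale transform applied to mu.
print(z_scaled.mean().item(), ((mu - shift) * s).item())
print(z_scaled.std().item(), torch.exp(0.5 * (logvar + 2 * torch.log(s))).item())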
