[wip] Text to Image Pipeline for wan #12093
Conversation
[wip] I need to figure out why images are being generated as oversaturated/very red; I think I am missing a video processing step after generation.
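(A likely culprit, for reference: before decoding, the stock WanPipeline denormalizes the latents using the VAE's per-channel `latents_mean`/`latents_std` and then post-processes the decoded frames; skipping the denormalization typically produces exactly this kind of oversaturated, color-shifted output. A minimal sketch of that step, assuming a denoised `latents` tensor and a loaded `AutoencoderKLWan` named `vae`:)

import torch

# Undo the per-channel latent normalization before decoding (a sketch of
# what the stock WanPipeline does; skipping it causes color shifts).
latents = latents.to(vae.dtype)
latents_mean = torch.tensor(vae.config.latents_mean).view(1, vae.config.z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents_std = 1.0 / torch.tensor(vae.config.latents_std).view(1, vae.config.z_dim, 1, 1, 1).to(latents.device, latents.dtype)
latents = latents / latents_std + latents_mean
video = vae.decode(latents, return_dict=False)[0]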
Hi @luke14free, what would be the difference between using the normal pipeline with a single frame (num_frames=1)?
hey @asomoza, it's pretty much the same, except you don't also load the transformer (the high-noise one). You can 100% make it work with the existing pipelines. I just thought it would have been cleaner to have a separate pipeline, but if it's not, I'll close.
@luke14free not completely sure (I haven't tested it yet), but I think you can pass a None transformer_2 to the existing pipeline.
Indeed you can, I tested it myself and it's what I am doing here. I think you are right that it makes little sense to have this; I'll close it anyway.
import torch
from diffusers import AutoencoderKLWan, WanPipeline, WanTransformer3DModel

# Load the VAE in float32 for decoding quality
vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
# Load only the low-noise expert (stored under "transformer_2") and use it
# as the pipeline's main transformer
transformer_low_noise = WanTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    subfolder="transformer_2",
    torch_dtype=torch.bfloat16,
)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    vae=vae,
    transformer=transformer_low_noise,
    boundary_ratio=None,  # disable the high/low-noise two-stage switch
    transformer_2=None,   # don't load the high-noise transformer at all
    torch_dtype=torch.bfloat16,
)
output = pipe(
    prompt=input_data.prompt,
    negative_prompt=input_data.negative_prompt,
    height=height,
    width=width,
    num_frames=1,  # required for text-to-image, to create a proper temporal dimension
    guidance_scale=input_data.guidance_scale,
    num_inference_steps=input_data.num_inference_steps,
).frames[0]

For anyone interested, this is how you do it with the existing pipelines :)
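(Since num_frames=1, the returned clip holds a single frame. A small sketch for turning it into an image, assuming the default output_type="np", which returns float arrays in [0, 1]:)

import numpy as np
from PIL import Image

# `output` is the (num_frames, height, width, 3) float array from .frames[0];
# take the single frame and convert it to a PIL image.
image = Image.fromarray((output[0] * 255).round().astype(np.uint8))
image.save("wan_t2i.png")  # hypothetical output filename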
What does this PR do?
Adds a new pipeline to support image generation with only the Wan low-noise transformer, which is getting a lot of traction (especially with new LoRAs on Civitai).
Since this new pipeline is just a reshuffle of the components of the t2v one, I am not sure whether overriding from_pretrained the way I did makes sense or is acceptable (one possible shape is sketched below). I also had to add a new return type, which again might not follow your style; feel free to ask for changes.
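(For illustration only, and not necessarily how this PR implements it: one way such a from_pretrained override could look. The class name WanTextToImagePipeline is a placeholder.)

import torch
from diffusers import WanPipeline, WanTransformer3DModel

class WanTextToImagePipeline(WanPipeline):  # placeholder name, not the PR's
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        # Load the low-noise expert (subfolder "transformer_2") and install
        # it as the main transformer, skipping the high-noise expert.
        if "transformer" not in kwargs:
            kwargs["transformer"] = WanTransformer3DModel.from_pretrained(
                pretrained_model_name_or_path,
                subfolder="transformer_2",
                torch_dtype=kwargs.get("torch_dtype", torch.bfloat16),
            )
        kwargs.setdefault("transformer_2", None)   # never load the high-noise expert
        kwargs.setdefault("boundary_ratio", None)  # single-stage denoising
        return super().from_pretrained(pretrained_model_name_or_path, **kwargs)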
The general idea of the pipeline is that I only use transformer_2 (the low-noise one), loading it as the main transformer. Then I predict just one frame and return that. It works with GGUFs as expected by passing a GGUF transformer to the pipeline, as sketched below.
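(For the GGUF path, a sketch of loading a quantized low-noise transformer; the checkpoint filename is a placeholder, and the assumption is that diffusers loads GGUF checkpoints via from_single_file with a GGUFQuantizationConfig:)

import torch
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel

ckpt_path = "wan2.2_t2v_low_noise_14B_Q4_K_M.gguf"  # placeholder path
transformer = WanTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# `transformer` can then be passed to the pipeline exactly as in the snippet above.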
Who can review?
Maybe @yiyixuxu? Not sure who the right person is.