Conversation

luke14free commented Aug 7, 2025

What does this PR do?

Adds a new pipeline to support Wan low-noise-transformer-only image generation, which is getting a lot of traction (especially with new LoRAs on Civitai).

Since this new pipeline is just a reshuffle of the components of the t2v one, I'm not sure whether overriding from_pretrained the way I did makes sense/is acceptable. I also had to add a new return type, which again might not follow your style; feel free to ask for changes.

The general idea of the pipeline is that I only use transformer_2 (the low-noise expert), loading it as the main transformer. Then I predict just one frame and return that. It works with GGUFs as expected, by passing a GGUF transformer into the pipeline.
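
(For context, a minimal sketch of the GGUF path, assuming a local GGUF checkpoint file and that WanTransformer3DModel supports diffusers' single-file GGUF loading via GGUFQuantizationConfig; the file name below is a placeholder, not a real release asset.)

import torch
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel

# Hypothetical local path to a GGUF export of the low-noise (transformer_2) weights.
gguf_path = "wan2.2_t2v_low_noise_14B_Q4_K_M.gguf"

transformer_low_noise = WanTransformer3DModel.from_single_file(
    gguf_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
    # If the config cannot be inferred from the checkpoint, you may also need
    # config="Wan-AI/Wan2.2-T2V-A14B-Diffusers", subfolder="transformer_2".
)

# The quantized transformer can then be passed as transformer= when building the pipeline,
# as in the full example at the end of this thread.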

Who can review?

Maybe @yiyixuxu? Not sure who the right person is.

luke14free changed the title from "Text to Image Pipeline for wan" to "[wip] Text to Image Pipeline for wan" on Aug 7, 2025
luke14free (Author) commented Aug 7, 2025

[wip] I need to figure out why images are being generated oversaturated/very red; I think I'm missing a video post-processing step after generation.

asomoza (Member) commented Aug 7, 2025

Hi @luke14free, what would be the difference between using the normal pipeline with boundary_ratio=1.0 and num_frames=1 and this?

luke14free (Author) commented Aug 7, 2025

Hey @asomoza, it's pretty much the same, except you don't also load transformer (the high-noise one). You can 100% make it work with the existing pipelines; I just thought it would be cleaner to have a separate pipeline, but if it isn't, I'll close this.

asomoza (Member) commented Aug 7, 2025

@luke14free I'm not completely sure (I haven't tested it yet), but I think you can pass transformer=None so it doesn't load. You can keep this open if you want and see if there's interest in a dedicated pipeline, but generally we don't create additional pipelines when the same thing can be done with the base one.

luke14free (Author) commented Aug 7, 2025

Indeed you can; I tested it myself and that's what I'm doing here. I think you're right that it makes little sense to have this, so I'll close it anyway.

luke14free closed this on Aug 7, 2025
luke14free (Author) commented Aug 7, 2025

import torch
from diffusers import AutoencoderKLWan, WanPipeline, WanTransformer3DModel

# VAE kept in float32 for decoding quality.
vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)

# Load only the low-noise expert (transformer_2) and use it as the main transformer.
transformer_low_noise = WanTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    subfolder="transformer_2",
    torch_dtype=torch.bfloat16,
)

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
    vae=vae,
    transformer=transformer_low_noise,
    boundary_ratio=None,   # single-expert denoising
    transformer_2=None,    # skip loading the high-noise expert
    torch_dtype=torch.bfloat16,
)

# prompt, negative_prompt, height, width, etc. come from the caller's own request object.
output = pipe(
    prompt=input_data.prompt,
    negative_prompt=input_data.negative_prompt,
    height=height,
    width=width,
    num_frames=1,  # required for text-to-image: generate a single-frame "video"
    guidance_scale=input_data.guidance_scale,
    num_inference_steps=input_data.num_inference_steps,
).frames[0]

For anyone interested, this is how you do it with the existing pipelines :)
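
(A small follow-up sketch for turning that single frame into an image file, assuming the pipeline's default output_type="np", i.e. frames come back as float arrays with values in [0, 1]; the output file name is just a placeholder.)

import numpy as np
from PIL import Image

# `output` above holds the frames of the first (and only) video; with num_frames=1
# it contains exactly one frame of shape (H, W, 3).
frame = output[0]
Image.fromarray((frame * 255).round().astype(np.uint8)).save("wan_t2i.png")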
