@frutiemax92

This PR enables the use of the DC VAE encoder/decoder with the PixArt-Sigma pipeline.
https://huggingface.co/mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers

This is the code I am using to test:

import gc

import torch
from transformers import T5EncoderModel

from diffusers import AutoencoderDC, PixArtSigmaPipeline, PixArtTransformer2DModel

# load the T5 text encoder in 8-bit to cut VRAM usage during prompt encoding
text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",
)
# load the pipeline without the transformer; it is only needed for prompt encoding here
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced",
)

with torch.no_grad():
    prompt = "cat"
    neg_prompt = "photo"
    prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(
        prompt, neg_prompt
    )

def flush():
    gc.collect()
    torch.cuda.empty_cache()

# free the text encoder and the encoding-only pipeline before loading the transformer and VAE
del text_encoder
del pipe
flush()

# duplicate the embeddings along the batch dimension to produce a batch of two
prompt_embeds = prompt_embeds.repeat(2, 1, 1)
prompt_attention_mask = prompt_attention_mask.repeat(2, 1)

negative_embeds = negative_embeds.repeat(2, 1, 1)
negative_prompt_attention_mask = negative_prompt_attention_mask.repeat(2, 1)

# rebuild the transformer config so its channel counts match the DC-AE latent space:
# in_channels=512 matches the f128c512 latents, and out_channels is doubled to 1024
# because PixArt-Sigma predicts a learned variance alongside the noise
config = PixArtTransformer2DModel.load_config("PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", subfolder="transformer")
config["in_channels"] = 512
config["out_channels"] = 1024
transformer = PixArtTransformer2DModel.from_config(config)

vae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers").to(torch.bfloat16)
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=None,
    transformer=transformer.to(torch.bfloat16),
    torch_dtype=torch.bfloat16,
    vae=vae,
).to("cuda")
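
# NOTE: `latents` was never defined in the snippet as posted; the call below is an
# assumed reconstruction using the standard latent-output pattern so the decode
# step further down has an input. The exact arguments of the original test are unknown.
with torch.no_grad():
    latents = pipe(
        negative_prompt=None,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        output_type="latent",
    ).images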

# drop the transformer before decoding to free VRAM
del pipe.transformer
flush()

with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")[0]
image.save("cat.png")

This code generates noise, since it is essentially inference with an untrained transformer. The DC encoder/decoder uses some different configuration naming conventions, so I had to adjust for them, and the shape of the generated latents did not match the original VAE's, so I chose to clip the excess "pixels".
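
For context, here is a minimal sketch of the shape mismatch described above (my illustration, not code from the PR; it assumes a 1024x1024 input, and the `crop_to` helper is hypothetical):

import torch

# PixArt-Sigma's stock f8c4 VAE: 1024 / 8 = 128 -> latents of shape (1, 4, 128, 128)
sd_latents = torch.randn(1, 4, 128, 128)

# DC-AE f128c512: 1024 / 128 = 8 -> latents of shape (1, 512, 8, 8)
dc_latents = torch.randn(1, 512, 8, 8)

# clipping the excess "pixels": if a decoded image overshoots the target
# resolution, drop the extra rows/columns (hypothetical helper, not from the PR)
def crop_to(image: torch.Tensor, height: int, width: int) -> torch.Tensor:
    return image[..., :height, :width]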

@frutiemax92 force-pushed the feature_pixartsigma_dcencoder branch from 47649f3 to 405225c on April 7, 2025 00:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hlky (Contributor) left a comment

Hi @frutiemax92. What's your intended use case here? As you mentioned, PixArt is not compatible with this VAE. Typically support for something in pipelines/modeling code comes after there's a model to use with it.

@frutiemax92 closed this on Apr 7, 2025