
CogView4 pipeline not accepting prompt embeds due to shape issues #10962

@Vargol

Describe the bug

I've been trying to run CogView4 using separate pipelines to encode the text and generate the image, in order to save memory (I'm on Unified Memory, so I can't use offloading), with the aim of running multiple prompts,

e.g.

te_pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B",
                                           transformer=None,
                                           vae=None,
                                           torch_dtype=torch.bfloat16).to("mps")

with torch.no_grad():
    prompt_embeds, negative_prompt_embeds = te_pipe.encode_prompt(
        prompt,
        negative_prompt,
        num_images_per_prompt=num_images_per_prompt,
    )

del te_pipe

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", text_encoder=None, tokenizer=None, torch_dtype=torch.bfloat16).to("mps")

and I get a failure with the following error

ValueError: `prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but got: `prompt_embeds` torch.Size([1, 144, 4096]) != `negative_prompt_embeds` torch.Size([1, 48, 4096]).

I'm using the same encode_prompt function that the pipeline uses internally, so I can't see why the embeds would be any different from the ones generated inside the pipeline. I'm also not sure why the shape check is needed in the first place, so I'm assuming either the check is a bug or the shape of the embeddings that encode_prompt generates is a bug.
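
For reference, this is roughly what the failing check looks like (my paraphrase from the traceback and error message, not the actual library source):

# Paraphrase of the shape check in check_inputs (pipeline_cogview4.py),
# reconstructed from the error message; not the real source.
if prompt_embeds is not None and negative_prompt_embeds is not None:
    if prompt_embeds.shape != negative_prompt_embeds.shape:
        raise ValueError(
            "`prompt_embeds` and `negative_prompt_embeds` must have the same shape "
            f"when passed directly, but got: `prompt_embeds` {prompt_embeds.shape} != "
            f"`negative_prompt_embeds` {negative_prompt_embeds.shape}."
        )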

If I try to skip passing the negative embeds, the pipeline tries to generate negative prompt embeddings itself, which fails because the new pipe doesn't have a text encoder.
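
As a stop-gap I can get past the check by padding the shorter embedding tensor along the sequence dimension before calling the image pipeline. This is only a sketch: it assumes zero-padding is acceptable, whereas the pipeline pads the token ids before embedding, so the padded positions may not embed identically:

import torch.nn.functional as F

# Workaround sketch (assumption: zero-padding the sequence dimension is
# close enough to the pad-token embeddings the pipeline would produce).
seq_diff = prompt_embeds.shape[1] - negative_prompt_embeds.shape[1]
if seq_diff > 0:
    negative_prompt_embeds = F.pad(negative_prompt_embeds, (0, 0, 0, seq_diff))
elif seq_diff < 0:
    prompt_embeds = F.pad(prompt_embeds, (0, 0, 0, -seq_diff))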

Reproduction

from diffusers import CogView4Pipeline
import torch
import gc

# Text-encoder-only pipeline: skip loading the transformer and VAE
te_pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B",
                                           transformer=None,
                                           vae=None,
                                           torch_dtype=torch.bfloat16).to("mps")


prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."

negative_prompt = "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured, jpeg artefacts"

num_images_per_prompt=1

with torch.no_grad():
    prompt_embeds, negative_prompt_embeds = te_pipe.encode_prompt(
        prompt,
        negative_prompt,
        num_images_per_prompt=num_images_per_prompt,
    )

def flush():
    gc.collect()
    torch.mps.empty_cache()
    gc.collect()
    torch.mps.empty_cache()

# Free the text encoder before loading the full image pipeline
del te_pipe.text_encoder
del te_pipe
flush()

# Image pipeline: skip loading the text encoder and tokenizer
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", text_encoder=None, tokenizer=None, torch_dtype=torch.bfloat16).to("mps")


# Enable VAE slicing and tiling to reduce GPU memory usage
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    guidance_scale=3.5,
    num_images_per_prompt=num_images_per_prompt,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]

image.save("cogview4.png")
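
For reference, the mismatch is already visible right after encode_prompt, before the second pipeline is even loaded; the shapes below are the ones reported in the error:

print(prompt_embeds.shape)           # torch.Size([1, 144, 4096])
print(negative_prompt_embeds.shape)  # torch.Size([1, 48, 4096])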

Logs

$ python cogview4_split.py 
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 12.11it/s]
Loading pipeline components...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00,  5.17it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 29.83it/s]
Loading pipeline components...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 18.05it/s]
Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/cogview4_split.py", line 41, in <module>
    image = pipe(
            ^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/pipelines/cogview4/pipeline_cogview4.py", line 515, in __call__
    self.check_inputs(
  File "/Volumes/SSD2TB/AI/Diffusers/lib/python3.11/site-packages/diffusers/pipelines/cogview4/pipeline_cogview4.py", line 366, in check_inputs
    raise ValueError(
ValueError: `prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but got: `prompt_embeds` torch.Size([1, 144, 4096]) != `negative_prompt_embeds` torch.Size([1, 48, 4096]).

System Info

  • πŸ€— Diffusers version: 0.33.0.dev0
  • Platform: macOS-15.3.1-arm64-arm-64bit
  • Running on Google Colab?: No
  • Python version: 3.11.10
  • PyTorch version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.27.1
  • Transformers version: 4.49.0
  • Accelerate version: 0.34.2
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: Apple M3
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

No response
