
SD3 failing with tensor mismatch when using negative prompt embeddings & num_images_per_prompt > 1 #12299

@danielamassiceti


Describe the bug

Hi there

I am attempting to get around the 77 CLIP token limit when using SD3 (Medium). To do this, I am using a package called sd_embed which chunks up my positive and negative prompts and embeds them. Rather than passing the prompt strings to the SD3 pipeline, I am then instead passing the embeddings as:

pipeline_kwargs = dict(
    num_images_per_prompt=num_images,
    num_inference_steps=self.num_inference_steps,
    height=self.image_height,
    width=self.image_width,
    guidance_scale=self.guidance_scale,
    generator=generator,
    **kwargs,
)

if use_embeddings:
    response = self.pipeline(  # type: ignore
        **pipeline_kwargs,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    )

However, I get a tensor mismatch error that appears to stem from using num_images_per_prompt > 1 (there is no error when num_images_per_prompt = 1).

I have traced the error to this function call in the pipeline_stable_diffusion_3.py file.

Specifically, my dimensions are as follows:

  • latent_model_input is torch.Size([10, 16, 128, 128])
  • timestep is torch.Size([10])
  • prompt_embeds is torch.Size([2, 297, 4096])
  • pooled_prompt_embeds is torch.Size([2, 2048])

Within this function call, the failure originates at this line in transformer_sd3.py, which in turn stems from this line in embeddings.py. The line is:

    conditioning = timesteps_emb + pooled_projections

The core of the issue is that timesteps_emb is of size torch.Size([10, 1536]) while pooled_projections is of size torch.Size([2, 1536]) (since it stems from pooled_prompt_embeds).
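The mismatch can be reproduced in isolation without loading the model. The sketch below uses random stand-in tensors with the shapes reported above. The factor of 2 is an assumption: it presumes classifier-free guidance doubles the latent batch (negative and positive branches), which would explain a batch of 10 for 5 images.

```python
import torch

num_images_per_prompt = 5
cfg_factor = 2  # assumption: negative + positive branches under classifier-free guidance

# Stand-ins with the reported shapes; the values are random and carry no meaning.
timesteps_emb = torch.randn(num_images_per_prompt * cfg_factor, 1536)  # [10, 1536]
pooled_projections = torch.randn(cfg_factor, 1536)                     # [2, 1536]

try:
    # Broadcasting fails: dimension 0 is 10 vs 2, and neither is 1.
    conditioning = timesteps_emb + pooled_projections
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(error_message)
```

Since neither batch dimension is 1, PyTorch cannot broadcast one tensor against the other, so the addition raises instead of silently repeating the pooled projections.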

Reproduction

import gc

import torch
from diffusers import StableDiffusion3Pipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd3

model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
pipe = StableDiffusion3Pipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds) = get_weighted_text_embeddings_sd3(
    pipe, prompt=prompt, neg_prompt=neg_prompt
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=5,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2),
).images[0]
image.save("sd3_waffle_hippo.png")

del prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
pipe.to("cpu")
gc.collect()
torch.cuda.empty_cache()

Logs

This should fail with the following error:
RuntimeError: The size of tensor a (10) must match the size of tensor b (2) at non-singleton dimension 0
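
A possible workaround, sketched here under assumptions rather than verified against 0.35.1, is to repeat the precomputed embeddings along the batch dimension myself. The helper `repeat_for_num_images` is hypothetical (it is not part of diffusers or sd_embed), and the dummy tensors assume each embedding returned by sd_embed has a batch size of 1, which is what the reported post-concatenation batch of 2 suggests.

```python
import torch


def repeat_for_num_images(embeds: torch.Tensor, num_images: int) -> torch.Tensor:
    """Hypothetical helper: repeat every batch entry num_images times along dim 0."""
    return embeds.repeat_interleave(num_images, dim=0)


num_images = 5

# Dummy stand-ins for the tensors returned by get_weighted_text_embeddings_sd3
# (assumed batch size of 1 per tensor; the values are random).
prompt_embeds = torch.randn(1, 297, 4096)
prompt_neg_embeds = torch.randn(1, 297, 4096)
pooled_prompt_embeds = torch.randn(1, 2048)
negative_pooled_prompt_embeds = torch.randn(1, 2048)

expanded = {
    "prompt_embeds": repeat_for_num_images(prompt_embeds, num_images),
    "negative_prompt_embeds": repeat_for_num_images(prompt_neg_embeds, num_images),
    "pooled_prompt_embeds": repeat_for_num_images(pooled_prompt_embeds, num_images),
    "negative_pooled_prompt_embeds": repeat_for_num_images(
        negative_pooled_prompt_embeds, num_images
    ),
}

for name, tensor in expanded.items():
    print(name, tuple(tensor.shape))
```

The expanded tensors would then be passed as `pipe(**expanded, num_images_per_prompt=1, ...)`. This assumes the pipeline takes its batch size from `prompt_embeds.shape[0]` when no prompt string is given; if so, leaving `num_images_per_prompt=5` would multiply the latent batch again and reintroduce the mismatch.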

System Info

  • 🤗 Diffusers version: 0.35.1
  • Platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.39
  • Running on Google Colab?: No
  • Python version: 3.12.3
  • PyTorch version (GPU?): 2.8.0+cu128 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.34.4
  • Transformers version: 4.51.3
  • Accelerate version: 0.32.1
  • PEFT version: 0.17.1
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.3
  • xFormers version: not installed
  • Accelerator: NVIDIA RTX A6000, 46068 MiB
    NVIDIA RTX A6000, 46068 MiB
    NVIDIA RTX A6000, 46068 MiB
    NVIDIA RTX A6000, 46068 MiB
  • Using GPU in script?: NVIDIA RTX A6000
  • Using distributed or parallel set-up in script?: just 1 GPU

Who can help?

@yiyixuxu @sayakpaul @DN6 @asomoza

Labels: bug (Something isn't working)
