SD3 failing with tensor mismatch when using negative prompt embeddings & num_images_per_prompt > 1 #12299

### Describe the bug

Hi there

I am attempting to get around the 77 CLIP token limit when using SD3 (Medium). To do this, I am using a package called [sd_embed](https://github.com/xhinker/sd_embed/tree/main), which chunks up my positive and negative prompts and embeds them. Rather than passing the prompt strings to the SD3 pipeline, I pass the embeddings instead:
```python
pipeline_kwargs = dict(
    num_images_per_prompt=num_images,
    num_inference_steps=self.num_inference_steps,
    height=self.image_height,
    width=self.image_width,
    guidance_scale=self.guidance_scale,
    generator=generator,
    **kwargs,
)

if use_embeddings:
    response = self.pipeline(  # type: ignore
        **pipeline_kwargs,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    )
```

However, I am getting a tensor mismatch error, which seems to stem from the fact that I am using `num_images_per_prompt > 1` (there are no issues when `num_images_per_prompt = 1`).

I have traced the error to [this function call](https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L1064) in the `pipeline_stable_diffusion_3.py` file.

Specifically, my dimensions are as follows:
* `latent_model_input` is `torch.Size([10, 16, 128, 128])`
* `timestep` is `torch.Size([10])`
* `prompt_embeds` is `torch.Size([2, 297, 4096])`
* `pooled_prompt_embeds` is `torch.Size([2, 2048])`
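
These shapes are consistent with the arithmetic sketched below. The formula is an assumption inferred from the numbers above, not something read out of the pipeline source: classifier-free guidance is taken to double the batch, the latents are taken to be tiled by `num_images_per_prompt`, and the user-supplied embeddings are taken to be left untiled. The function name and its parameters are hypothetical.

```python
def expected_batch_sizes(num_prompts, num_images_per_prompt,
                         use_cfg=True, embeds_tiled=False):
    """Return (latent_batch, embeds_batch) under the stated assumptions."""
    # Assumption: classifier-free guidance stacks negative and positive inputs.
    cfg_factor = 2 if use_cfg else 1
    # Assumption: latents are always tiled once per requested image.
    latent_batch = cfg_factor * num_prompts * num_images_per_prompt
    # Assumption: precomputed embeddings are only tiled if embeds_tiled is True.
    tile = num_images_per_prompt if embeds_tiled else 1
    embeds_batch = cfg_factor * num_prompts * tile
    return latent_batch, embeds_batch


print(expected_batch_sizes(1, 5))  # (10, 2): the failing case above
print(expected_batch_sizes(1, 1))  # (2, 2): the case that works
```

With one prompt and five images this gives 10 against 2, matching the shapes listed; with one image both sides are 2, which would explain why `num_images_per_prompt = 1` runs cleanly.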

Within this function call, the failure comes from [this line in `transformer_sd3.py`](https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/models/transformers/transformer_sd3.py#L374), which in turn stems from [this line in `embeddings.py`](https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/models/embeddings.py#L1591). The line is:
```
    conditioning = timesteps_emb + pooled_projections
```

The core of the issue is that `timesteps_emb` is of size `torch.Size([10, 1536])` while `pooled_projections` is of size `torch.Size([2, 1536])` (since it stems from `pooled_prompt_embeds`).
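
To illustrate why the addition fails for these two sizes, here is a minimal pure-Python sketch of the standard broadcasting rule that NumPy and PyTorch follow for elementwise operations. The helper `broadcastable` is hypothetical and only compares shapes; it does not touch any tensors.

```python
from itertools import zip_longest


def broadcastable(shape_a, shape_b):
    """Return True if two shapes can be broadcast together."""
    # Walk the dimensions from the trailing end; a missing dimension counts as 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for a, b in pairs:
        # Each pair must be equal, or one of the two must be 1.
        if a != b and a != 1 and b != 1:
            return False
    return True


print(broadcastable((10, 1536), (2, 1536)))  # False: 10 vs 2 on dim 0
print(broadcastable((2, 1536), (2, 1536)))   # True: the single-image case
print(broadcastable((10, 1536), (1, 1536)))  # True: a size-1 batch would stretch
```

A batch of 2 is neither equal to 10 nor equal to 1, so it cannot be stretched to fit, which is what the error message reports at dimension 0.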

### Reproduction

```python
import gc

import torch
from diffusers import StableDiffusion3Pipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd3

model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
pipe = StableDiffusion3Pipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds) = get_weighted_text_embeddings_sd3(
    pipe, prompt=prompt, neg_prompt=neg_prompt
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=5,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2),
).images[0]
image.save("sd3_waffle_hippo.png")

del prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
pipe.to("cpu")
gc.collect()
torch.cuda.empty_cache()
```

### Logs

This should fail with the following error:

```shell
RuntimeError: The size of tensor a (10) must match the size of tensor b (2) at non-singleton dimension 0
```
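
A possible interim workaround, sketched here and not verified against the pipeline, is to tile the four embedding tensors along the batch dimension before the call and request one image per (now repeated) prompt. This assumes the pipeline derives its batch size from `prompt_embeds.shape[0]` and does not tile user-supplied embeddings itself. The helper `tile_for_images` is hypothetical; it relies only on the `torch.Tensor.repeat_interleave` method, so the snippet itself needs no import.

```python
def tile_for_images(embeds, num_images):
    """Repeat every batch entry num_images times along dimension 0.

    `embeds` is expected to be a torch.Tensor, e.g. shape (1, 297, 4096)
    becomes (num_images, 297, 4096).
    """
    return embeds.repeat_interleave(num_images, dim=0)


# Hypothetical usage with the tensors from the reproduction (untested):
# num_images = 5
# images = pipe(
#     prompt_embeds=tile_for_images(prompt_embeds, num_images),
#     negative_prompt_embeds=tile_for_images(prompt_neg_embeds, num_images),
#     pooled_prompt_embeds=tile_for_images(pooled_prompt_embeds, num_images),
#     negative_pooled_prompt_embeds=tile_for_images(negative_pooled_prompt_embeds, num_images),
#     num_images_per_prompt=1,  # the batch dimension now carries the copies
#     num_inference_steps=30,
#     guidance_scale=4.0,
# ).images
```

If the assumption holds, the latents and the embeddings would both end up with a batch of 10 under classifier-free guidance, so the addition in `embeddings.py` would no longer see mismatched sizes.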

### System Info

- 🤗 Diffusers version: 0.35.1
- Platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.8.0+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.34.4
- Transformers version: 4.51.3
- Accelerate version: 0.32.1
- PEFT version: 0.17.1
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA RTX A6000, 46068 MiB
NVIDIA RTX A6000, 46068 MiB
NVIDIA RTX A6000, 46068 MiB
NVIDIA RTX A6000, 46068 MiB
- Using GPU in script?: NVIDIA RTX A6000
- Using distributed or parallel set-up in script?: just 1 GPU

### Who can help?

@yiyixuxu  @sayakpaul  @DN6  @asomoza 
