Description
Describe the bug
Hi there
I am attempting to get around the 77 CLIP token limit when using SD3 (Medium). To do this, I am using a package called sd_embed which chunks up my positive and negative prompts and embeds them. Rather than passing the prompt strings to the SD3 pipeline, I am then instead passing the embeddings as:
pipeline_kwargs = dict(
    num_images_per_prompt=num_images,
    num_inference_steps=self.num_inference_steps,
    height=self.image_height,
    width=self.image_width,
    guidance_scale=self.guidance_scale,
    generator=generator,
    **kwargs,
)
if use_embeddings:
    response = self.pipeline(  # type: ignore
        **pipeline_kwargs,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        pooled_prompt_embeds=pooled_prompt_embeds,
        negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    )
However, I am getting a tensor size mismatch error, which seems to stem from the fact that I am setting num_images_per_prompt > 1 (there are no issues when num_images_per_prompt = 1).
I have traced the error to this function call in the pipeline_stable_diffusion_3.py file.
Specifically, my dimensions are as follows:
- latent_model_input is torch.Size([10, 16, 128, 128])
- timestep is torch.Size([10])
- prompt_embeds is torch.Size([2, 297, 4096])
- pooled_prompt_embeds is torch.Size([2, 2048])
Within this function call, the failure comes from this line in transformer_sd3.py, which in turn leads to this line in embeddings.py. The line is:
conditioning = timesteps_emb + pooled_projections
The core of the issue is that timesteps_emb is of size torch.Size([10, 1536]) while pooled_projections is of size torch.Size([2, 1536]) (since it stems from pooled_prompt_embeds).
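The batch sizes reported above can be reproduced with simple arithmetic. The following is a minimal sketch, not the library's code: the helper names `cfg_batch` and `can_add_along_batch` are hypothetical, and it assumes that guidance doubles the batch by concatenating the negative and positive branches, which is consistent with the shapes listed above.

```python
def cfg_batch(num_prompts: int, num_images_per_prompt: int, use_cfg: bool = True) -> int:
    """Batch size seen by the transformer: one row per requested image,
    doubled when guidance concatenates negative and positive inputs (assumed)."""
    per_branch = num_prompts * num_images_per_prompt
    return per_branch * 2 if use_cfg else per_branch


def can_add_along_batch(a: int, b: int) -> bool:
    """Broadcasting rule for dimension 0: sizes must be equal, or one of them must be 1."""
    return a == b or a == 1 or b == 1


# Shapes from the report: 1 prompt, 5 images per prompt, guidance enabled.
latent_batch = cfg_batch(num_prompts=1, num_images_per_prompt=5)
# The supplied embeddings behave as if num_images_per_prompt were 1.
embed_batch = cfg_batch(num_prompts=1, num_images_per_prompt=1)

print(latent_batch, embed_batch, can_add_along_batch(latent_batch, embed_batch))
```

With these inputs the latents have batch size 10 and the embeddings batch size 2, so the addition cannot broadcast. With num_images_per_prompt = 1 both sides are 2, which matches the observation that a single image per prompt works.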
Reproduction
import gc
import torch
from diffusers import StableDiffusion3Pipeline
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd3
model_path = "stabilityai/stable-diffusion-3-medium-diffusers"
pipe = StableDiffusion3Pipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")
prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus.
This imaginative creature features the distinctive, bulky body of a hippo,
but with a texture and appearance resembling a golden-brown, crispy waffle.
The creature might have elements like waffle squares across its skin and a syrup-like sheen.
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting,
possibly including oversized utensils or plates in the background.
The image should evoke a sense of playful absurdity and culinary fantasy.
"""
neg_prompt = """
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),
(normal quality:2),lowres,((monochrome)),((grayscale))
"""
(prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds) = get_weighted_text_embeddings_sd3(
    pipe, prompt=prompt, neg_prompt=neg_prompt
)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_images_per_prompt=5,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2),
).images[0]
image.save("sd3_waffle_hippo.png")
del prompt_embeds, prompt_neg_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds
pipe.to("cpu")
gc.collect()
torch.cuda.empty_cache()
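One possible workaround, sketched here under assumptions and not verified against the pipeline, is to repeat each embedding tensor along the batch dimension before the call. The helper `repeat_per_image` is hypothetical, and the small tensor sizes are stand-ins for the real embeddings.

```python
import torch


def repeat_per_image(embeds: torch.Tensor, num_images: int) -> torch.Tensor:
    """Repeat each row of a batch-first tensor num_images times along dim 0."""
    return embeds.repeat_interleave(num_images, dim=0)


num_images = 5
# Stand-in tensors; the trailing dimensions are illustrative, not the SD3 sizes.
prompt_embeds = torch.randn(1, 8, 16)
pooled_prompt_embeds = torch.randn(1, 4)

prompt_embeds_rep = repeat_per_image(prompt_embeds, num_images)
pooled_rep = repeat_per_image(pooled_prompt_embeds, num_images)
print(prompt_embeds_rep.shape, pooled_rep.shape)
```

The same repetition would need to be applied to all four tensors (positive, negative, and both pooled variants). Assuming the pipeline infers its batch size from the first dimension of prompt_embeds when no prompt string is given, the repeated tensors would then be passed with num_images_per_prompt=1 so that the latents are not multiplied a second time.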
Logs
This should fail with the following error:
RuntimeError: The size of tensor a (10) must match the size of tensor b (2) at non-singleton dimension 0
System Info
- 🤗 Diffusers version: 0.35.1
- Platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.39
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.8.0+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.34.4
- Transformers version: 4.51.3
- Accelerate version: 0.32.1
- PEFT version: 0.17.1
- Bitsandbytes version: not installed
- Safetensors version: 0.5.3
- xFormers version: not installed
- Accelerator: NVIDIA RTX A6000, 46068 MiB
  NVIDIA RTX A6000, 46068 MiB
  NVIDIA RTX A6000, 46068 MiB
  NVIDIA RTX A6000, 46068 MiB
- Using GPU in script?: NVIDIA RTX A6000
- Using distributed or parallel set-up in script?: just 1 GPU