Fix small inconsistency in output dimension of "_get_t5_prompt_embeds" function in sd3 pipeline #12531
What does this PR do?
This PR fixes a small inconsistency in the output dimension of the `_get_t5_prompt_embeds` function in the Stable Diffusion 3 pipeline.

Previously, when `self.text_encoder_3` was `None`, the function returned a tensor (`torch.zeros`) with a sequence length of `self.tokenizer_max_length` (77), which corresponds to the CLIP encoder. However, the T5 text encoder used in SD3 has a different maximum sequence length (256). As a result, when `text_encoder_3` was available, the prompt embeddings had a sequence length of 333 (256 from T5 + 77 from CLIP), but when it was not available, the returned tensor had only 154 (77 + 77), leading to an inconsistency in output dimensions in `encode_prompt`.
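To make the mismatch concrete, here is a minimal, runnable sketch using dummy zero tensors in place of the real encoder outputs (the shapes, including the 4096 joint feature dimension, are illustrative rather than taken from the pipeline code):

```python
import torch

# Dummy shapes standing in for the SD3 encoder outputs (illustrative values).
batch, clip_len, t5_len, joint_dim = 1, 77, 256, 4096

# Stand-in for the CLIP embeddings after padding to the joint feature dim.
clip_prompt_embeds = torch.zeros(batch, clip_len, joint_dim)

# With text_encoder_3 available: T5 embeddings have sequence length 256.
t5_prompt_embeds = torch.zeros(batch, t5_len, joint_dim)
print(torch.cat([clip_prompt_embeds, t5_prompt_embeds], dim=-2).shape)   # (1, 333, 4096)

# Old behaviour when text_encoder_3 is None: zeros with the CLIP length (77).
t5_placeholder_old = torch.zeros(batch, clip_len, joint_dim)
print(torch.cat([clip_prompt_embeds, t5_placeholder_old], dim=-2).shape)  # (1, 154, 4096)

# New behaviour: zeros with max_sequence_length (256), matching the T5 case.
t5_placeholder_new = torch.zeros(batch, t5_len, joint_dim)
print(torch.cat([clip_prompt_embeds, t5_placeholder_new], dim=-2).shape)  # (1, 333, 4096)
```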
Motivation and Context

- This change ensures consistent tensor shapes across different encoder availability conditions in the SD3 pipeline.
- It prevents dimension mismatches and potential runtime errors when `text_encoder_3` is `None`.
- Previously, the zeros tensor used `self.tokenizer_max_length`, which corresponds to CLIP, rather than T5's longer maximum sequence length; this mismatch led to inconsistent embedding dimensions when combining outputs from CLIP and T5 in `encode_prompt`.

Changes Made
- Replaced `self.tokenizer_max_length` with `max_sequence_length` when returning the zero tensor in `_get_t5_prompt_embeds`, ensuring consistent output dimensions whether `text_encoder_3` is `None` or available.
- The same `max_sequence_length` parameter is already used in the tokenization step of the same function (see the sketch below):
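A paraphrased excerpt of the relevant lines (not the exact diff) is sketched below; names such as `batch_size`, `num_images_per_prompt`, and `self.transformer.config.joint_attention_dim` follow the surrounding pipeline code:

```python
# Paraphrased excerpt of _get_t5_prompt_embeds (not the full function).
if self.text_encoder_3 is None:
    return torch.zeros(
        (
            batch_size * num_images_per_prompt,
            max_sequence_length,  # was: self.tokenizer_max_length (77, the CLIP length)
            self.transformer.config.joint_attention_dim,
        ),
        device=device,
        dtype=dtype,
    )

# The tokenization step further down in the same function already uses this parameter:
text_inputs = self.tokenizer_3(
    prompt,
    padding="max_length",
    max_length=max_sequence_length,
    truncation=True,
    add_special_tokens=True,
    return_tensors="pt",
)
```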
Who can review?