How exactly does Conditioning (Concat) work on a lower level? #1436
-
I'd love to know more about this as well.
-
Last I checked, it just does this: https://pytorch.org/docs/stable/generated/torch.cat.html

Normally, Stable Diffusion works by turning your entire prompt into a vector embedding for it to understand, but AI is stupid and doesn't understand things very well: it smooshes everything together and will sometimes bleed words/concepts into places where they were not specified. Concat lets you break the prompt into "chunks" by making them separate entries. It's very useful for things like colors or character composition.

Example: if you combine 2 and 3, you get 5. But if you give someone 5, they won't know that you started with 2 and 3, so they'll have a tendency to only make 5.

If you want some advice on how to learn these things, I made a node setup that allows you to test prompts for a better visual understanding of how it functions: https://civitai.com/models/230634?modelVersionId=261739

In your example with the dog, you essentially told the AI that unconditional conditioning vectors were to be taken into consideration alongside the "Dog" prompt. I think this is why the composition improved overall, but it's just speculation on my part (limbs were in the correct place and paws were anatomically correct; I don't think it's a coincidence that giving the AI more freedom allowed it to clean up the image).
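To illustrate, here's a minimal sketch of that concatenation with placeholder tensors (the shapes assume SD1.5's 77-token, 768-dimensional CLIP embeddings; the random tensors just stand in for real encodings):

```python
import torch

# Placeholder tensors standing in for real CLIP encodings: for SD1.5,
# each CLIP Text Encode output has shape (batch, 77 tokens, 768 dims).
cond_dog = torch.randn(1, 77, 768)    # "a dog"
cond_empty = torch.randn(1, 77, 768)  # empty prompt (still 77 tokens)

# Concatenating along the token axis: the model's cross-attention now
# sees one sequence of 154 tokens instead of 77.
combined = torch.cat([cond_dog, cond_empty], dim=1)
print(combined.shape)  # torch.Size([1, 154, 768])
```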
-
There isn't much documentation about the Conditioning (Concat) node. With it, you can bypass the 77-token limit by passing in multiple prompts (replicating the behavior of the BREAK keyword in Automatic1111), but how do these prompts actually interact with each other inside Stable Diffusion?
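As a concrete sketch of what "multiple prompts" means at the tensor level (this assumes the SD1.5 text encoder loaded through Hugging Face transformers; the `encode` helper is just for illustration):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Model name is illustrative; SD1.5 uses this CLIP text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(prompt: str) -> torch.Tensor:
    """Encode a prompt to a fixed 77-token embedding, as CLIP Text Encode does."""
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**tokens).last_hidden_state  # (1, 77, 768)

# Two separate "chunks", like two prompts on either side of BREAK.
chunk_a = encode("a dog")
chunk_b = encode("a red collar")

# Concatenating along the token axis yields a 154-token conditioning,
# sidestepping the 77-token limit of a single encode.
combined = torch.cat([chunk_a, chunk_b], dim=1)
print(combined.shape)  # torch.Size([1, 154, 768])
```

Each chunk keeps its own 77-token block, so in principle concepts from one chunk should bleed less into another, but how the U-Net's cross-attention actually weighs the blocks against each other is exactly what the tests below probe.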
I ran a few A/B tests to get a better idea of what is happening under the hood, but I don't have a good answer so far.
For example, this is the result of a simple prompt of "a dog":

This is the same seed and hyperparameters, but with "a dog" concatted with an empty CLIP Text Encode:

Doing the same thing, but with "a dog" concatted with an empty CLIP Text Encode 4x:

What is interesting is that not only does the image change with the inclusion of an empty-string condition, but it also changes, to a much more minute degree, when the empty string is passed multiple times. How exactly can this be explained? A related question: does an empty string restrict the possible outputs of the model and exert its own bias on the distribution of possible images?
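One hedged guess at the first observation: an "empty" CLIP Text Encode is not a tensor of zeros. The tokenizer pads an empty prompt out to the full 77 positions (a start token, an end token, then padding), and those positions still carry non-trivial embeddings for the U-Net's cross-attention to attend to. A quick way to check, again assuming the SD1.5 encoder via Hugging Face transformers:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# An empty prompt still tokenizes to 77 tokens (start, end, then padding).
tokens = tokenizer("", padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    empty_embedding = encoder(**tokens).last_hidden_state  # (1, 77, 768)

# The embedding is far from zero, so concatenating it onto "a dog" gives
# the cross-attention 77 extra non-trivial keys/values to attend to.
print(empty_embedding.abs().mean())  # clearly > 0
```

If that's what's happening, the first empty chunk adds 77 new keys/values and shifts the attention distribution noticeably, while each further copy adds near-duplicates of keys that are already present, which would change the result less and less each time. That's speculation, but it matches the diminishing differences in the images above.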