I wanted to make a Beginner's Primer: Are these good analogies? #2124
-
To understand VAEs, you need to understand the difference between the pixel representation of an image and the tensors of its latent representation. The pixel images SD works with are 3 (R, G, B) x 512 x 512 ≈ 786k values. Rather than feed these directly into the UNet, SD does a lossy compression step to reduce the dimensionality of the problem and to reshape the information into a better format. This is the VAE encoder. Its output is 4 x 64 x 64 = 16k values: smaller, and without that awkward non-power-of-2 in one of the dimensions (the 3 for RGB). Then, after image generation, the VAE decoder turns the latent image back into a normal image.

If I were to describe it to a layman, I'd describe it as lossy compression/decompression, like JPEG. It reduces the amount of redundant information the UNet has to deal with, and the UNet has been trained to work in this "compressed" language of images. The reason it affects image fidelity is that each VAE places different importance on different image traits (e.g. colour precision of clothing versus fine detail of skin texture). I really recommend this video.
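If it helps, the shape change is easy to show directly. Below is a minimal sketch, assuming the Hugging Face diffusers library and the commonly used sd-vae-ft-mse checkpoint (both are my choices for illustration, not something from the post above):

```python
# Minimal sketch of the VAE "compression" step, assuming the diffusers library
# and the stabilityai/sd-vae-ft-mse checkpoint.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

# A dummy 512x512 RGB image in the [-1, 1] range SD expects:
# 3 * 512 * 512 = 786,432 values.
image = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    # Encode: pixel space -> latent space (the lossy compression step).
    latents = vae.encode(image).latent_dist.sample()  # [1, 4, 64, 64] = 16,384 values
    # Decode: latent space -> pixel space (a reconstruction, not bit-exact).
    recon = vae.decode(latents).sample                 # [1, 3, 512, 512]

print(latents.shape, recon.shape)
```

The reconstruction will look almost identical to the input, but not quite, which is exactly the JPEG-style lossiness described above.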
-
Thanks for the response. I actually know the VAE technical details already (my fault for not including that info). The issue I was having was figuring out how to explain VAE to artists. My reddit experience shows that they have a very visceral reaction to math, or to anything written above a 5th-grade reading level. I was thinking of the image compression analogy, but I am legitimately worried that data compression is too complex a subject for them. The purpose of these analogies is so that they can have a better understanding of when they need or want to switch one of the components (like the model or CLIP). But since I haven't been able to find any meaningful differences when switching VAEs aside from color saturation, it's difficult for me to come up with a simple explanation as to why someone would want to change the VAE. I felt the analogies for UNET and CLIP were good and demonstrate what will happen if you change those things. I'm actually still thinking the "Color Grading" analogy is good for VAE, since that is the only difference an artist will notice when swapping.
-
Heyo, I'm in the process of making a tutorial that covers some stuff that (I feel) all ComfyUI users should know.
The 3 basic aspects I wanted to cover were UNET, Text Encoder, and VAE. I wanted to post here to make sure I didn't get anything wrong and to get feedback.
The CLIP node is basically the Text Encoder portion of the model. It converts the string of text in your prompt into tokens, and then uses those tokens to create a "Vector Embedding" (Conditioning) based on the words you used and their relative positions in the string. This is important, because "Girl, church" will give a drastically different image than "Church, girl".
My analogy for Text Encode is an artist deciding, "I am going to draw a moon in the night sky." The artist has an IDEA of what they want to draw, but they haven't actually started drawing anything yet.
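For the more technical readers, here is a minimal sketch of what the Text Encode step actually produces, assuming the Hugging Face transformers library and the openai/clip-vit-large-patch14 checkpoint that SD 1.x uses (both assumptions on my part):

```python
# Minimal sketch of prompt -> tokens -> conditioning, assuming the transformers
# library and the CLIP checkpoint used by SD 1.x.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Prompt -> token ids, padded to CLIP's 77-token context length.
tokens = tokenizer("girl, church", padding="max_length", max_length=77,
                   return_tensors="pt")

with torch.no_grad():
    # Token ids -> one embedding vector per token position. This tensor is the
    # "conditioning" the UNet receives; word order changes it, which is why
    # "church, girl" gives a different image.
    conditioning = text_encoder(tokens.input_ids).last_hidden_state

print(conditioning.shape)  # [1, 77, 768]
```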
UNET is the "Model" nodes. If you have ever heard the technical jargon that "Stable Diffusion works by denoising," then the UNET is the part that does this denoising. It takes the "Vector Embeddings" (conditioning) and uses some fancy math called "Cross Attention," which basically means "it just works." I think the analogy for UNET is the actual DRAWING of the image. Since the artist already decided in the last step that they were going to draw a moon, the UNET is the part where they actually draw the moon.
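For anyone who wants to peek under the hood, a single denoising step can be sketched like this, again assuming the diffusers library; the checkpoint id is just an illustrative SD 1.x example, and a real sampler would repeat this over many steps:

```python
# Minimal sketch of one denoising step, assuming the diffusers library and an
# SD 1.x UNet checkpoint (repo id is illustrative).
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

latents = torch.randn(1, 4, 64, 64)     # the noisy latent image
conditioning = torch.randn(1, 77, 768)  # text embeddings from the CLIP step
timestep = torch.tensor([999])          # how noisy we claim the latent is

with torch.no_grad():
    # Cross attention mixes the text conditioning into the latent; the UNet
    # predicts the noise it thinks is present, and the sampler subtracts a bit
    # of it, over and over.
    noise_pred = unet(latents, timestep, encoder_hidden_states=conditioning).sample

print(noise_pred.shape)  # [1, 4, 64, 64]
```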
VAE I struggle with, honestly. I have experimented with a few different VAEs and the differences were usually negligible: a couple of pixels different, or just different color saturation. The VAE is actually very important because it decodes the latent into pixel space, and vice versa, but I struggle to understand the differences between them, or even to justify using anything other than the VAE everyone loves to use (840000-MSE whatever, you know the one). The only analogy I could come up with for this was that it basically acts like the "Color Grade/Clarity/Sharpness" passes that artists use at the end of their workflow.
If the latent is being ENCODED (from a source image), then it would be similar to having a certain exposure profile when taking a photograph. I think these VAE analogies are bad because I don't really understand this part. All I know is that it encodes/decodes latents, so explaining it in a meaningful way is difficult.
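One way to see (and put a number on) how small the VAE-swap differences usually are: decode the same latent with two different VAEs and compare the pixels. A rough sketch, assuming diffusers and the ft-mse / ft-ema checkpoints as the two candidates; swap in whichever VAEs you actually want to compare:

```python
# Rough comparison of two VAE decoders on the same latent, assuming the
# diffusers library; checkpoint choices are illustrative.
import torch
from diffusers import AutoencoderKL

vae_a = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae_b = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

latents = torch.randn(1, 4, 64, 64)  # stand-in for a latent produced by the UNet

with torch.no_grad():
    image_a = vae_a.decode(latents).sample
    image_b = vae_b.decode(latents).sample

# Mean absolute pixel difference on a [-1, 1] scale. It is usually small, which
# is why a VAE swap mostly shows up as saturation/sharpness shifts rather than
# new content.
print((image_a - image_b).abs().mean().item())
```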
I basically want to explain these basic concepts without having to go into any math details so that creatively oriented folks can better understand what ANY of these things are actually doing.
Any feedback/advice greatly appreciated. This is a monumental task.