I've noticed during training that the VitGAN tends to get stuck on one, two, or three "positional blobs", for lack of a better term (I rarely, if ever, see four).
Effectively, what I'm seeing is that the VitGAN needs to slide from one generation to the next in its latent space. In doing so, it seems to find it easier to create two "spots" in the image that are highly likely to contain specific concepts from each caption.
Does this match your experience? Any idea whether this is good or bad? In my experience with the "chimera" examples, it seems to hurt things.
I hope you can see what I mean: there's one position in particular that seems designated for the "head" of the animal. But it also biases the outputs from other captions; for instance:
tri-x 400 tx a cylinder made of coffee beans. a cylinder with the texture of coffee beans.



