Brainstorming: ideas on how to better control subjects and contexts #3615
Replies: 5 comments 1 reply
-
I like your idea very much. I've also thought about this without knowing how to formulate it correctly. This would probably rely on complex prompt parsing and dedicated neural networks as pre-processing steps?
-
Have been discussing the need for a HandGAN with a friend for a little while now; the idea is pretty much exactly how GFPGAN works, just for hands. The problem is I don't have the time to commit to such an endeavor. The idea was to have it live in the Extras tab: instead of a strength slider, you would choose a hand pose for each arm from a dropdown, plus gender and race, then draw a small mask at the wrists, sort of like how inpainting works, but with a model trained only on images of hands in various poses.
-
Relevant: Training-Free Structured Diffusion Guidance (TFSDG)
-
This looks like a basic implementation of that in SD; I wonder about the feasibility of including it in the WebUI: https://github.com/cloneofsimo/paint-with-words-sd
-
NVIDIA did something similar: eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers. TL;DR: eDiff-I is a new generation of generative AI content-creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive paint-with-words capabilities.
-
It's no surprise that there are currently a lot of things diffusion struggles with, and various solutions have been proposed, such as hypernetworks, embeddings, etc. But I have a concept that might also be helpful, and I think a discussion about methods like this could help the community overall as well.
Subject Masks
Assuming CLIP models never get better (which I hope isn't true; I hope they do improve, since that would improve diffusion as a side effect), is there anything we can do to assist the model regardless? I have an idea, if we break the task into a few steps:
Example:
Let's say we wanted to make an image like this:
"a futuristic astronaut meeting a bizarre blue martian on the deck of a starship overlooking a beautiful alien world"
Separating Prompts:
If we can figure out either an algorithm or a syntax to separate our prompts by subject, for example: Subject1, Subject2, Background1, Background2, then we can use that information to generate masks to assign to each subject; a sketch of one possible syntax follows the list below.
In this example, I would split it like this:
- Subject1: a futuristic astronaut
- Subject2: a bizarre blue martian
- Background1: the deck of a starship
- Background2: a beautiful alien world
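For illustration, here is a minimal sketch of what such a separation syntax could look like. The `|` delimiter and the `Label:` prefixes are made up for this example; nothing in the WebUI parses prompts this way today:

```python
# Hypothetical syntax: "Label: text" segments separated by "|".
PROMPT = (
    "Subject1: a futuristic astronaut | "
    "Subject2: a bizarre blue martian | "
    "Background1: the deck of a starship | "
    "Background2: a beautiful alien world"
)

def split_prompt(prompt: str) -> dict:
    """Split a labeled prompt into {label: segment text} pairs."""
    segments = {}
    for part in prompt.split("|"):
        label, _, text = part.partition(":")
        segments[label.strip()] = text.strip()
    return segments

print(split_prompt(PROMPT))
# {'Subject1': 'a futuristic astronaut', 'Subject2': 'a bizarre blue martian', ...}
```

An algorithmic, syntax-free version would need NLP to find the noun phrases itself, which is the harder prompt-parsing problem mentioned in the first reply.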
Generating Masks:
Once we know how many subjects we have, we want a mask for each.

I don't know exactly how these would be made. My best guess: since the composition is generated from noise in the early steps, we could detect the best candidate region for each subject in the noise, then progressively refine each mask as the image resolves, through alternating diffusion and detection passes.
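I can't speak to what a real detection pass would look like, but the progressive-refinement part is mostly bookkeeping. Here is a toy numpy sketch, where `score_maps` is a random stand-in for whatever per-subject signal (cross-attention response, an object detector, etc.) an actual implementation would extract; only the smoothing logic is the point:

```python
import numpy as np

H, W, n_subjects, n_steps = 64, 64, 4, 8
rng = np.random.default_rng(0)

masks = np.full((n_subjects, H, W), 1.0 / n_subjects)  # start fully uncertain

for step in range(n_steps):
    # Stand-in for a detection pass over the partially denoised image.
    score_maps = rng.random((n_subjects, H, W))
    hard = (score_maps == score_maps.max(axis=0))  # winner-take-all per pixel
    # Blend toward the hard assignment; masks firm up as diffusion progresses.
    alpha = (step + 1) / n_steps
    masks = (1 - alpha) * masks + alpha * hard.astype(float)

# Each pixel's weights still sum to 1 across subjects.
assert np.allclose(masks.sum(axis=0), 1.0)
```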
I imagine an early mask image would be a rough color-coded map, especially if the mask was supplied by a user, where each color represents a different subject. On the implementation side, I don't know how the colors would be assigned to subjects, but if users could modify the weights per region, it would be pretty intuitive.
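Decoding a user-supplied color mask into per-subject masks, at least, is straightforward. A sketch, assuming a hypothetical fixed palette with one flat color per subject:

```python
import numpy as np

# Hypothetical palette; how colors get assigned to subjects is an open question.
PALETTE = {
    "Subject1":    (255, 0,   0),
    "Subject2":    (0,   255, 0),
    "Background1": (0,   0,   255),
    "Background2": (255, 255, 0),
}

def masks_from_color_image(rgb):
    """Turn an (H, W, 3) color-coded mask into per-subject binary masks."""
    return {
        label: np.all(rgb == np.array(color), axis=-1).astype(np.float32)
        for label, color in PALETTE.items()
    }

# Example: a 2x2 image that is half Subject1, half Background2.
img = np.array([[(255, 0, 0), (255, 0, 0)],
                [(255, 255, 0), (255, 255, 0)]], dtype=np.uint8)
print(masks_from_color_image(img)["Subject1"])
# [[1. 1.]
#  [0. 0.]]
```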
Using the Weights:
Once we have our subject components and our weights, we need to figure out how to use them. I hope my understanding here is good enough that this doesn't just sound like wishful thinking.
We want to avoid inefficient use of GPU resources, so we should try to process all of the passes at once, in parallel. Rather than running each mask as a separate pass like img2img, every pixel gets the entire prompt as a conditioning vector, but the per-component weights are controlled by the masks (does that make sense?).
So, in this way, the entire image is processed once, but every pixel has a different weight for each prompt component. (Another implementation note: let users control the weights with floats, for variable attention?)
The hope is that each masked area effectively sees only its own section of the prompt, even though every area receives all segments of it.
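If I'm reading the parallel-pass idea right, the per-pixel arithmetic would be something like `eps(p) = sum_i(w_i(p) * eps_i(p))`, with the weights summing to 1 at every pixel. A numpy sketch of just that blending step; `fake_denoise` is a placeholder for a real conditioned UNet call, and the weights are random here where a real implementation would take them from the subject masks:

```python
import numpy as np

H, W, n_segments = 64, 64, 4

def fake_denoise(latent, embedding):
    """Placeholder for one denoiser call conditioned on one prompt segment."""
    return latent * 0.9 + embedding  # not real diffusion math

latent = np.zeros((H, W))
embeddings = [np.full((H, W), float(i)) for i in range(n_segments)]  # fake conditioning

# Per-pixel weights, one per segment, summing to 1 at every pixel.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(n_segments), size=(H, W)).transpose(2, 0, 1)

# One batched pass: predict with every segment, then blend per pixel,
# so every pixel sees the whole prompt but at different weights.
preds = np.stack([fake_denoise(latent, e) for e in embeddings])  # (n_segments, H, W)
blended = (weights * preds).sum(axis=0)
assert blended.shape == (H, W)
```

This is only one reading; paint-with-words (linked above) instead biases the cross-attention scores with the mask weights, which avoids running the denoiser once per segment.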
I was also hoping it would be possible to use blurred masks, by analyzing how close a pixel's color is to each subject's respective color, but the implementation of that is beyond my understanding.
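One possible route: turn each pixel's distance in color space into soft per-subject weights with a softmax. A sketch, assuming the hypothetical `soft_weights` helper below (nothing existing, just an illustration):

```python
import numpy as np

def soft_weights(rgb, palette, sharpness=0.05):
    """Soft per-subject weights from each pixel's distance to each palette color.

    rgb:      (H, W, 3) user mask, possibly blurred
    palette:  (n_subjects, 3) reference color per subject
    returns:  (n_subjects, H, W) weights summing to 1 at every pixel
    """
    rgb = rgb.astype(np.float64)
    palette = np.asarray(palette, dtype=np.float64)
    # Squared color distance from every pixel to every subject's color.
    d = ((rgb[None, :, :, :] - palette[:, None, None, :]) ** 2).sum(axis=-1)
    # Softmax over subjects: nearer colors get larger weights; `sharpness`
    # controls how quickly a blurred edge fades between two subjects.
    logits = -sharpness * d
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)
```

A hard mask falls out as the limit of large `sharpness`, so the binary and blurred cases could share one code path.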
This is just a concept I was thinking about, and I hope it's inspiring or makes sense to someone. Do you think it's possible to implement as written? I understand it's a large task, which is why I wrote this as a discussion rather than an issue: I wanted to get the idea out there. I don't expect anyone to actually build it, even though I think it would be useful.