Brainstorming: ideas on how to better control subjects and contexts #3615
Replies: 5 comments 1 reply
-
I like your idea very much. I've also thought about this without knowing how to formulate it correctly. This would probably rely on complex prompt parsing and dedicated neural networks as pre-processing steps?
-
Have been discussing the need for a HandGAN with a friend for a little while now; the idea is pretty much exactly how GFPGAN works, just for hands. The problem is I don't have the time to commit to such an endeavor. The idea was to have it live in the Extras tab: instead of a strength slider, you would choose a hand pose for each arm from a dropdown, plus gender and race, then draw a small mask at the wrists, sort of like how inpainting works, but with a model trained only on images of hands in various poses.
-
Relevant: Training-Free Structured Diffusion Guidance (TFSDG)
-
This looks like a basic implementation of that in SD; I wonder about the feasibility of including it in the WebUI: https://github.com/cloneofsimo/paint-with-words-sd
-
NVIDIA did something similar: eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers. TL;DR: eDiff-I is a new generation of generative AI content-creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive paint-with-words capabilities.
-
It's no surprise that there are currently a lot of things diffusion struggles with, and various solutions have been proposed, such as hypernetworks, embeddings, etc. But I have a concept that might also be helpful, and I think a discussion about methods like this could help the community overall as well.
Subject Masks
Assuming CLIP models never get better (which I hope isn't true; I hope they do improve, since that would improve diffusion as a side effect), is there anything we can do to assist the model regardless? I have an idea, if we break the task into a few steps:
Example:
Let's say we wanted to make an image like this:
"a futuristic astronaut meeting a bizarre blue martian on the deck of a starship overlooking a beautiful alien world"
Separating Prompts:
If we can figure out either an algorithm or a syntax to separate our prompts by subject, for example: Subject1, Subject2, Background1, Background2, then we can use that information to generate masks to assign to each subject; a sketch of one possible syntax follows the list below.
In this example, I would split it like this:
- Subject1: a futuristic astronaut
- Subject2: a bizarre blue martian
- Background1: the deck of a starship
- Background2: a beautiful alien world
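For illustration, here is a minimal sketch of what such a separation syntax could look like. The `|` delimiter and the `Label:` prefixes are made up for this example; nothing in the WebUI parses prompts this way today:

```python
# Hypothetical syntax: "Label: text" segments separated by "|".
PROMPT = (
    "Subject1: a futuristic astronaut | "
    "Subject2: a bizarre blue martian | "
    "Background1: the deck of a starship | "
    "Background2: a beautiful alien world"
)

def split_prompt(prompt: str) -> dict:
    """Split a labeled prompt into {label: segment text} pairs."""
    segments = {}
    for part in prompt.split("|"):
        label, _, text = part.partition(":")
        segments[label.strip()] = text.strip()
    return segments

print(split_prompt(PROMPT))
# {'Subject1': 'a futuristic astronaut', 'Subject2': 'a bizarre blue martian', ...}
```

An algorithmic, syntax-free version would need NLP to find the noun phrases itself, which is the harder prompt-parsing problem mentioned in the first reply.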
Generating Masks:
Once we know how many subjects we have, we want a mask for each.

I don't know exactly how these would be made. My best guess: since the composition is generated from noise in the early steps, we could detect the best candidate region for each subject in the noise, then progressively refine each mask as the image resolves, through alternating diffusion and detection passes.
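I can't speak to what a real detection pass would look like, but the progressive-refinement part is mostly bookkeeping. Here is a toy numpy sketch, where `score_maps` is a random stand-in for whatever per-subject signal (cross-attention response, an object detector, etc.) an actual implementation would extract; only the smoothing logic is the point:

```python
import numpy as np

H, W, n_subjects, n_steps = 64, 64, 4, 8
rng = np.random.default_rng(0)

masks = np.full((n_subjects, H, W), 1.0 / n_subjects)  # start fully uncertain

for step in range(n_steps):
    # Stand-in for a detection pass over the partially denoised image.
    score_maps = rng.random((n_subjects, H, W))
    hard = (score_maps == score_maps.max(axis=0))  # winner-take-all per pixel
    # Blend toward the hard assignment; masks firm up as diffusion progresses.
    alpha = (step + 1) / n_steps
    masks = (1 - alpha) * masks + alpha * hard.astype(float)

# Each pixel's weights still sum to 1 across subjects.
assert np.allclose(masks.sum(axis=0), 1.0)
```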
I imagine an early mask image would be a rough color-coded map, especially if the mask was supplied by a user, where each color represents a different subject. On the implementation side, I don't know how the colors would be assigned to subjects, but if users could modify the weights per region, it would be pretty intuitive.
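Decoding a user-supplied color mask into per-subject masks, at least, is straightforward. A sketch, assuming a hypothetical fixed palette with one flat color per subject:

```python
import numpy as np

# Hypothetical palette; how colors get assigned to subjects is an open question.
PALETTE = {
    "Subject1":    (255, 0,   0),
    "Subject2":    (0,   255, 0),
    "Background1": (0,   0,   255),
    "Background2": (255, 255, 0),
}

def masks_from_color_image(rgb):
    """Turn an (H, W, 3) color-coded mask into per-subject binary masks."""
    return {
        label: np.all(rgb == np.array(color), axis=-1).astype(np.float32)
        for label, color in PALETTE.items()
    }

# Example: a 2x2 image that is half Subject1, half Background2.
img = np.array([[(255, 0, 0), (255, 0, 0)],
                [(255, 255, 0), (255, 255, 0)]], dtype=np.uint8)
print(masks_from_color_image(img)["Subject1"])
# [[1. 1.]
#  [0. 0.]]
```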
Using the Weights:
Once we have our subject components and our weights, we need to figure out how to use them. I hope my understanding here is good enough that this doesn't just sound like wishful thinking.
We want to avoid inefficient use of GPU resources, so we should try to process all of the passes at once, in parallel. Rather than running each mask as a separate pass like img2img, every pixel gets the entire prompt as a conditioning vector, but the per-component weights are controlled by the masks (does that make sense?).
So, in this way, the entire image is processed once, but every pixel has a different weight for each prompt component. (Another implementation note: let users control the weights with floats, for variable attention?)
The hope is that each masked area effectively sees only its own section of the prompt, even though every area receives all segments of it.
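If I'm reading the parallel-pass idea right, the per-pixel arithmetic would be something like `eps(p) = sum_i(w_i(p) * eps_i(p))`, with the weights summing to 1 at every pixel. A numpy sketch of just that blending step; `fake_denoise` is a placeholder for a real conditioned UNet call, and the weights are random here where a real implementation would take them from the subject masks:

```python
import numpy as np

H, W, n_segments = 64, 64, 4

def fake_denoise(latent, embedding):
    """Placeholder for one denoiser call conditioned on one prompt segment."""
    return latent * 0.9 + embedding  # not real diffusion math

latent = np.zeros((H, W))
embeddings = [np.full((H, W), float(i)) for i in range(n_segments)]  # fake conditioning

# Per-pixel weights, one per segment, summing to 1 at every pixel.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(n_segments), size=(H, W)).transpose(2, 0, 1)

# One batched pass: predict with every segment, then blend per pixel,
# so every pixel sees the whole prompt but at different weights.
preds = np.stack([fake_denoise(latent, e) for e in embeddings])  # (n_segments, H, W)
blended = (weights * preds).sum(axis=0)
assert blended.shape == (H, W)
```

This is only one reading; paint-with-words (linked above) instead biases the cross-attention scores with the mask weights, which avoids running the denoiser once per segment.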
I was also hoping it would be possible to use blurred masks, by analyzing how close a pixel's color is to each subject's respective color, but the implementation of that is beyond my understanding.
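One possible route: turn each pixel's distance in color space into soft per-subject weights with a softmax. A sketch, assuming the hypothetical `soft_weights` helper below (nothing existing, just an illustration):

```python
import numpy as np

def soft_weights(rgb, palette, sharpness=0.05):
    """Soft per-subject weights from each pixel's distance to each palette color.

    rgb:      (H, W, 3) user mask, possibly blurred
    palette:  (n_subjects, 3) reference color per subject
    returns:  (n_subjects, H, W) weights summing to 1 at every pixel
    """
    rgb = rgb.astype(np.float64)
    palette = np.asarray(palette, dtype=np.float64)
    # Squared color distance from every pixel to every subject's color.
    d = ((rgb[None, :, :, :] - palette[:, None, None, :]) ** 2).sum(axis=-1)
    # Softmax over subjects: nearer colors get larger weights; `sharpness`
    # controls how quickly a blurred edge fades between two subjects.
    logits = -sharpness * d
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)
```

A hard mask falls out as the limit of large `sharpness`, so the binary and blurred cases could share one code path.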
This is just a concept I was thinking about, and I hope it's inspiring or makes sense to someone. Do you think it's possible to implement as written? I understand it's a large task, which is why I wrote this as a discussion rather than an issue: I wanted to get the idea out there. I don't expect anyone to actually build it, even though I think it would be useful.