Photography Best Practices for SD #5305
-
These are valid points from an artistic perspective, for sure. However, Stable Diffusion is not, in any way, an intelligent system. It has a certain "understanding" of how words relate to images and a very clever way of converting that into visible results. The problem is that, for technical reasons, it was originally trained on an enormous pile of images, initially about 95% garbage, which were then cropped to squares without any subject centering. So most standard 2:3, 3:4 or even 9:16 portraits got the top of the head cut off. Hence all the crappy framing, poor lighting and other issues (stock image watermarks, yay!).

Edit: A little more info on this. You can see and search some of the images that were used to train the system (12,096,835 to be exact, at the moment of writing, or at least the type of images; who knows what they filtered and how) here: https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images

It seems like the most realistic way of getting decent results is relying on custom-trained models. Other than that, filtering in good examples and filtering out bad ones with keywords can help, but it is not a completely reliable solution.

Edit: The way you word it, it sounds like you want a 3D system that arranges content based on keywords and then lets you light and frame it nicely. That may be possible in the future (Nvidia has some early examples of single objects being generated that way), but for now that level of control seems a good distance away.
-
For proper placement exactly where you want it, something like img2img with the inpainting model or depth masks would make sense. For example, draw a blob somewhere on a white canvas and then run that through img2img, telling it what the blob is supposed to be, as in the sketch below.
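A minimal sketch of that blob-on-canvas idea, assuming the Hugging Face diffusers library and a CUDA GPU; the model id, blob shape, prompt and strength value are just illustrative choices, not a recommendation:

```python
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionImg2ImgPipeline

# Draw a rough red blob on a white 512x512 canvas to mark where the subject should go.
canvas = Image.new("RGB", (512, 512), "white")
draw = ImageDraw.Draw(canvas)
draw.ellipse((150, 200, 360, 440), fill="red")  # rough placement and size of the subject

# Load a Stable Diffusion img2img pipeline (model id is illustrative).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Tell the model what the blob is supposed to be; `strength` controls how far
# the result may drift from the rough layout (higher = more freedom).
result = pipe(
    prompt="a red vintage armchair in an empty white room, soft window light",
    image=canvas,
    strength=0.8,
    guidance_scale=7.5,
).images[0]
result.save("placed_subject.png")
```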
-
What you want is precision and control in general. It's the holy grail of image generation, and just like the holy grail it might be unreachable by text-to-image alone. Not just because the model doesn't understand natural language well - I'm sure that will improve greatly - but because treating the model in a purely functional way (send words, receive image) is inherently limited in conveying artistic intent. That would hold even with a real human on the other side who never analyzes requests or gives feedback, just blindly follows them.

Text-to-image alone is more of a toy; it's not well suited to getting things done. "Prompt engineering" is a gimmick - it scales poorly, and explaining your intent with higher-order tools than text is faster and more convenient than trying to write Finnish in a way that would yield coherent Chinese poems after Google Translate, which is roughly what txt2img does. TL;DR: one sketch for img2img is worth a thousand tokens. One depth guide is worth ten thousand.

Now, the closest thing to what you want that you can do right now is to use 2.0's depth2img feature (a rough code sketch follows this reply). Use any 3D software to quickly set up the rough geometry with directional lighting. Use free assets; there are plenty of royalty-free human rigs etc. Don't bother with details, anatomy, haircuts, physically correct rendering or anything like that - the model will deal with all of that, it just needs a rough guide. Then feed the draft render and/or a depth buffer of your 3D scene into the model as a guide for your prompts.

To get the style you want, use any of the available training techniques on your reference pictures, not clever prompt tricks, as those quickly become too complex to manage if you want control. Reference images can be the output of SD as well! Textual inversion is worth a shot: even though 2.0's prompt understanding is possibly gimped, all the training images are still there for your styling needs (sans NSFW, which requires Dreambooth and the like), so you'll be able to train your embedding easily.

Thinking about it more, image generation needs some kind of hybrid 2D/3D toolkit with a pipeline like this:
The model could possibly be further combined with something like this for temporal stability, making animation possible. Using tagged geometry as in the aforementioned paper, one could semantically tag specific objects or areas of the 3D scene to better guide the model ("this is a middle-aged man", "this is a house roof", "this is a Volvo semi", "this is a mirror", etc.). This would make complex scenes possible without overloading the textual input. I imagine one would need to train a new model from scratch with such use in mind. Such software would revolutionize content creation, just like other "unconventional" tools such as ZBrush revolutionized their respective areas.
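As a rough illustration of the depth2img step described above, here is a minimal sketch assuming the Hugging Face diffusers library, the stabilityai/stable-diffusion-2-depth checkpoint and a CUDA GPU; the file names, prompt and strength value are placeholders:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

# Load the SD 2.0 depth-conditioned pipeline.
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

# A quick draft render from your 3D scene: rough geometry, directional light,
# no detail needed. (Placeholder file name.)
draft = Image.open("draft_render.png").convert("RGB")

# The pipeline estimates a depth map from `image` on its own; if your 3D
# software can export the depth buffer, you can pass it via the optional
# `depth_map` argument instead.
result = pipe(
    prompt="portrait of a middle-aged man by a window, golden hour, 85mm photo",
    negative_prompt="blurry, deformed, watermark",
    image=draft,
    strength=0.7,
).images[0]
result.save("styled_render.png")
```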
-
I guess the simple way to put it is: Stable Diffusion is a magic box of math that you throw words into and get images out of. It doesn't have an understanding of what things are, but it knows which words are close in meaning and what images they might represent. And it knows how to take some Gaussian noise, some parameters and a prompt and somehow make an at-least-somewhat-coherent image.
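For what it's worth, that "magic box" interface looks roughly like this in code - a minimal sketch assuming the Hugging Face diffusers library and a CUDA GPU; the model id, seed and parameter values are just examples:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The "magic box": Gaussian noise (fixed here via a seeded generator),
# a few parameters, and a prompt go in; an image comes out.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    prompt="a cozy cabin in a snowy forest at dusk, warm window light",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("cabin.png")
```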
-
One thing I noticed about SD output is that it's truly missing some of the simple things that make photography really great. Two of them are framing and lighting.
For example, just telling SD to put one or both eyes on the top third of the frame would significantly improve the quality of the output. Much of the time, we get bodies with heads cut off at the neck; this would eliminate that problem completely for all "shapes" (human, animal, etc.) that have eyes. It would be much more complicated to add rules about how the frame crops the subject in cases where it would bisect their joints (shoulders and knees are common locations for this).
When rendering subjects that don't have eyes, where you want the focus of the piece to draw attention to the subject, the subject's center would be placed at the bottom, right or left third. This works best when the subject is small, such as a nugget of gold in a piece of rock, the sunset, or a mouse. There may be times when the subject is best placed at the top third, such as a moon or sun in the sky.
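SD has no such framing parameter today, but just to make the rule concrete, here is a small hypothetical helper that computes the rule-of-thirds lines and their intersections for a given frame size (pure Python; the function name and return format are made up for illustration):

```python
def rule_of_thirds(width: int, height: int):
    """Return the third lines and their four intersection points for a frame.

    A subject's eyes would sit on the top horizontal line; a small subject
    (a nugget of gold, the sun, a mouse) would sit on one of the intersections.
    """
    xs = (width / 3, 2 * width / 3)    # vertical third lines
    ys = (height / 3, 2 * height / 3)  # horizontal third lines
    intersections = [(x, y) for x in xs for y in ys]
    return {"vertical": xs, "horizontal": ys, "intersections": intersections}

# Example for a 512x768 portrait frame: eyes would sit near y = 256,
# a small subject near (170.7, 512) for a bottom-left placement.
print(rule_of_thirds(512, 768))
```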
Another good way to increase control (and thus quality) of the output would be parameters for placing the lighting, such as the sun: an altitude parameter (100 would be high-noon overhead sun, 10 would be sunset at the horizon, 5 would be dusk and 0 would be night) and a direction in relation to the subject (front would be 0 degrees, right would be 90, back would be 180, left of the subject would be 270).
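No such lighting parameters exist in SD today, but as a purely hypothetical sketch, one could map an altitude/direction pair onto prompt phrases and append the result to an ordinary text prompt (every name and threshold below is made up for illustration):

```python
def lighting_phrase(altitude: int, direction_deg: int) -> str:
    """Hypothetical mapping from (altitude 0-100, direction 0-360) to prompt text."""
    if altitude == 0:
        time_of_day = "night scene, moonlight"
    elif altitude <= 5:
        time_of_day = "dusk, deep blue hour light"
    elif altitude <= 10:
        time_of_day = "sunset at the horizon, golden hour"
    else:
        time_of_day = "bright midday sun overhead"

    # 0 = lit from the front, 90 = from the right, 180 = backlit, 270 = from the left
    if 135 <= direction_deg <= 225:
        direction = "backlit, strong rim light"
    elif direction_deg < 45 or direction_deg > 315:
        direction = "frontal lighting"
    else:
        direction = "side lighting from the right" if direction_deg <= 180 else "side lighting from the left"

    return f"{time_of_day}, {direction}"

# e.g. append to a prompt: "portrait of a hiker on a ridge, " + lighting_phrase(10, 180)
print(lighting_phrase(10, 180))  # -> "sunset at the horizon, golden hour, backlit, strong rim light"
```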
This is my first time posting anything on GitHub, so please pardon my noobishness.