Training a new ControlNet with Img2Img active during training, or training other image variation models #6676
              
Asked by mayhemsloth in Q&A · Unanswered · 1 comment, 3 replies
Comment: This might be a useful resource: https://huggingface.co/lambdalabs/sd-image-variations-diffusers
Hello everyone! This is going to be a long post so I appreciate any response and your time and attention spent on that response.
Tl;dr: I want to train an image variation model that is guided by information in a conditional image instead of a conditional text prompt.
Background and Context: My overall goal is to produce a generative image model that, during inference, takes in a starting image and conditional information (in image form), and then outputs a new image (which I'll call the target image) that looks very similar to the starting image but has been changed by the conditional information, injected in the "correct" way as taught during training.
I have arrived at the following plan to accomplish this goal. I plan to train a ControlNet from scratch, using a custom dataset that I will prepare, somewhat similar to the community training example of the circle filling dataset. Initially, for computational reasons, I will use SDv1.5, but will very likely want to migrate to SDXL after proving that it can be done well with the 512 x 512 resolution (64 x 64 x 4 latents) of SDv1.5.
However, I want to utilize a starting image, and not necessarily the text prompt, to heavily influence the output of the final trained model, and so my plan is to use img2img as the initial "starting point" for the latents, instead of true noise, during inference.
Problem: After tracing some of the train_controlnet.py code from the example above, I found that the main denoising training loop starts here. The information in the "pixel_values" gets encoded by the VAE to produce latents. Note that this information is the target image, and the purpose of the network is to predict the noise injected into these latents a few lines down, based on information from the controlnet output and the conditioning of the encoder_hidden_states, which is the text prompt embedding.
Remember that what I want is a starting image that heavily influences the output. The target image would be slightly different from the starting image, and the information needed to bridge that gap is contained primarily in the conditional information (in image form), and much less so in the text prompt. There seems to be a pipeline designed for img2img with a ControlNet, but that's only for inference and not for training. I searched the diffusers issues and found this question, but the answer didn't explain very well why this doesn't work.
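To make the traced training step concrete, here is a minimal NumPy sketch of the noise-prediction objective described above. It is not the actual train_controlnet.py code: the linear beta schedule, the helper names (make_alphas_cumprod, add_noise, training_step), and the predict_noise callable standing in for the ControlNet plus UNet are all assumptions made for illustration.

```python
import numpy as np


def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(latents, noise, t, alphas_cumprod):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * latents + np.sqrt(1.0 - abar) * noise


def training_step(target_latents, predict_noise, alphas_cumprod, rng):
    """One noise-prediction step: the regression target is the sampled noise."""
    noise = rng.standard_normal(target_latents.shape)
    t = int(rng.integers(0, len(alphas_cumprod)))
    noisy_latents = add_noise(target_latents, noise, t, alphas_cumprod)
    # Hypothetical stand-in for the ControlNet + UNet forward pass.
    noise_pred = predict_noise(noisy_latents, t)
    return float(np.mean((noise_pred - noise) ** 2))


rng = np.random.default_rng(0)
alphas_cumprod = make_alphas_cumprod()
# Random stand-in for the VAE-encoded "pixel_values" (4 x 64 x 64 latents).
target_latents = rng.standard_normal((4, 64, 64))
# A dummy predictor that always outputs zeros, just to exercise the loss.
loss = training_step(target_latents, lambda x, t: np.zeros_like(x), alphas_cumprod, rng)
```

Note that the regression target is the sampled noise itself, so nothing in this objective refers to a starting image, which is why a starting image has no obvious place in it.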
After looking into the forward pass of that pipeline, it looks like the initial latents from the starting image are prepared here, after the starting image is sent through the VAE here. Then in the denoising loop, the latents get pushed through the controlnet and the unet, and then the predicted noise is extracted from the latents here.
I want to be able to change the code in train_controlnet.py such that it also accepts a "starting_image" and behaves similarly to the img2img pipeline. However, I feel like this might not work how I want it to because of the math behind why diffusion works in the first place (corrupting the image data distribution into the noise distribution, and undoing the process based on text conditioning). I don't quite understand how I would have the model learn that it's supposed to find the "noise" that exists "between" the "starting_image" and "target_image", if that makes sense. It seems like I would have to inject some amount of noise into the "starting_image" to get it into the "noisy distribution", and then have the denoising process transform it (by predicting and subtracting some noise) into the "target_image", conditioned on the conditional information.
I thought img2img might be the best approach to solve this, but in the process of typing this out I think there may be some other options.
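Before the alternatives, one way to look at the "noise between the starting image and the target image" question is through the forward process itself. If the same noise is added to both latents at timestep t, their difference is just the clean difference scaled by sqrt(abar_t), so the two noised latents become harder to tell apart as t grows. The NumPy sketch below checks this numerically; the linear beta schedule, the random stand-in latents, and the helper names are assumptions, not code from diffusers.

```python
import numpy as np


def alphas_cumprod_linear(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (an assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def rms(x):
    """Root-mean-square value of an array."""
    return float(np.sqrt(np.mean(x ** 2)))


def noised_gap(start_latents, target_latents, noise, t, alphas_cumprod):
    """RMS distance between noised start and noised target when both share one noise sample."""
    abar = alphas_cumprod[t]
    x_t_start = np.sqrt(abar) * start_latents + np.sqrt(1.0 - abar) * noise
    x_t_target = np.sqrt(abar) * target_latents + np.sqrt(1.0 - abar) * noise
    return rms(x_t_start - x_t_target)


rng = np.random.default_rng(0)
schedule = alphas_cumprod_linear()
# Random stand-ins: the target is a small perturbation of the starting latents.
start_latents = rng.standard_normal((4, 64, 64))
target_latents = start_latents + 0.1 * rng.standard_normal((4, 64, 64))
shared_noise = rng.standard_normal((4, 64, 64))

clean_gap = rms(start_latents - target_latents)
gap_early = noised_gap(start_latents, target_latents, shared_noise, 100, schedule)
gap_late = noised_gap(start_latents, target_latents, shared_noise, 900, schedule)
print(f"clean: {clean_gap:.4f}  t=100: {gap_early:.4f}  t=900: {gap_late:.4f}")
```

This may be part of why img2img inference can swap a noised starting image in for pure noise without retraining, though it says nothing by itself about which training objective would work best here.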
1) Utilize the encoder_hidden_states to inject starting image information that has been transformed into a text embedding. Basically unCLIP but with a ControlNet. I don't want to do this because the image-to-text-embedding-back-to-image transformation will necessarily "compress" the image detail, and I can't afford to lose that much information about the starting image.
2) Hijack encoder_hidden_states to create my own conditional image encoder and inject information there. This seems impossible without a tremendous amount of compute (which I don't have), as the model has been trained to consume encoder_hidden_states from a specific text embedding space.
3) Utilize two ControlNets: the first (StartControlNet) would condition the model on the starting image, and the second (ConditionControlNet) would condition the model on the conditional information. This is straightforward for me to understand, but ultimately inelegant IMO. Because the ControlNet inputs are the same spatial size as the VAE latent, you could give the ControlNet the direct VAE latent of the starting image in addition to any number of channels from whatever preferred encoded image space you want (whichever VGG19 layer is x8 smaller than the initial input, for example). Thus StartControlNet would be trained by letting the starting image become the target image, corrupting the target image (starting image now) with noise, and then figuring out how to denoise the target image (starting image) based on the conditional information from StartControlNet (which has been passed the starting image information, so it should be very easy!). So then with StartControlNet I have a model that can, ideally, turn random noise into whatever image I give StartControlNet. Awesome. This net would effectively be a "bias" term to force the final model to be very close to the starting image (being sent into StartControlNet).
The next step would be to train ConditionControlNet, which goes through basically the same process as StartControlNet, but with the StartControlNet outputs being added during the training of ConditionControlNet. I claim this is inelegant because it's like 2 training stages (feelsbadman.jpg), and I think I would have to arbitrarily fix the weighting of the two ControlNets with respect to each other. I guess if I unfreeze the weights of StartControlNet during training of ConditionControlNet then the model would overall learn how to weight them with respect to each other, AND I would retain control later to change the weighting if necessary. The ConditionControlNet would thus be trained by having the StartControlNet take as input the starting image (encoded/embedded properly), the ConditionControlNet take as input the conditional image/images (encoded/embedded properly), and the target being the proper target image, such that the overall model has to learn how to, ideally, change random noise into the target image while being conditioned by the starting image and the conditional image via the StartControlNet and ConditionControlNet, respectively. The good thing about this is that technically you can use multiple ConditionControlNets during future inference if you have multiple conditional images you want to control with.
4) Do what I proposed in 3) above, but just stack channels into exactly one ControlNet (StartConditionControlNet) such that you combine both the starting image and the conditioning image into one input, and allow the zero conv 1x1 layers to figure out which information is important to inject when. During training, some percentage of the time you can corrupt the conditioning image and change the target image to the starting image, to force the model to learn that it needs to pay attention to both pieces of information and the context between them.
5) Do something else??? I mainly want to use cross attention to directly attend to the conditional image information from the starting image as a means of getting to the target image.
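To make options 3) and 4) a bit more concrete, here is a rough NumPy sketch of the two pieces of bookkeeping they imply: a weighted sum of the residuals from two ControlNets, and a training-sample builder that stacks the starting latents with the conditioning channels and occasionally swaps the target for the starting image. Everything here is hypothetical: the function names, the swap_prob value, the choice to corrupt the conditioning by replacing it with Gaussian noise, and the assumption that both inputs are already latent-sized arrays. If I read the diffusers code correctly, ControlNetModel exposes a conditioning_channels argument but its default conditioning embedding expects pixel-resolution input and downsamples it, so that is worth checking before feeding it latent-sized inputs.

```python
import numpy as np


def combine_residuals(residuals, scales):
    """Option 3: weighted sum of per-ControlNet residuals (one scale per ControlNet)."""
    return sum(scale * res for scale, res in zip(scales, residuals))


def build_stacked_sample(start_latents, cond_features, target_latents, rng, swap_prob=0.1):
    """Option 4: stack starting latents and conditioning channels into one ControlNet input.

    With probability swap_prob the conditioning is corrupted (here: replaced by
    Gaussian noise, an assumption) and the target becomes the starting latents.
    """
    if rng.random() < swap_prob:
        cond_features = rng.standard_normal(cond_features.shape)
        target_latents = start_latents
    # Channels-first layout: concatenate along the channel axis.
    control_input = np.concatenate([start_latents, cond_features], axis=0)
    return control_input, target_latents


rng = np.random.default_rng(0)
start = rng.standard_normal((4, 64, 64))   # stand-in for VAE latents of the starting image
cond = rng.standard_normal((3, 64, 64))    # stand-in for latent-sized conditioning features
target = rng.standard_normal((4, 64, 64))  # stand-in for VAE latents of the target image

control_input, train_target = build_stacked_sample(start, cond, target, rng)
print(control_input.shape)  # 4 starting-latent channels + 3 conditioning channels

# Two equally shaped residuals, weighted 1.0 (start) and 0.5 (condition).
mixed = combine_residuals([np.ones((2, 8, 8)), np.ones((2, 8, 8))], [1.0, 0.5])
```

Keeping the scales outside the networks, as in combine_residuals, is what would let the relative weighting be changed later at inference time.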
If you as a reader feel that I am being too vague, it's purposeful. At the moment, I don't want to give too much away publicly about what I'm doing. Thanks for any help!