Add support for Multiple ControlNetXSAdapters in SDXL pipeline #12100
What does this PR do?
This PR addresses the feature request from an open good-first issue: #8434. It extends the current ControlNet adapter logic to support multiple ControlNet adapters injected into the diffusion model.
Before this change, StableDiffusionXLControlNetXSPipeline loads the base UNet model and only supports single-point injection from one ControlNet, as shown below.
With this change, StableDiffusionXLControlNetXSPipeline can take a new UNet conditioning model, MultiControlUnetConditionModel, which loads weights from multiple ControlNets and injects every ControlNet's output into the base model through zero convolution layers.
Since we did not find an existing test repo to verify this change, we used the following code (not included in this repo) to verify it.
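The snippet below is a minimal sketch of the kind of verification run we mean. The checkpoint IDs, conditioning images, and the list-valued `controlnet` / `controlnet_conditioning_scale` arguments are illustrative assumptions tied to this PR, not existing repo contents:

```python
import torch
from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter
from diffusers.utils import load_image

# Two ControlNet-XS adapters; the repo IDs below are placeholders for illustration.
controlnet_canny = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnetxs-sdxl-canny", torch_dtype=torch.float16
)
controlnet_depth = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnetxs-sdxl-depth", torch_dtype=torch.float16
)

# Passing a list of adapters is what would trigger the new MultiControlUnetConditionModel path (assumed API).
pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet_canny, controlnet_depth],
    torch_dtype=torch.float16,
).to("cuda")

canny_image = load_image("path/to/canny_condition.png")  # placeholder conditioning images
depth_image = load_image("path/to/depth_condition.png")

image = pipe(
    prompt="a futuristic cityscape at dusk",
    image=[canny_image, depth_image],           # one conditioning image per adapter
    controlnet_conditioning_scale=[0.7, 0.5],   # per-adapter injection weights
    num_inference_steps=30,
).images[0]
image.save("multi_controlnetxs_sdxl.png")
```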
Design Details:
Stage 1: Prepare embeddings for base UNet and ControlNets
Each ControlNet has its own controlnet_cond_embedding module and control_to_base_for_conv_in module, which compute the control embedding and add it onto h_base.
With the new change, after this stage we have one h_base (the input to the base UNet) and a list of h_ctrls (the inputs to the ControlNets) whose length equals the number of ControlNets.
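As an illustration, Stage 1 could look roughly like this inside the forward pass of the new MultiControlUnetConditionModel. The module and variable names mirror the description above and the existing single-ControlNet-XS code; this is a sketch, not the exact implementation:

```python
# Stage 1 (sketch): build the base hidden state and one hidden state per ControlNet.
# `sample` is the noisy latent, `controlnet_cond[i]` the conditioning image for ControlNet i,
# and `conditioning_scale[i]` its injection weight.
h_base = self.base_conv_in(sample)
h_ctrls = []
for i, ctrl in enumerate(self.ctrl_adapters):
    h_ctrl = ctrl.conv_in(sample)
    # Embed this ControlNet's conditioning image and add it onto its own stream.
    h_ctrl = h_ctrl + ctrl.controlnet_cond_embedding(controlnet_cond[i])
    # Zero convolution feeding this ControlNet's signal into the base stream.
    h_base = h_base + ctrl.control_to_base_for_conv_in(h_ctrl) * conditioning_scale[i]
    h_ctrls.append(h_ctrl)
```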
Stage 2: Down and Mid UNet and ControlNet blocks
Each ControlNet has its own base_to_control and control_to_base convolution layers, and the number of base_to_control and control_to_base layers matches the number of base UNet layers.
For each layer, we concatenate h_ctrl with b2c(h_base) as the input to the ControlNet. We probably need to retrain the b2c layers, because h_base is now a linear combination of the original h_base and the h_ctrl outputs of all the ControlNets (previously, h_base only contained the base stream plus one ControlNet output).
After each ResNet and attention block, we add a weighted linear combination of c2b(h_ctrl) over all ControlNets to h_base.
After this stage, we again have one h_base and a list of h_ctrls, as in Stage 1.
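A simplified sketch of one Stage 2 subblock, continuing the Stage 1 sketch above (per-layer bookkeeping is condensed and module names are illustrative):

```python
# Stage 2 (sketch): one encoder (down/mid) subblock per loop iteration.
for layer, (resnet_base, attn_base) in enumerate(zip(self.base_resnets, self.base_attentions)):
    new_h_ctrls = []
    for i, ctrl in enumerate(self.ctrl_adapters):
        # Concatenate b2c(h_base) onto this ControlNet's stream, then run its own subblock.
        h_in = torch.cat([h_ctrls[i], ctrl.base_to_ctrl[layer](h_base)], dim=1)
        h_in = ctrl.resnets[layer](h_in, temb)
        new_h_ctrls.append(ctrl.attentions[layer](h_in, encoder_hidden_states))
    # Run the base subblock, then add the weighted c2b output of every ControlNet.
    h_base = resnet_base(h_base, temb)
    h_base = attn_base(h_base, encoder_hidden_states)
    for i, ctrl in enumerate(self.ctrl_adapters):
        h_base = h_base + ctrl.ctrl_to_base[layer](new_h_ctrls[i]) * conditioning_scale[i]
    h_ctrls = new_h_ctrls
```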
Stage 3: Decoding Stage
In the decoding stage, we only use the control_to_base convolution layers from each ControlNet.
In the following image, the zero convolution layers from each ControlNet are grouped by layer; zero convolution layers drawn with the same dashed color belong to the same group, and the residual outputs are connected with lines of the same color.
For each layer, the ControlNet residuals are passed through their zero convolution layers, weighted, and added to h_base.
After the weighted ControlNet residuals are added, h_base is passed through the base ResNet and attention blocks to decode the image. We do not use the ResNet+attention blocks from each ControlNet's up blocks here (and in fact they sometimes do not exist).
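A sketch of the decoding stage in the same spirit, where `base_skips` and `ctrl_skips[i]` are assumed to be the residual stacks saved by the base UNet and by ControlNet i during Stage 2 (not shown in the Stage 2 sketch):

```python
# Stage 3 (sketch): decoder; only the ControlNets' ctrl_to_base zero convolutions are used.
for layer, (resnet_base, attn_base) in enumerate(zip(self.base_up_resnets, self.base_up_attentions)):
    # Add every ControlNet's stored encoder residual, through its zero conv and weight, onto h_base.
    for i, ctrl in enumerate(self.ctrl_adapters):
        h_base = h_base + ctrl.ctrl_to_base_up[layer](ctrl_skips[i].pop()) * conditioning_scale[i]
    # Usual UNet skip connection from the base encoder, then the base subblock.
    h_base = torch.cat([h_base, base_skips.pop()], dim=1)
    h_base = resnet_base(h_base, temb)
    h_base = attn_base(h_base, encoder_hidden_states)
# The ResNet/attention blocks of the ControlNets' own up blocks (when present) are never called.
```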
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.