Add support for Multiple ControlNetXSAdapters in SDXL pipeline #12100
What does this PR do?
This PR addresses the feature request from an open good-first issue: #8434. It extends the current ControlNet adapter logic to support multiple ControlNet adapters injected into the diffusion model.
Before this change, StableDiffusionXLControlNetXSPipeline loads the base UNet model and only supports single-point injection from one ControlNet, as shown below.
With this change, StableDiffusionXLControlNetXSPipeline can take a new UNet conditioning model, MultiControlUnetConditionModel, which loads weights from multiple ControlNets and injects every ControlNet's output into the base model through zero convolution layers.
Since we did not find an existing test repo to verify this change, we used the following code (not included in this repo) to verify it.
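The snippet below is a minimal sketch of the kind of verification run we mean. The checkpoint IDs, conditioning images, and the list-valued `controlnet` / `controlnet_conditioning_scale` arguments are illustrative assumptions tied to this PR, not existing repo contents:

```python
import torch
from diffusers import StableDiffusionXLControlNetXSPipeline, ControlNetXSAdapter
from diffusers.utils import load_image

# Two ControlNet-XS adapters; the repo IDs below are placeholders for illustration.
controlnet_canny = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnetxs-sdxl-canny", torch_dtype=torch.float16
)
controlnet_depth = ControlNetXSAdapter.from_pretrained(
    "path/to/controlnetxs-sdxl-depth", torch_dtype=torch.float16
)

# Passing a list of adapters is what would trigger the new MultiControlUnetConditionModel path (assumed API).
pipe = StableDiffusionXLControlNetXSPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=[controlnet_canny, controlnet_depth],
    torch_dtype=torch.float16,
).to("cuda")

canny_image = load_image("path/to/canny_condition.png")  # placeholder conditioning images
depth_image = load_image("path/to/depth_condition.png")

image = pipe(
    prompt="a futuristic cityscape at dusk",
    image=[canny_image, depth_image],           # one conditioning image per adapter
    controlnet_conditioning_scale=[0.7, 0.5],   # per-adapter injection weights
    num_inference_steps=30,
).images[0]
image.save("multi_controlnetxs_sdxl.png")
```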
Design Details:
Stage 1: Prepare embeddings for base UNet and ControlNets
Each ControlNet has its own controlnet_cond_embedding module and control_to_base_for_conv_in module, which compute the control embedding and add it onto h_base.
With the new change, after this stage we have one h_base (the input to the base UNet) and a list of h_ctrls (the inputs to the ControlNets) whose length equals the number of ControlNets.
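As an illustration, Stage 1 could look roughly like this inside the forward pass of the new MultiControlUnetConditionModel. The module and variable names mirror the description above and the existing single-ControlNet-XS code; this is a sketch, not the exact implementation:

```python
# Stage 1 (sketch): build the base hidden state and one hidden state per ControlNet.
# `sample` is the noisy latent, `controlnet_cond[i]` the conditioning image for ControlNet i,
# and `conditioning_scale[i]` its injection weight.
h_base = self.base_conv_in(sample)
h_ctrls = []
for i, ctrl in enumerate(self.ctrl_adapters):
    h_ctrl = ctrl.conv_in(sample)
    # Embed this ControlNet's conditioning image and add it onto its own stream.
    h_ctrl = h_ctrl + ctrl.controlnet_cond_embedding(controlnet_cond[i])
    # Zero convolution feeding this ControlNet's signal into the base stream.
    h_base = h_base + ctrl.control_to_base_for_conv_in(h_ctrl) * conditioning_scale[i]
    h_ctrls.append(h_ctrl)
```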
Stage 2: Down and Mid UNet and ControlNet blocks
Each ControlNet has its own base_to_control and control_to_base convolution layers, and the number of base_to_control and control_to_base layers matches the number of base UNet layers.
For each layer, we concatenate h_ctrl with b2c(h_base) as the input to the ControlNet. We probably need to retrain the b2c layers, because h_base is now a linear combination of the original h_base and the h_ctrl outputs of all the ControlNets (previously, h_base only contained the base stream plus one ControlNet output).
After each ResNet and attention block, we add a weighted linear combination of c2b(h_ctrl) over all ControlNets to h_base.
After this stage, we again have one h_base and a list of h_ctrls, as in Stage 1.
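A simplified sketch of one Stage 2 subblock, continuing the Stage 1 sketch above (per-layer bookkeeping is condensed and module names are illustrative):

```python
# Stage 2 (sketch): one encoder (down/mid) subblock per loop iteration.
for layer, (resnet_base, attn_base) in enumerate(zip(self.base_resnets, self.base_attentions)):
    new_h_ctrls = []
    for i, ctrl in enumerate(self.ctrl_adapters):
        # Concatenate b2c(h_base) onto this ControlNet's stream, then run its own subblock.
        h_in = torch.cat([h_ctrls[i], ctrl.base_to_ctrl[layer](h_base)], dim=1)
        h_in = ctrl.resnets[layer](h_in, temb)
        new_h_ctrls.append(ctrl.attentions[layer](h_in, encoder_hidden_states))
    # Run the base subblock, then add the weighted c2b output of every ControlNet.
    h_base = resnet_base(h_base, temb)
    h_base = attn_base(h_base, encoder_hidden_states)
    for i, ctrl in enumerate(self.ctrl_adapters):
        h_base = h_base + ctrl.ctrl_to_base[layer](new_h_ctrls[i]) * conditioning_scale[i]
    h_ctrls = new_h_ctrls
```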
Stage 3: Decoding Stage
In the decoding stage, we only use the control_to_base convolution layers from each ControlNet.
In the following image, the zero convolution layers from each ControlNet are grouped by layer; zero convolution layers drawn with the same dashed color belong to the same group, and the residual outputs are connected with lines of the same color.
For each layer, the ControlNet residuals are passed through their zero convolution layers, weighted, and added to h_base.
After the weighted ControlNet residuals are added, h_base is passed through the base ResNet and attention blocks to decode the image. We do not use the ResNet+attention blocks from each ControlNet's up blocks here (and in fact they sometimes do not exist).
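A sketch of the decoding stage in the same spirit, where `base_skips` and `ctrl_skips[i]` are assumed to be the residual stacks saved by the base UNet and by ControlNet i during Stage 2 (not shown in the Stage 2 sketch):

```python
# Stage 3 (sketch): decoder; only the ControlNets' ctrl_to_base zero convolutions are used.
for layer, (resnet_base, attn_base) in enumerate(zip(self.base_up_resnets, self.base_up_attentions)):
    # Add every ControlNet's stored encoder residual, through its zero conv and weight, onto h_base.
    for i, ctrl in enumerate(self.ctrl_adapters):
        h_base = h_base + ctrl.ctrl_to_base_up[layer](ctrl_skips[i].pop()) * conditioning_scale[i]
    # Usual UNet skip connection from the base encoder, then the base subblock.
    h_base = torch.cat([h_base, base_skips.pop()], dim=1)
    h_base = resnet_base(h_base, temb)
    h_base = attn_base(h_base, encoder_hidden_states)
# The ResNet/attention blocks of the ControlNets' own up blocks (when present) are never called.
```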
Before submitting
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.