Conversation

@yiyixuxu (Collaborator) commented Dec 1, 2024

  1. This PR fixes offloading for the SD3.5 controlnet so that it behaves like the other controlnets (the controlnet and the transformer both stay on the GPU for the entire denoising loop). Let me know if that's OK @vladmandic; if you want to offload/load the controlnet on each iteration instead, I'm happy to run the experiment to see the trade-off too, but I think it will be very slow (see the hypothetical sketch after the test output below).
  2. I should have considered the offloading/device-map use case and added the pos_embed weights to the controlnet checkpoint, so we wouldn't have to handle them specially like this; I made a note about it.
  3. device_map does not work for all controlnets (i.e., for any use case where components are passed to from_pretrained). I will fix this in a follow-up PR.
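
To double-check point 1, here is a minimal sketch (not part of the PR) that logs module placement on the first denoising step via the pipeline's `callback_on_step_end` hook; it assumes a `pipe` built as in the test script below:

```python
# Minimal sketch: confirm the controlnet and transformer sit on the GPU
# while the denoising loop runs, by logging their devices on the first step.
def check_devices(pipe, step, timestep, callback_kwargs):
    if step == 0:
        print(f"controlnet: {pipe.controlnet.device}, transformer: {pipe.transformer.device}")
    return callback_kwargs

# pass callback_on_step_end=check_devices to the pipe(...) call in the script below
```

Test script and memory numbers: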
```python
import torch
from diffusers import StableDiffusion3ControlNetPipeline, SD3ControlNetModel
from diffusers.utils import load_image


def run_pipeline(test="offload"):
    """Run SD3.5 controlnet inference with one of three placement strategies:
    "offload" (enable_model_cpu_offload), "cuda" (everything on the GPU),
    or "device_map" (balanced device map)."""
    print(" ")
    print(f"testing pipeline with {test}")

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print("Loading pipeline")
    controlnet = SD3ControlNetModel.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large-controlnet-depth", torch_dtype=torch.float16
    )
    pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        device_map="balanced" if test == "device_map" else None,
    )

    if test == "offload":
        pipe.enable_model_cpu_offload()
    elif test == "cuda":
        pipe.to("cuda")

    peak_mem = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory after loading: {peak_mem:.2f} GB")

    control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png")
    generator = torch.Generator(device="cpu").manual_seed(0)

    print("Running pipeline")
    try:
        image = pipe(
            prompt="a photo of a man",
            control_image=control_image,
            guidance_scale=4.5,
            num_inference_steps=40,
            generator=generator,
            max_sequence_length=77,
        ).images[0]
        final_peak_mem = torch.cuda.max_memory_allocated() / 1024**3
        print(f"Peak memory after inference: {final_peak_mem:.2f} GB")
        image.save(f"yiyi_test_1_{test}_out.png")
    except Exception as e:
        # On failure, report where each module ended up to debug device placement.
        print(e)
        for name, module in pipe.components.items():
            if isinstance(module, torch.nn.Module):
                print(f" -{name}: {module.device}")
        print(" ")


run_pipeline(test="offload")
run_pipeline(test="cuda")
# run_pipeline(test="device_map")  # disabled: device_map is still broken here (see point 3)
```
Output:

```
testing pipeline with offload
Loading pipeline
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.28it/s]
Loading pipeline components...:  89%|███████████████████████████████████████████████████████████████████████████████████████           | 8/9 [00:04<00:00,  1.90it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:04<00:00,  1.91it/s]
Peak memory after loading: 0.00 GB
Running pipeline
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [01:18<00:00,  1.97s/it]
Peak memory after inference: 22.72 GB

testing pipeline with cuda
Loading pipeline
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.12it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:05<00:00,  1.57it/s]
Peak memory after loading: 32.28 GB
Running pipeline
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:30<00:00,  1.31it/s]
Peak memory after inference: 35.37 GB
```
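
So with this PR, model CPU offload brings peak memory from ~35.4 GB down to ~22.7 GB at roughly 2.6x the per-step time (1.97 s/it vs. 0.76 s/it). For reference, the per-iteration alternative mentioned in point 1 could be emulated with plain PyTorch forward hooks; this is a hypothetical sketch (it is not what this PR does, and it ignores interactions with accelerate's offload hooks), shown only to illustrate why it is expected to be much slower: every denoising step would pay a full CPU-to-GPU transfer of the controlnet weights and back.

```python
# Hypothetical sketch: offload/onload the controlnet around each forward pass.
def onload(module, args, kwargs):
    module.to("cuda")  # move weights to the GPU just before the forward pass

def offload(module, args, kwargs, output):
    module.to("cpu")  # move weights back to the CPU right after
    return output

pipe.controlnet.register_forward_pre_hook(onload, with_kwargs=True)
pipe.controlnet.register_forward_hook(offload, with_kwargs=True)
```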

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu requested a review from DN6 on Dec 1, 2024 at 21:20
@vladmandic (Contributor) commented:
thanks @yiyixuxu! did a quick test and works as expected

@yiyixuxu merged commit cd34439 into main on Dec 2, 2024 (18 checks passed).
@yiyixuxu deleted the controlnet-offload branch on Dec 2, 2024 at 20:11.
lawrence-cj pushed a commit to lawrence-cj/diffusers that referenced this pull request Dec 4, 2024
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024