Conversation

@yiyixuxu (Collaborator) commented Dec 1, 2024

  1. This PR fixes offloading for the SD3.5 controlnet so that it behaves like the other controlnets (the controlnet and the transformer both stay on the GPU for the entire denoising loop). Let me know if that's OK @vladmandic; if you want to offload/load the controlnet on each iteration instead, I'm happy to run the experiment to see the trade-off too, but I think it will be very slow (see the hypothetical sketch after the test output below).
  2. I should have considered the offloading/device-map use case and added the pos_embed weights to the controlnet checkpoint, so we wouldn't have to handle them specially like this; I made a note about it.
  3. device_map does not work for all controlnets (i.e., for any use case where components are passed to from_pretrained). I will fix this in a follow-up PR.
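
To double-check point 1, here is a minimal sketch (not part of the PR) that logs module placement on the first denoising step via the pipeline's `callback_on_step_end` hook; it assumes a `pipe` built as in the test script below:

```python
# Minimal sketch: confirm the controlnet and transformer sit on the GPU
# while the denoising loop runs, by logging their devices on the first step.
def check_devices(pipe, step, timestep, callback_kwargs):
    if step == 0:
        print(f"controlnet: {pipe.controlnet.device}, transformer: {pipe.transformer.device}")
    return callback_kwargs

# pass callback_on_step_end=check_devices to the pipe(...) call in the script below
```

Test script and memory numbers: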
```python
import torch
from diffusers import StableDiffusion3ControlNetPipeline, SD3ControlNetModel
from diffusers.utils import load_image


def run_pipeline(test="offload"):
    """Run SD3.5 controlnet inference with one of three placement strategies:
    "offload" (enable_model_cpu_offload), "cuda" (everything on the GPU),
    or "device_map" (balanced device map)."""
    print(" ")
    print(f"testing pipeline with {test}")

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print("Loading pipeline")
    controlnet = SD3ControlNetModel.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large-controlnet-depth", torch_dtype=torch.float16
    )
    pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3.5-large",
        controlnet=controlnet,
        torch_dtype=torch.float16,
        device_map="balanced" if test == "device_map" else None,
    )

    if test == "offload":
        pipe.enable_model_cpu_offload()
    elif test == "cuda":
        pipe.to("cuda")

    peak_mem = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak memory after loading: {peak_mem:.2f} GB")

    control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/marigold/marigold_einstein_lcm_depth.png")
    generator = torch.Generator(device="cpu").manual_seed(0)

    print("Running pipeline")
    try:
        image = pipe(
            prompt="a photo of a man",
            control_image=control_image,
            guidance_scale=4.5,
            num_inference_steps=40,
            generator=generator,
            max_sequence_length=77,
        ).images[0]
        final_peak_mem = torch.cuda.max_memory_allocated() / 1024**3
        print(f"Peak memory after inference: {final_peak_mem:.2f} GB")
        image.save(f"yiyi_test_1_{test}_out.png")
    except Exception as e:
        # On failure, report where each module ended up to debug device placement.
        print(e)
        for name, module in pipe.components.items():
            if isinstance(module, torch.nn.Module):
                print(f" -{name}: {module.device}")
        print(" ")


run_pipeline(test="offload")
run_pipeline(test="cuda")
# run_pipeline(test="device_map")  # disabled: device_map is still broken here (see point 3)
```
Output:

```
testing pipeline with offload
Loading pipeline
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.28it/s]
Loading pipeline components...:  89%|███████████████████████████████████████████████████████████████████████████████████████           | 8/9 [00:04<00:00,  1.90it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:04<00:00,  1.91it/s]
Peak memory after loading: 0.00 GB
Running pipeline
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [01:18<00:00,  1.97s/it]
Peak memory after inference: 22.72 GB

testing pipeline with cuda
Loading pipeline
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.12it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:05<00:00,  1.57it/s]
Peak memory after loading: 32.28 GB
Running pipeline
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:30<00:00,  1.31it/s]
Peak memory after inference: 35.37 GB
```
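
So with this PR, model CPU offload brings peak memory from ~35.4 GB down to ~22.7 GB at roughly 2.6x the per-step time (1.97 s/it vs. 0.76 s/it). For reference, the per-iteration alternative mentioned in point 1 could be emulated with plain PyTorch forward hooks; this is a hypothetical sketch (it is not what this PR does, and it ignores interactions with accelerate's offload hooks), shown only to illustrate why it is expected to be much slower: every denoising step would pay a full CPU-to-GPU transfer of the controlnet weights and back.

```python
# Hypothetical sketch: offload/onload the controlnet around each forward pass.
def onload(module, args, kwargs):
    module.to("cuda")  # move weights to the GPU just before the forward pass

def offload(module, args, kwargs, output):
    module.to("cpu")  # move weights back to the CPU right after
    return output

pipe.controlnet.register_forward_pre_hook(onload, with_kwargs=True)
pipe.controlnet.register_forward_hook(offload, with_kwargs=True)
```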

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yiyixuxu requested a review from DN6 on Dec 1, 2024 at 21:20
@vladmandic (Contributor) commented:
thanks @yiyixuxu! did a quick test and works as expected

@yiyixuxu merged commit cd34439 into main on Dec 2, 2024 (18 checks passed).
@yiyixuxu deleted the controlnet-offload branch on Dec 2, 2024 at 20:11.
lawrence-cj pushed a commit to lawrence-cj/diffusers that referenced this pull request Dec 4, 2024
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024