@gameofdimension
Contributor

What does this PR do?

Fixes # (issue)

@bghira
Contributor

bghira commented Aug 7, 2024

cc @sayakpaul @linoytsaban @yiyixuxu

@gameofdimension
Contributor Author

gameofdimension commented Aug 8, 2024

it actually adds no noise at all for steps 0 and 1 when using torch.bfloat16. run the following code to reproduce it.

[image: a demo output]

import torch
from diffusers import DDPMScheduler


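def demo(step, latents, noise_scheduler):
    """Return the largest absolute change add_noise makes at timestep `step`."""
    bsz = latents.shape[0]
    noise = torch.randn_like(latents)
    # randint over [step, step + 1) always yields `step`,
    # so every sample in the batch gets the same timestep.
    timesteps = torch.randint(
        step, step+1,
        (bsz,), device=latents.device)
    timesteps = timesteps.long()

    noisy_latents = noise_scheduler.add_noise(
        latents, noise, timesteps)

    # A delta of 0.0 means the latents came back unchanged (no noise added).
    delta = (noisy_latents-latents).abs().max()
    return delta.item()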


def main():
    pretrained_model_name_or_path = '/apps/dat/file/llm/model/stable-diffusion-v1-5'
    # pretrained_model_name_or_path = 'runwayml/stable-diffusion-v1-5'
    noise_scheduler = DDPMScheduler.from_pretrained(
        pretrained_model_name_or_path, subfolder="scheduler")
    bsz = 2
    latents = torch.randn(
        (bsz, 4, 64, 64),
        dtype=torch.bfloat16,
        device='cuda')

    # torch.finfo() with no argument describes the default dtype (float32 unless changed)
    finfo = torch.finfo()
    step = 0
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)
    assert delta < finfo.eps

    step = 1
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)
    assert delta < finfo.eps

    step = 2
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)


if __name__ == '__main__':
    main()

@bghira
Contributor

bghira commented Aug 8, 2024

well, step 0 is timestep 0. and there's an off-by-one error throughout Diffusers because of how the scheduler steps in order with the rest of the steps. so you get to the end of the schedule and there's one prediction missing. it has to add an extra zero sigma step to complete it. especially with DDIM.

Timesteps: tensor([0], device='mps:0')
delta after add step 0 noise 0.0
Timesteps: tensor([1000], device='mps:0')
delta after add step 1000 noise 5.6875

is that what you're observing? it's normal not to add noise when the sigma is zero.

@bghira
Contributor

bghira commented Aug 8, 2024

ah i see. using float32 makes the first value 0.13. but the batch size changes the amount of noise too.

@gameofdimension
Contributor Author

> well, step 0 is timestep 0. and there's an off-by-one error throughout Diffusers because of how the scheduler steps in order with the rest of the steps. so you get to the end of the schedule and there's one prediction missing. it has to add an extra zero sigma step to complete it. especially with DDIM.
>
> Timesteps: tensor([0], device='mps:0')
> delta after add step 0 noise 0.0
> Timesteps: tensor([1000], device='mps:0')
> delta after add step 1000 noise 5.6875
>
> is that what you're observing? it's normal not to add noise when the sigma is zero.

what about step 1

@gameofdimension
Contributor Author

> ah i see. using float32 makes the first value 0.13. but the batch size changes the amount of noise too.

i don't think so. batch size is irrelevant.

@bghira
Contributor

bghira commented Aug 8, 2024

no, it literally changes the result

@gameofdimension
Contributor Author

my conclusion is: for steps 0 and 1, no noise is added when the dtype is torch.bfloat16, regardless of the batch size
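
One plausible explanation for this, sketched in plain Python so it runs without a GPU: if `add_noise` casts the scheduler's `alphas_cumprod` to the dtype of the latents (an assumption about the scheduler internals, not something shown in this thread), then in bfloat16 the first two entries round to exactly 1.0 and the noise coefficient `sqrt(1 - alphas_cumprod[t])` becomes exactly 0. The schedule parameters below (`scaled_linear`, `beta_start=0.00085`, `beta_end=0.012`, 1000 steps) are assumed to match the Stable Diffusion v1.5 scheduler config, and `to_bf16` and `scaled_linear_alphas_cumprod` are illustrative helpers, not library functions.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even upper half
    bits = ((bits + bias) >> 16) << 16  # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def scaled_linear_alphas_cumprod(n_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for an assumed scaled-linear schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    out, prod = [], 1.0
    for i in range(n_steps):
        beta = (lo + (hi - lo) * i / (n_steps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out


alphas_cumprod = scaled_linear_alphas_cumprod()

# Noise coefficient sqrt(1 - alphas_cumprod[t]) at full precision and after
# rounding alphas_cumprod[t] to bfloat16 first.
noise_coeff_full = [math.sqrt(1.0 - a) for a in alphas_cumprod[:3]]
noise_coeff_bf16 = [math.sqrt(1.0 - to_bf16(a)) for a in alphas_cumprod[:3]]

for t in range(3):
    print(t, alphas_cumprod[t], to_bf16(alphas_cumprod[t]),
          noise_coeff_full[t], noise_coeff_bf16[t])
```

bfloat16 values just below 1.0 are spaced 2^-8 apart, so anything above roughly 0.998 rounds up to 1.0. Under these assumptions that covers timesteps 0 and 1 but not timestep 2, which would match the behaviour reported above.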

@bghira
Contributor

bghira commented Aug 8, 2024

    def add_noise(
        self,
        original_samples: torch.Tensor,
        noise: torch.Tensor,
        timesteps: torch.IntTensor,
    ) -> torch.Tensor:
        if original_samples.dtype == torch.bfloat16:
            original_samples = original_samples.to(dtype=torch.float32)
        if noise.dtype == torch.bfloat16:
            noise = noise.to(original_samples.device, dtype=torch.float32)

update DDPM/DDIMScheduler to have this
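
The snippet above stops after the casts. A minimal sketch of how the rest of such a function might look, written as a standalone function so it runs without diffusers: `add_noise_fp32` is a hypothetical name, the schedule parameters are assumed values, and the body follows the usual DDPM forward-noising formula rather than the library's actual code. It upcasts for any input dtype (not only bfloat16) and casts the result back.

```python
import torch


def add_noise_fp32(alphas_cumprod, original_samples, noise, timesteps):
    """Forward-diffusion noising with all arithmetic done in float32 (sketch)."""
    out_dtype = original_samples.dtype
    device = original_samples.device

    # Keep the schedule and both operands in float32 so that values close to
    # 1.0 in alphas_cumprod are not rounded to exactly 1.0.
    alphas_cumprod = alphas_cumprod.to(device=device, dtype=torch.float32)
    samples = original_samples.to(torch.float32)
    noise = noise.to(device=device, dtype=torch.float32)
    timesteps = timesteps.to(device)

    sqrt_alpha_prod = alphas_cumprod[timesteps].sqrt()
    sqrt_one_minus_alpha_prod = (1.0 - alphas_cumprod[timesteps]).sqrt()

    # Reshape (bsz,) to (bsz, 1, 1, ...) so the coefficients broadcast per sample.
    shape = (-1,) + (1,) * (samples.ndim - 1)
    noisy = (sqrt_alpha_prod.view(shape) * samples
             + sqrt_one_minus_alpha_prod.view(shape) * noise)

    # Return the result in the caller's original dtype.
    return noisy.to(out_dtype)


# Usage on CPU with an assumed scaled-linear schedule (hypothetical values).
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000, dtype=torch.float32) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

torch.manual_seed(0)
latents = torch.randn(2, 4, 64, 64, dtype=torch.bfloat16)
noise = torch.randn_like(latents)
timesteps = torch.zeros(2, dtype=torch.long)  # timestep 0 for every sample

noisy = add_noise_fp32(alphas_cumprod, latents, noise, timesteps)
delta = (noisy.float() - latents.float()).abs().max().item()
print("delta after add step 0 noise", delta)
```

Casting back at the end keeps the caller's dtype unchanged, so only the arithmetic inside the function is affected.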

@bghira
Contributor

bghira commented Aug 8, 2024

we're already handling casting on mps using UniPC scheduler but DDPM or DDIM doesn't seem to do it. probably others with the issue

@bghira
Contributor

bghira commented Aug 8, 2024

    noise_scheduler = EulerDiscreteScheduler.from_pretrained(
        pretrained_model_name_or_path, subfolder="scheduler", rescale_betas_zero_snr=True, timestep_spacing="trailing")
    bsz = 1
    print(f"Batch size: {bsz}")
    latents = torch.randn(
        (bsz, 4, 64, 64),
        dtype=torch.bfloat16,
        device='mps')

Batch size: 1
Timesteps: tensor([0], device='mps:0')
delta after add step 0 noise 0.115234375

Euler works on bf16 noise/latents

@gameofdimension
Contributor Author

> we're already handling casting on mps using UniPC scheduler but DDPM or DDIM doesn't seem to do it. probably others with the issue

it seems that DDPM is usually used for training.

@bghira
Contributor

bghira commented Aug 8, 2024

huggingface uses euler for training

@bghira
Contributor

bghira commented Aug 8, 2024

also, this is an inference/training issue, not necessarily just a training one, though at inference it really only applies during img2img and i'm not sure what the implications are. it's definitely worse for training.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@bghira
Contributor

bghira commented Sep 14, 2024

again cc @sayakpaul @linoytsaban @yiyixuxu

@sayakpaul
Member

Sorry for the delay on my part, @gameofdimension! Did you notice the same behaviour for other scripts or is this specific to ControlNet?

@github-actions github-actions bot removed the stale Issues that haven't received updates label Sep 15, 2024
@gameofdimension
Contributor Author

gameofdimension commented Sep 16, 2024

it seems like all calls of noise_scheduler.add_noise are susceptible to this issue, given bf16 training is used

@yiyixuxu yiyixuxu self-assigned this Sep 17, 2024
@github-actions github-actions bot added the stale Issues that haven't received updates label Oct 15, 2024
@a-r-r-o-w a-r-r-o-w removed the stale Issues that haven't received updates label Oct 15, 2024
@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu October 15, 2024 15:06
@yiyixuxu
Collaborator

let's fix this for training first
if we run into issues with inference then we can look into scheduler
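
A training-side workaround along these lines can be sketched as upcasting at the call site and casting the result back to the training dtype. This is an illustration under stated assumptions, not necessarily the merged change: `ToyScheduler` is a hypothetical stand-in that mimics a scheduler whose schedule follows the sample dtype, and `add_noise_upcast` is an illustrative helper name.

```python
import torch


class ToyScheduler:
    """Hypothetical stand-in for a DDPM-style scheduler (not the diffusers class)."""

    def __init__(self, num_steps=1000, beta_start=0.00085, beta_end=0.012):
        betas = torch.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2
        self.alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def add_noise(self, samples, noise, timesteps):
        # Assumed behaviour: the schedule is cast to the dtype of the samples.
        a = self.alphas_cumprod.to(device=samples.device, dtype=samples.dtype)[timesteps]
        shape = (-1,) + (1,) * (samples.ndim - 1)
        return a.sqrt().view(shape) * samples + (1 - a).sqrt().view(shape) * noise


def add_noise_upcast(scheduler, latents, noise, timesteps, out_dtype):
    """Call add_noise on float32 copies, then cast back to the training dtype."""
    noisy = scheduler.add_noise(latents.float(), noise.float(), timesteps)
    return noisy.to(dtype=out_dtype)


torch.manual_seed(0)
scheduler = ToyScheduler()
latents = torch.randn(2, 4, 8, 8, dtype=torch.bfloat16)
noise = torch.randn_like(latents)
timesteps = torch.tensor([0, 1])

direct = scheduler.add_noise(latents, noise, timesteps)
upcast = add_noise_upcast(scheduler, latents, noise, timesteps, torch.bfloat16)

delta_direct = (direct.float() - latents.float()).abs().max().item()
delta_upcast = (upcast.float() - latents.float()).abs().max().item()
print("direct bf16:", delta_direct, "upcast:", delta_upcast)
```

The scheduler itself is left untouched here, which fits the idea of fixing training first and revisiting the schedulers only if inference turns out to be affected.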

@gameofdimension
Contributor Author

> let's fix this for training first
> if we run into issues with inference then we can look into scheduler

what else should i do?

@bghira
Contributor

bghira commented Oct 21, 2024

idk, honestly it's ready for merge. diffusers team, are you all ok?

@bghira
Contributor

bghira commented Oct 21, 2024

@sayakpaul @yiyixuxu this is an issue for all noise addition and makes training with the examples produce worse results

@yiyixuxu yiyixuxu merged commit 63a0c9e into huggingface:main Oct 21, 2024
8 checks passed
sayakpaul added a commit that referenced this pull request Dec 23, 2024
* Update train_controlnet.py

reduce float value error for bfloat16

* Update train_controlnet_sdxl.py

* style

---------

Co-authored-by: Sayak Paul
Co-authored-by: yiyixuxu