[bugfix] reduce float value error when adding noise #9004
reduce float value error for bfloat16
|
it actually adds no noise at all for step 0/1 when using torch.bfloat16. run the following code to reproduce that.
import torch
from diffusers import DDPMScheduler


def demo(step, latents, noise_scheduler):
    # add noise for a single fixed timestep and return the largest change
    bsz = latents.shape[0]
    noise = torch.randn_like(latents)
    # randint(step, step+1) always yields `step`
    timesteps = torch.randint(
        step, step+1,
        (bsz,), device=latents.device)
    timesteps = timesteps.long()
    noisy_latents = noise_scheduler.add_noise(
        latents, noise, timesteps)
    delta = (noisy_latents-latents).abs().max()
    return delta.item()


def main():
    pretrained_model_name_or_path = '/apps/dat/file/llm/model/stable-diffusion-v1-5'
    # pretrained_model_name_or_path = 'runwayml/stable-diffusion-v1-5'
    noise_scheduler = DDPMScheduler.from_pretrained(
        pretrained_model_name_or_path, subfolder="scheduler")
    bsz = 2
    latents = torch.randn(
        (bsz, 4, 64, 64),
        dtype=torch.bfloat16,
        device='cuda')
    # eps of the default dtype (float32), used as the "no change" threshold
    finfo = torch.finfo()

    # steps 0 and 1: the latents come back unchanged
    step = 0
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)
    assert delta < finfo.eps

    step = 1
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)
    assert delta < finfo.eps

    # step 2: noise is actually added, so no assert here
    step = 2
    delta = demo(step, latents, noise_scheduler)
    print(f"delta after add step {step} noise", delta)


if __name__ == '__main__':
    main()
|
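A rough sketch of why steps 0 and 1 come back unchanged, using only the standard library. It assumes two things that are not stated in this thread: that the scheduler casts its cumulative-alpha table to the sample dtype before taking square roots, and that the checkpoint uses a "scaled_linear" schedule with `beta_start=0.00085`, `beta_end=0.012` and 1000 steps. `to_bfloat16`, `scaled_linear_betas` and `alphas_cumprod` are hypothetical helpers written for this illustration, not Diffusers functions.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # ties go to the even mantissa
    bits = ((bits + bias) >> 16) << 16                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    # Assumed schedule: linear in sqrt(beta), then squared.
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


def alphas_cumprod(betas):
    # Running product of (1 - beta_t).
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


acp = alphas_cumprod(scaled_linear_betas())
for t in range(3):
    rounded = to_bfloat16(acp[t])
    noise_scale = (1.0 - rounded) ** 0.5  # coefficient that multiplies the noise
    print(f"t={t} alpha_cumprod={acp[t]:.6f} bf16={rounded} noise_scale={noise_scale:.4f}")
# Under these assumptions, t=0 and t=1 round to exactly 1.0 in bfloat16,
# so the noise coefficient is 0; t=2 is the first step that rounds below 1.0.
```

Just below 1.0, neighbouring bfloat16 values are 2^-8 apart, so any cumulative alpha above roughly 0.998 collapses to 1.0. That would match the demo, where only step 2 shows a non-zero delta.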
well, step 0 is timestep 0. and there's an off-by-one error throughout Diffusers because of how the scheduler steps in order with the rest of the steps. so you get to the end of the schedule and there's one prediction missing. it has to add an extra zero sigma step to complete it. especially with DDIM.
Timesteps: tensor([0], device='mps:0')
delta after add step 0 noise 0.0
Timesteps: tensor([1000], device='mps:0')
delta after add step 1000 noise 5.6875
is that what you're observing? it's normal not to add noise when the sigma is zero. |
|
ah i see. using float32 makes the first value 0.13. but the batch size changes the amount of noise too. |
what about step 1 |
i don't think so. batch size is irrelevant. |
|
no, it literally changes the result |
|
my conclusion is, for step 0/1 you didn't add any noise when it's torch.bfloat16, regardless of any value of batch size |
def add_noise(
    self,
    original_samples: torch.Tensor,
    noise: torch.Tensor,
    timesteps: torch.IntTensor,
) -> torch.Tensor:
    if original_samples.dtype == torch.bfloat16:
        original_samples = original_samples.to(dtype=torch.float32)
    if noise.dtype == torch.bfloat16:
        noise = noise.to(original_samples.device, dtype=torch.float32)
update DDPM/DDIMScheduler to have this |
|
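The upcast idea from the snippet above can be tried without Diffusers. This is a minimal sketch, not the scheduler's actual code: `make_alphas_cumprod`, `add_noise` and `add_noise_upcast` are hypothetical stand-ins, the beta values are assumed, and `add_noise` assumes the coefficients are cast to the sample dtype before the square roots. It runs on CPU.

```python
import torch


def make_alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    # Assumed "scaled_linear" schedule; the numbers are illustrative.
    betas = torch.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps,
                           dtype=torch.float32) ** 2
    return torch.cumprod(1.0 - betas, dim=0)


def add_noise(alphas_cumprod, samples, noise, timesteps):
    # Forward-diffusion formula with coefficients in the sample dtype,
    # so bf16 samples get bf16 coefficients.
    acp = alphas_cumprod.to(device=samples.device, dtype=samples.dtype)[timesteps]
    sqrt_alpha = torch.sqrt(acp).view(-1, 1, 1, 1)
    sqrt_one_minus_alpha = torch.sqrt(1 - acp).view(-1, 1, 1, 1)
    return sqrt_alpha * samples + sqrt_one_minus_alpha * noise


def add_noise_upcast(alphas_cumprod, samples, noise, timesteps):
    # Do the arithmetic in float32, then cast back to the original dtype.
    noisy = add_noise(alphas_cumprod, samples.float(), noise.float(), timesteps)
    return noisy.to(samples.dtype)


torch.manual_seed(0)
acp = make_alphas_cumprod()
latents = torch.randn(2, 4, 8, 8, dtype=torch.bfloat16)
noise = torch.randn_like(latents)
t = torch.zeros(2, dtype=torch.long)  # timestep 0 for every sample

delta_bf16 = (add_noise(acp, latents, noise, t) - latents).abs().max().item()
delta_upcast = (add_noise_upcast(acp, latents, noise, t) - latents).abs().max().item()
print("bf16 coefficients:   ", delta_bf16)    # 0.0 under these assumptions
print("float32 coefficients:", delta_upcast)  # non-zero
```

In a training script the same effect could be had at the call site by passing `latents.float()` and `noise.float()` to the scheduler and casting the result back, which avoids touching the scheduler classes.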
we're already handling casting on mps using UniPC scheduler but DDPM or DDIM doesn't seem to do it. probably others with the issue |
Euler works on bf16 noise/latents |
it seems that DDPM is usually used for training. |
|
huggingface uses euler for training |
|
also this is an inference/training issue not necessarily just for training.. though it really only applies during img2img then and i'm not sure what the implications are. it's definitely worse for training. |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
|
again cc @sayakpaul @linoytsaban @yiyixuxu |
|
Sorry for the delay on my part, @gameofdimension! Did you notice the same behaviour for other scripts or is this specific to ControlNet? |
|
it seems like all calls of |
|
let's fix this for training first |
what else should i do? |
|
idk honestly its ready for merge. diffusers team are you all ok? |
|
@sayakpaul @yiyixuxu this is an issue for all noise addition and makes training with the examples produce worse results |
|
* Update train_controlnet.py reduce float value error for bfloat16
* Update train_controlnet_sdxl.py
* style

Co-authored-by: Sayak Paul
Co-authored-by: yiyixuxu
