Open
Labels
bug, stale
Description
Describe the bug
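I don't think this behavior is intended.
In the default pipeline, pipe.scheduler.init_noise_sigma = 500. The pipeline first scales the noise up by that value and only then adds the latent encoded from the provided initial_audio_waveforms.
As a result, the noise latents span roughly [-2000, 2000] while the encoded audio spans roughly [-4, 4], so the encoded audio is negligible by comparison.
I think the correct math would be: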
latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
encoded_audio = ...
latents = (encoded_audio + latents) / 2
# scale the initial noise by the standard deviation required by the scheduler
latents = latents * self.scheduler.init_noise_sigma
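To make the scale mismatch concrete, here is a minimal sketch that compares the two orderings numerically. It uses only the Python standard library so it runs without a GPU or a model download. The unit-variance `encoded_audio` values are an assumption standing in for real VAE latents, and `pearson` is a hypothetical helper written for this illustration, not part of diffusers.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
n = 10_000
init_noise_sigma = 500.0  # value reported above for the default scheduler

noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Assumption: unit-variance stand-in for the encoded initial audio.
encoded_audio = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Current order: scale the noise first, then add the encoded audio.
current = [e + z * init_noise_sigma for e, z in zip(encoded_audio, noise)]
# Proposed order: average audio and noise, then scale the result.
proposed = [(e + z) / 2 * init_noise_sigma for e, z in zip(encoded_audio, noise)]

# How much of the encoded audio survives in the starting latents?
corr_current = pearson(current, encoded_audio)
corr_proposed = pearson(proposed, encoded_audio)
print(f"current order:  corr = {corr_current:+.3f}")
print(f"proposed order: corr = {corr_proposed:+.3f}")
```

Under these assumptions the starting latents in the current order are almost uncorrelated with the encoded audio, while the proposed order should give a correlation near 1/sqrt(2). Whether an equal-weight average is the right mix for this scheduler is a separate question that this sketch does not answer.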
Reproduction
import torch
import soundfile as sf
from diffusers import StableAudioPipeline
pipe = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
prompt = "ominous primordial melodies"
negative_prompt = "Low quality."
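# soundfile returns a NumPy array shaped (frames, channels); convert it to a (batch, channels, frames) tensor
data, samplerate = sf.read('path to some audio', always_2d=True)
data = torch.from_numpy(data.T).unsqueeze(0)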
generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    num_waveforms_per_prompt=1,
    initial_audio_waveforms=data.to("cuda").to(torch.bfloat16),
    initial_audio_sampling_rate=pipe.vae.sampling_rate,
    generator=generator,
).audios
Logs
System Info
diffusers version 0.32.2
Who can help?
No response