In StableAudioPipeline initial_audio_waveforms basically have no effect on output because of latent scaling #10861

@hadaev8

Describe the bug

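I don't think this is intended.
In the default pipeline, pipe.scheduler.init_noise_sigma = 500. The pipeline first scales the noise up by that factor and only then adds the latent of the provided initial_audio_waveforms.
So the latent variable spans roughly [-2000, 2000] while the encoded audio spans roughly [-4, 4], which means the audio barely changes the starting latents.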
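To put a rough number on the mismatch, here is a minimal sketch using only the Python standard library. It is not the pipeline code: the noise is drawn as a unit Gaussian scaled by 500, and the encoded audio is modelled as uniform in [-4, 4], which is an assumption based on the ranges quoted above. The helper name `audio_share_of_latents` is made up for illustration.

```python
import random
import statistics


def audio_share_of_latents(init_noise_sigma, audio_amplitude=4.0, n=10_000, seed=0):
    """Ratio of the audio term's standard deviation to that of the summed latents."""
    rng = random.Random(seed)
    # Noise scaled by the scheduler's sigma, as described in the report.
    noise = [rng.gauss(0.0, 1.0) * init_noise_sigma for _ in range(n)]
    # Stand-in for the encoded audio latents (assumed range, not real data).
    audio = [rng.uniform(-audio_amplitude, audio_amplitude) for _ in range(n)]
    latents = [a + z for a, z in zip(audio, noise)]
    return statistics.pstdev(audio) / statistics.pstdev(latents)


share_at_500 = audio_share_of_latents(500.0)
share_at_1 = audio_share_of_latents(1.0)
print(f"sigma=500: {share_at_500:.4f}, sigma=1: {share_at_1:.4f}")
```

Under these assumptions the audio accounts for well under 1% of the spread of the starting latents at sigma = 500, while at sigma = 1 it would dominate.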

I think the correct math here would be:

latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
encoded_audio = ...
latents = (encoded_audio + latents) / 2
# scale the initial noise by the standard deviation required by the scheduler
latents = latents * self.scheduler.init_noise_sigma
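The difference between the two orderings can be sketched without torch. This is only an illustration on plain Python lists, with the encoded audio again assumed to be uniform in [-4, 4]; `current_order`, `proposed_order`, and `pearson` are hypothetical helpers, not diffusers functions. It compares how strongly the starting latents correlate with the audio in each case.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def current_order(audio, noise, sigma):
    # Behaviour described in the report: scale the noise, then add the audio.
    return [z * sigma + a for a, z in zip(audio, noise)]


def proposed_order(audio, noise, sigma):
    # Ordering suggested in the report: mix first, then scale the mixture.
    return [(a + z) / 2 * sigma for a, z in zip(audio, noise)]


rng = random.Random(0)
n = 10_000
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
audio = [rng.uniform(-4.0, 4.0) for _ in range(n)]  # assumed range, not real data
sigma = 500.0

corr_current = pearson(audio, current_order(audio, noise, sigma))
corr_proposed = pearson(audio, proposed_order(audio, noise, sigma))
print(f"current: {corr_current:.3f}, proposed: {corr_proposed:.3f}")
```

With the current ordering the latents are almost uncorrelated with the audio, whereas mixing before scaling keeps the audio clearly visible in the starting point. Whether that is the right amount of influence for the scheduler is a separate question this sketch does not answer.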

Reproduction

import torch
import soundfile as sf
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained("stabilityai/stable-audio-open-1.0", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

prompt = "ominous primordial melodies"
negative_prompt = "Low quality."

# sf.read returns a NumPy array of shape (frames, channels); convert it to a
# (batch, channels, frames) tensor before handing it to the pipeline
data, samplerate = sf.read('path to some audio', dtype="float32", always_2d=True)
waveform = torch.from_numpy(data).T.unsqueeze(0)

generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    num_waveforms_per_prompt=1,
    initial_audio_waveforms=waveform.to("cuda").to(torch.bfloat16),
    # assumes the file is already sampled at the VAE's sampling rate
    initial_audio_sampling_rate=pipe.vae.sampling_rate,
    generator=generator,
).audios

System Info

diffusers version 0.32.2
