Conversation

@seotaekkong

What does this PR do?

This PR introduces a new boolean flag, scale_betas_for_timesteps, to the DDPMScheduler. This flag provides an optional, more robust way to handle the beta schedule when num_train_timesteps is set to a value other than the default of 1000.

Motivation and Context

The default parameters for the DDPMScheduler (beta_start=0.0001, beta_end=0.02) are implicitly tuned for num_train_timesteps=1000. This creates a potential "usability trap" for practitioners who may change the number of training timesteps without realizing they should also adjust the beta range.

  • If a user sets num_train_timesteps to a large value (e.g., 4000), the linear beta schedule becomes too shallow and noise is added too slowly.
  • If num_train_timesteps is set to a small value (e.g., 200), the schedule becomes too steep and noise is added too aggressively.

Both scenarios can lead to suboptimal training performance that is difficult to debug.

Proposed Solution

This PR introduces an opt-in solution to this problem.

  • A new flag, scale_betas_for_timesteps, is added to the scheduler's __init__ method.
  • It defaults to False to ensure full backward compatibility with existing code.
  • When set to True, it automatically scales the beta_end parameter using a simple heuristic, beta_end * (1000 / num_train_timesteps), so that the overall noise schedule remains sensible regardless of the number of training steps chosen by the user.
  • The scaled beta_end is used by schedules that depend on it (e.g., linear, scaled_linear), while schedules that do not use this parameter (e.g., squaredcos_cap_v2) are naturally unaffected.

This change makes the scheduler more intuitive and helps prevent common configuration errors.
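As a sketch of the opt-in behavior described above (the standalone helper name here is hypothetical; in the actual PR this logic would live inside DDPMScheduler.__init__):

```python
def effective_beta_end(
    beta_end: float = 0.02,
    num_train_timesteps: int = 1000,
    scale_betas_for_timesteps: bool = False,
) -> float:
    """Return the beta_end the schedule should actually use.

    With the flag off (the default), behavior is unchanged. With the flag
    on, beta_end is rescaled so the schedule's total noise budget matches
    the num_train_timesteps=1000 default.
    """
    if scale_betas_for_timesteps:
        return beta_end * (1000 / num_train_timesteps)
    return beta_end
```

For example, with num_train_timesteps=200 and the flag enabled, the effective beta_end becomes 0.02 * (1000 / 200) = 0.1.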

Fixes # (issue)

Before submitting

Who can review?

As suggested by the contribution guide for schedulers: @yiyixuxu

@seotaekkong
Author

Hi, I wanted to add a quick comment with some further justification for this change. The PR addresses a subtle but critical issue where changing num_train_timesteps can cause the asymptotic variance of the forward process to deviate from unity, breaking the assumption that $x_T$ matches the standard Gaussian prior that the reverse process starts from.

Problem

Using the standard notation

$$\bar{\alpha}_t = \prod_{i=1}^t (1 - \beta_i)$$

the variance of the final state $x_T$ is given by $\mathrm{Var}(x_T) = 1 - \bar{\alpha}_T$. For the sampling process to match a standard Gaussian prior $\mathcal{N}(0, I)$, we require $\mathrm{Var}(x_T) \approx 1$. The value $\bar{\alpha}_T$ is controlled by the sum of the betas, since $-\log \bar{\alpha}_T \approx \sum_{i=1}^T \beta_i$ for small $\beta_i$. If this sum is too small, $\bar{\alpha}_T$ will not be close to zero and the variance of $x_T$ will fall short of one.

How the Current Implementation Fails

The current implementation leads to an inconsistent $\sum_i \beta_i$ when $T$ is changed, which breaks the unit variance assumption.

  • Default $T = 1000$: $\sum_i \beta_i \approx 10.05$, giving $\mathrm{Var}(x_T) \approx 1$.
  • Naive change to $T = 200$: $\sum_i \beta_i \approx 2.01$, which is far too small. This gives $\mathrm{Var}(x_T) \approx 0.87$, so the terminal state no longer matches the standard Gaussian prior, which can cause significant issues during reverse sampling.
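The figures above can be reproduced with a few lines of NumPy (a standalone check, not diffusers code):

```python
import numpy as np

def schedule_stats(T: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Sum of betas and Var(x_T) = 1 - alpha_bar_T for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar_T = np.prod(1.0 - betas)
    return betas.sum(), 1.0 - alpha_bar_T

for T in (1000, 200):
    s, var = schedule_stats(T)
    print(f"T={T}: sum(betas)={s:.2f}, Var(x_T)={var:.2f}")
# T=1000: sum(betas)=10.05, Var(x_T)=1.00
# T=200: sum(betas)=2.01, Var(x_T)=0.87
```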

How the Proposed Fix Works

The proposed fix of scaling beta_end ensures the sum of betas remains approximately constant, thereby preserving the unit variance of the final state.

  • With $T = 200$, beta_end is scaled to 0.1 and $\sum_i \beta_i \approx 10.01$.
  • This ensures $\bar{\alpha}_T \approx 0$ and in turn $\mathrm{Var}(x_T) \approx 1$, preserving the mathematical integrity of the diffusion process.

When the flag is enabled, the scheduler therefore produces a theoretically sound noise schedule, sparing users from manually correcting for variance issues when experimenting with different numbers of training timesteps.
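These numbers can be verified directly with NumPy (a standalone check; the scaled beta_end for $T = 200$ is computed by hand here):

```python
import numpy as np

# With the heuristic, T=200 uses beta_end = 0.02 * (1000 / 200) = 0.1
betas = np.linspace(1e-4, 0.1, 200)
alpha_bar_T = np.prod(1.0 - betas)
print(f"sum(betas) = {betas.sum():.2f}")      # 10.01
print(f"Var(x_T)   = {1 - alpha_bar_T:.4f}")  # ~1.0000
```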

@seotaekkong
Author

Hi @yiyixuxu ,

I wanted to gently follow up on this PR and provide some new empirical evidence for the practical impact of this change. I ran a simple experiment training a diffusion model on the butterfly dataset with num_train_timesteps=200, where the only difference between the two runs was the scale_betas_for_timesteps flag.

The images below show a clear improvement in sample quality, directly confirming the theoretical motivation for this PR.

[Images: scale_betas_false (Image 1: Before Fix) vs. scale_betas_true (Image 2: After Fix, this PR)]
