Hey,
I'm currently tracing out the story of diffusion generative models, and right now I'm studying the denoising score matching (DSM) objective. I've noticed that your multi-scale approach relies heavily on it (and the original paper is quite old), so I decided to ask my question here.
I've gone through the theory of DSM and have a good grip on how it works and why it works. In practice, however, I observe slow convergence (much slower than with ISM) on toy examples. I believe this might be due to the type of noise distribution selected. While DSM doesn't restrict the choice, everyone seems to go with a normal distribution, since it yields a simple derivative: 1/sigma**2 * (orig - perturbed). In practice, I've observed that the 1/sigma**2 prefactor is on the order of 1e4 for sigma=1e-2, and the loss jumps around quite heavily. The smaller sigma is, the slower the convergence. The loss never actually decreases, but the resulting gradient field looks comparable to what ISM gives.
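For concreteness, here is a small NumPy sketch (names and the toy data are illustrative, not from any particular codebase) of the scale issue I mean: for a Gaussian perturbation `x_tilde = x + sigma * eps`, the DSM regression target `(x - x_tilde) / sigma**2` reduces to `-eps / sigma`, so its magnitude blows up as sigma shrinks.

```python
import numpy as np

# Hypothetical toy setup: x is "clean" 2-D data, x_tilde a Gaussian perturbation.
# The conditional score of q(x_tilde | x) = N(x_tilde; x, sigma^2 I) is
#   (x - x_tilde) / sigma**2  ==  -eps / sigma,
# so the DSM target's magnitude grows like 1/sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=(10000, 2))

scales = {}
for sigma in (1.0, 1e-1, 1e-2):
    eps = rng.normal(size=x.shape)
    x_tilde = x + sigma * eps              # perturbed sample
    target = (x - x_tilde) / sigma**2      # DSM target; exactly -eps / sigma
    scales[sigma] = float(np.abs(target).mean())
    print(f"sigma={sigma:g}  mean |target| ~ {scales[sigma]:.1f}")
```

So the target values scale like 1/sigma (roughly 80 per component at sigma=1e-2 here), and the squared loss scales like 1/sigma**2, which would explain the jumpy loss at small sigma.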
Did you observe this in your experiments as well?