Diffusion Loss is LYING to Optimizer! (AKA my addition to Everything you know about loss is a LIE! #294) #1994
jilek-josef started this conversation in Show and tell
Replies: 1 comment
- I tested with the flow-chroma model.
Well, not exactly. I just wanted a funny title, but it's not completely wrong either. I will try to explain.
Timesteps, Optimizers, Batches & Loss
First, let's recapitulate what we know so far. Diffusion models are based on what we call a denoising process. The whole thing the model does is remove noise from an image (or, more precisely, from something we call the latent space, but for simplicity let's imagine it as an image for now). During training we therefore add noise to the images, and the amount of noise is defined by the timestep: the higher the timestep, the more noise is added; the lower the timestep, the less. Which is all fine, except that it breaks our loss function.
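To make the timestep/noise relationship concrete, here is a minimal sketch of the noising step, assuming the rectified-flow convention used by Flux/Chroma-style models (the helper name and exact formulation are mine, not from the linked trainer):

```python
import torch

def add_noise(latents: torch.Tensor, timesteps: torch.Tensor):
    """Interpolate between clean latents and pure noise.

    Assumes timesteps in [0, 1]: t=0 is the clean latent, t=1 is pure noise,
    matching the rectified-flow formulation used by Flux/Chroma-like models.
    """
    noise = torch.randn_like(latents)
    t = timesteps.view(-1, 1, 1, 1)          # broadcast over (B, C, H, W)
    noisy = (1.0 - t) * latents + t * noise  # higher t => more noise
    return noisy, noise
```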
The model naturally can't be as accurate when facing a lot of noise, while it is a much simpler task when there is less noise. More noise therefore naturally results in higher loss.
Usually in machine learning, loss is a direct indicator of how well something matches, but here, because of timesteps, loss is only really meaningful when compared among the same timesteps. This issue is somewhat mitigated with really large batches, like 100+ or more realistically 1000+ batch size: then the loss averages over many timesteps and is relatively stable.
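One practical consequence: if you want to watch the loss at all, it only makes sense bucketed by timestep. A small sketch of such logging (my own illustration, not code from the linked trainer; it assumes a per-sample loss of shape `[B]`):

```python
from collections import defaultdict
import torch

loss_by_bucket = defaultdict(list)

def log_loss(per_sample_loss: torch.Tensor, timesteps: torch.Tensor, num_buckets: int = 10):
    """Record each sample's loss under its timestep bucket, so losses are only
    ever compared against losses from a similar noise level."""
    buckets = (timesteps * num_buckets).long().clamp(max=num_buckets - 1)
    for b, l in zip(buckets.tolist(), per_sample_loss.detach().tolist()):
        loss_by_bucket[b].append(l)
```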
With LoRA, however, that is an issue: we can't use batches that big due to compute and VRAM constraints, so the loss isn't averaged well within a batch and fluctuates. This is amplified by the fact that the timesteps within a single batch are random, so we can sometimes draw a batch that is all low or all high timesteps.
Why is loss important? Well, it directly guides and affects the optimizer. Higher loss effectively means bigger gradients (and thus a bigger effective learning rate), and when many high-loss steps are suddenly followed by a low-loss one, it causes overshooting. Imagine you are basically blindly trying something in different ways and are always told it's no good, and then suddenly you get praised with "this is the best thing you have ever done". You would get confused, right, or perhaps think you have finally made it. Well, it's much the same situation when the optimizer suddenly gets a low loss.
Loli Optimizer: A New Approach
To address the loss issue we have roughly two options. Either we can scale the loss directly, for example multiply the loss by 2 when the timestep is lower than some threshold x. But then we would be lying to the optimizer, and while training becomes more stable, it has side effects such as an effectively increased learning rate. It's also hard to generalize, since the right scaling may differ a bit depending on what we are training, like style vs. character.
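A minimal sketch of this first option, direct loss scaling (the threshold and factor below are made-up placeholders, not values from the repo):

```python
import torch

def weighted_loss(per_sample_loss: torch.Tensor, timesteps: torch.Tensor,
                  threshold: float = 0.3, factor: float = 2.0) -> torch.Tensor:
    """Scale up the loss at low-noise timesteps. This evens out the magnitude
    the optimizer sees, but it is effectively lying to it: low-timestep updates
    now behave as if the learning rate were higher."""
    weights = torch.where(timesteps < threshold,
                          torch.full_like(per_sample_loss, factor),
                          torch.ones_like(per_sample_loss))
    return (per_sample_loss * weights).mean()
```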
The second option I came up with is creating a special "timestep aware" optimizer. To be fair, I don't think a truly timestep-aware optimizer is practically possible, but I came up with a workaround: why not cluster timesteps and keep a separate optimizer state for each cluster?
And so I did that. But it requires one more change to work properly: instead of having random timesteps in a batch, we need to draw timesteps from only one cluster per batch. This way we get an interesting effect, where we update more or less different weights per cluster. Which is good, and actually sort of required, since having different optimizer states means they can make different and possibly contradicting updates.
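Here is a minimal sketch of the idea, assuming a plain PyTorch loop; `lora_model`, `data_iter`, `training_step`, `batch_size` and `max_steps` are placeholders, not names from the linked repo. The point is one optimizer instance (and therefore one independent set of Adam moments) per timestep cluster, with every batch drawn from a single cluster so only that cluster's state is stepped.

```python
import random
import torch

NUM_CLUSTERS = 10  # timestep range [0, 1) split into 10 buckets
params = [p for p in lora_model.parameters() if p.requires_grad]

# One optimizer per cluster => one independent set of moment estimates per noise level.
optimizers = [torch.optim.AdamW(params, lr=1e-4) for _ in range(NUM_CLUSTERS)]

def sample_timesteps(batch_size: int, cluster: int) -> torch.Tensor:
    """Draw every timestep in the batch from a single cluster's sub-range."""
    lo, hi = cluster / NUM_CLUSTERS, (cluster + 1) / NUM_CLUSTERS
    return lo + (hi - lo) * torch.rand(batch_size)

for step in range(max_steps):
    cluster = random.randrange(NUM_CLUSTERS)      # one cluster for the whole batch
    timesteps = sample_timesteps(batch_size, cluster)

    batch = next(data_iter)                       # however your trainer yields latents/captions
    loss = training_step(batch, timesteps)        # forward pass + loss at these timesteps

    optimizers[cluster].zero_grad(set_to_none=True)
    loss.backward()
    optimizers[cluster].step()                    # only this cluster's moments get updated
```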
A good balance, I think, is about 10 clusters. There is not much benefit to having more. Having fewer, like 5, is also acceptable, but going lower than that may start to be problematic. Also, while this should be applicable to full training as well, possibly speeding up convergence there too, it may not be the best fit there, since having 10 clusters means 10x more memory for optimizer state. For LoRA this works, since the optimizer size scales with the LoRA rank.
Loli Loss Optimization
There are multiple loss functions, and each works a bit differently and suits different things. There is the commonly used MSE, which strongly pushes toward a pixel-perfect match. That can be a really good thing, but for diffusion not always. Let's say we want a tree in the image: the target has the tree on one side, but the prediction puts it on the other. MSE in that case acts like everything is wrong.
Then there is L1, which balances this somewhat and does not push toward pixel-perfection as hard. We can also base the loss on cosine similarity, which accepts some position shifts as a rather fine thing.
L1 is cool and probably the most balanced approach here, but why not use all three? Computing the loss is cheap, so we can just compute all three and sum them with weights, such as MSE*0.4 + L1*0.4 + cos_loss*0.2. We could even expose the weights as training parameters and tune the balance further, but even this fixed mix turned out to be a better alternative to pure MSE.
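A sketch of that combined loss with the 0.4/0.4/0.2 weights mentioned above (shapes of `pred`/`target` assumed to be `(B, C, H, W)`; the function name is mine):

```python
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  w_mse: float = 0.4, w_l1: float = 0.4, w_cos: float = 0.2) -> torch.Tensor:
    """MSE pushes toward pixel-perfect matches, L1 is more forgiving,
    and cosine similarity mostly cares about direction, tolerating small shifts."""
    mse = F.mse_loss(pred, target)
    l1 = F.l1_loss(pred, target)
    # Cosine similarity per sample over all flattened elements, turned into a loss term.
    cos = F.cosine_similarity(pred.flatten(1), target.flatten(1), dim=1)
    cos_loss = (1.0 - cos).mean()
    return w_mse * mse + w_l1 * l1 + w_cos * cos_loss
```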
Code
This was specifically tested on Flux/Chroma, however it should generalize to any diffusion model. Here is code which implements this for a Chroma LoRA trainer: https://github.com/jilek-josef/flow-chroma
This ultimately needs more testing, since there may be scenarios where it is not ideal. However, from my testing it helped a lot, especially for Flux/Chroma, where without it the training was significantly unstable and not doing well.