Hey, thanks for open-sourcing this code!
I had a quick question about the `finetune_unet` function in `train.py`: why are there two forward passes and loss computations through the UNet?
Is it to implement some sort of self-conditioning, which I've read about in some text-to-image diffusion papers? (Or could you point me to the part of the paper this corresponds to?)
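For reference, by self-conditioning I mean the two-pass training scheme from papers like "Analog Bits": a first pass produces an estimate that is detached and fed back as extra conditioning for a second pass, and only the second pass gets gradients. A minimal PyTorch sketch of what I have in mind (the `ToyDenoiser` model and tensor shapes here are hypothetical, just to illustrate the pattern, and real implementations usually also zero out the conditioning with some probability during training):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for a UNet: takes the noisy input x_t
    concatenated with a self-conditioning estimate along channels."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, x0_hat: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, x0_hat], dim=1))

def self_conditioned_loss(model, x_t, x0_target):
    # First pass: predict with the self-conditioning slot zeroed out,
    # and detach so no gradients flow through this pass.
    with torch.no_grad():
        x0_hat = model(x_t, torch.zeros_like(x_t))
    # Second pass: condition on the first estimate; only this pass
    # contributes gradients to the loss.
    pred = model(x_t, x0_hat)
    return F.mse_loss(pred, x0_target)

model = ToyDenoiser()
x_t = torch.randn(2, 4, 8, 8)
x0 = torch.randn(2, 4, 8, 8)
loss = self_conditioned_loss(model, x_t, x0)
loss.backward()
```

Is that roughly what the two passes in `finetune_unet` are doing, or is it something else (e.g. a separate loss term)?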
Thanks!