Hello, thanks for your great work on CatVTON.
While reproducing the mask-based model, I had some questions regarding the training process:
- In the training step, should the loss be calculated only on the masked regions, or over the entire latent tensor?
- During the DREAM step, should we perform an additional forward pass through the UNet to obtain the predicted noise, or reuse the initially predicted noise?
- Is it intended to sample training timesteps only from the later part of the schedule (e.g., 500–1000), instead of the full range (1–1000)?
- In mask-based mode, is the reference image latent (concatenated along the spatial dimension) effectively utilized during denoising, or is any additional regularization required to ensure clothing information transfer?
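
For concreteness, the alternatives in the first and third questions can be sketched as below. This is only a minimal NumPy illustration, not the CatVTON training code: the function names (`full_mse`, `masked_mse`, `sample_timesteps`), the latent shape, and the 0-based timestep indexing are all assumptions made for the example.

```python
import numpy as np

def full_mse(pred, target):
    """MSE averaged over every element of the latent tensor."""
    return float(np.mean((pred - target) ** 2))

def masked_mse(pred, target, mask):
    """MSE averaged only over elements where mask == 1.

    `mask` has shape (B, 1, H, W) and is broadcast across channels.
    """
    mask = np.broadcast_to(mask, pred.shape).astype(pred.dtype)
    denom = max(float(mask.sum()), 1.0)  # avoid division by zero for empty masks
    return float(np.sum(mask * (pred - target) ** 2) / denom)

def sample_timesteps(rng, batch_size, low, high):
    """Uniform integer timesteps in the half-open range [low, high)."""
    return rng.integers(low, high, size=batch_size)

rng = np.random.default_rng(0)

# Hypothetical predicted and target noise in latent space: (B, C, H, W).
pred = rng.standard_normal((2, 4, 8, 8))
target = rng.standard_normal((2, 4, 8, 8))

# Hypothetical inpainting mask, 1 inside the region to be repainted.
mask = np.zeros((2, 1, 8, 8))
mask[:, :, 2:6, 2:6] = 1.0

loss_full = full_mse(pred, target)            # variant A: entire latent
loss_masked = masked_mse(pred, target, mask)  # variant B: masked region only

# Timestep sampling: full schedule vs. later half only (0-based indices assumed).
t_full = sample_timesteps(rng, 16, 0, 1000)
t_late = sample_timesteps(rng, 16, 500, 1000)
```

With an all-ones mask the two loss variants coincide, so the difference only matters when the masked region covers part of the latent.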
Thanks for your clarification!