train question #120

@zzunnii

Hello, thanks for your great work on CatVTON.

While reproducing the mask-based model, I had some questions regarding the training process:

  1. In the training step, should the loss be calculated only on the masked regions, or over the entire latent tensor?

  2. During the DREAM step, should we perform an additional forward pass through the UNet to obtain the predicted noise, or reuse the initially predicted noise?

  3. Is it intended to sample training timesteps only from the later part of the schedule (e.g., 500–1000), instead of the full range (1–1000)?

  4. In mask-based mode, is the reference image latent (concatenated along the spatial dimension) effectively used during denoising, or is additional regularization required to ensure that clothing information is transferred?
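
For concreteness, the alternatives in questions 1 and 3 can be sketched as follows. This is a minimal NumPy illustration, not code from the CatVTON repository, and it does not imply which option the authors use. The tensor shapes, the mask layout, the function names (`full_loss`, `masked_loss`, `sample_timesteps`), and the 0-indexed timestep range are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: batch of 2 latents, 4 channels, 8x8 spatial grid.
noise_pred = rng.normal(size=(2, 4, 8, 8))  # stand-in for the UNet output
noise_true = rng.normal(size=(2, 4, 8, 8))  # noise that was added to the latents
mask = np.zeros((2, 1, 8, 8))               # 1 = masked region, 0 = unmasked
mask[:, :, 2:6, 2:6] = 1.0


def full_loss(pred, target):
    """Question 1, option A: MSE averaged over the entire latent tensor."""
    return float(np.mean((pred - target) ** 2))


def masked_loss(pred, target, mask):
    """Question 1, option B: MSE averaged over masked positions only."""
    sq_err = (pred - target) ** 2 * mask   # mask broadcasts over channels
    n_elems = mask.sum() * pred.shape[1]   # masked positions x channels
    return float(sq_err.sum() / n_elems)


def sample_timesteps(batch_size, low, high, rng):
    """Question 3: draw integer timesteps uniformly from [low, high)."""
    return rng.integers(low, high, size=batch_size)


loss_a = full_loss(noise_pred, noise_true)
loss_b = masked_loss(noise_pred, noise_true, mask)
t_full = sample_timesteps(2, 0, 1000, rng)    # full schedule (0-indexed)
t_late = sample_timesteps(2, 500, 1000, rng)  # later half of the schedule only
```

With an all-ones mask the two losses coincide, so they differ only in whether unmasked positions contribute to the gradient.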

Thanks for your clarification!
