train question

Hello, thanks for your great work on CatVTON.

While reproducing the mask-based model, I had some questions regarding the training process:

1. In the training step, should the loss be calculated only on the masked regions, or over the entire latent tensor?

2. During the DREAM step, should we perform an additional forward pass through the UNet to obtain the predicted noise, or reuse the initially predicted noise?

3. Is it intended to sample training timesteps only from the later part of the schedule (e.g., 500–1000), instead of the full range (1–1000)?

4. In mask-based mode, is the reference image latent (concatenated along the spatial dimension) actually used during denoising, or is additional regularization required to ensure that clothing information is transferred?
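
To make questions 1 and 3 concrete, the two alternatives in each can be sketched as below. This is a toy, framework-free illustration on flat lists, not the CatVTON training code; the names `mse` and `sample_timestep` and all values are hypothetical, and a real training loop would use tensors and the UNet's predicted noise instead.

```python
import random

def mse(pred, target, mask=None):
    """Mean squared error over flat sequences.

    If mask is given (1.0 = inpainting region, 0.0 = elsewhere), only the
    masked positions contribute and the sum is divided by the mask total.
    """
    if mask is None:
        mask = [1.0] * len(pred)
    weighted = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    denom = sum(mask)
    return weighted / denom if denom > 0 else 0.0

def sample_timestep(rng, low=1, high=1000):
    """Draw one integer timestep uniformly from [low, high], inclusive."""
    return rng.randint(low, high)

# Toy "predicted noise" and "target noise" for a 4-element latent.
pred = [0.0, 1.0, 2.0, 3.0]
target = [0.0, 0.0, 0.0, 0.0]
mask = [0.0, 0.0, 1.0, 1.0]

# Question 1: loss over the entire latent vs. only the masked region.
full_loss = mse(pred, target)
masked_loss = mse(pred, target, mask)

# Question 3: timesteps from the full schedule vs. only the later part.
rng = random.Random(0)
full_range = [sample_timestep(rng) for _ in range(5)]
late_only = [sample_timestep(rng, low=500) for _ in range(5)]
```

With these toy values the two losses differ (the masked variant averages over two positions instead of four), which is why the choice matters when trying to reproduce the reported results.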

Thanks for your clarification!


train question #120
