Description
Hello, I have some questions about the training settings.
- In the inference code, there is a line that says:
conditional_latents_mask = mask_token.repeat(bsz_cfg, num_frames-2, 1, latent_h, latent_w)
It looks like the batch is doubled for CFG (`bsz_cfg`), but instead of using zeros for the unconditional half, the same mask token is repeated for both halves. Is there a specific reason for this approach? Was the model trained only with conditional inputs, without any separate unconditional training?
- Also, the original SVD Xtend code typically uses a learning rate of 1e-5, but the Framer paper mentions a learning rate of 1e-4. Is there a specific reason for this difference?
- The pretrained SVD model used here generates 25 frames at a resolution of 1024x576, but isn’t there also a model that generates 14 frames at 512x320? The frame setting here seems closer to the latter. Is there a reason for choosing the former?
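
To make the first question concrete, here is a toy sketch of the two options I have in mind. It is plain Python with floats standing in for latents, not the actual Framer code. The function names are hypothetical, and the unconditional-first batch ordering is my assumption based on the usual CFG convention.

```python
# Toy illustration only: each "latent" is a single float, no PyTorch involved.
# All function names are hypothetical and not taken from the Framer repository.

def repeat_for_cfg(mask_token, bsz_cfg):
    """Use the same mask token for every CFG entry (what the quoted line appears to do)."""
    return [mask_token for _ in range(bsz_cfg)]


def zero_uncond_for_cfg(mask_token, bsz_cfg):
    """Use zeros for the unconditional half and the token for the conditional half.

    Assumes the unconditional entries come first in the batch.
    """
    half = bsz_cfg // 2
    return [0.0] * half + [mask_token] * (bsz_cfg - half)


def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Standard classifier-free guidance combination of the two predictions."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


repeated = repeat_for_cfg(0.5, bsz_cfg=2)      # both halves see the mask token
zeroed = zero_uncond_for_cfg(0.5, bsz_cfg=2)   # only the conditional half sees it
```

With the first variant, the mask conditioning is identical in both halves, so any difference between the conditional and unconditional predictions would have to come from some other input. That is what prompted the question about how the model was trained.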