In the following experiments, we incorrectly scale the focal point by a factor defined in src/misc/dl3dv_utils.py (line 45). The trained model simply adapts to the incorrect scaling factor, since the representation is implicit and can only be observed through the renderer. For consistency with standard 3D approaches, we will fix this bug in future versions.
scenetok_va-vdc_shift4_dl3dv_finetuned
scenetok_va-vdc_shift8_dl3dv_finetuned

The intrinsics are already normalized by the stored height and width, so they do not require additional scaling. Other experiments do not have this bug.
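As a rough illustration of the issue (the helper below and the constant 256 are assumptions inferred from the flag name `dataset.scale_focal_by_256`, not the exact code in src/misc/dl3dv_utils.py):

```python
import numpy as np

# Hypothetical sketch: intrinsics are stored normalized by image
# height/width, so fx and fy already need no further scaling.
def normalize_intrinsics(K, height, width):
    K = K.astype(np.float64).copy()
    K[0, :] /= width   # fx, skew, cx
    K[1, :] /= height  # fy, cy
    return K

K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]])
K_norm = normalize_intrinsics(K, height=256, width=256)

# The bug applies an extra constant scale on top of the already
# normalized focal; the implicit representation silently adapts to it.
K_buggy = K_norm.copy()
K_buggy[0, 0] *= 256  # incorrect extra scaling of fx
K_buggy[1, 1] *= 256  # incorrect extra scaling of fy
```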
Similar to the bug above, we incorrectly scale the focal point of the context views only, since these experiments reused the same precomputed VA-VAE latents as in the previous issue.
scenetok_va-wan_shift4_dl3dv_finetuned
scenetok_va-wan_shift8_dl3dv_finetuned

In src/model/embedding/lvsm_embed.py, we expose the parameter temporal_downsample for the reshaping layer, as shown below:
```python
nn.Sequential(
    Rearrange(
        "b ... (t c) (hh ph) (ww pw) -> b ... (hh ww) (t ph pw c)",
        ph=cfg.patch_size,
        pw=cfg.patch_size,
        t=temporal_downsample,
    ),
    nn.Linear(
        cfg.in_channels * (cfg.patch_size**2),
        embed_dim,
        bias=False,
    ),
)
```

Normally, the correct value is temporal_downsample=4 for both VideoDCAE and WanVAE, but the model we trained for WanVAE uses temporal_downsample=1. This should not impact downstream camera-controlled rendering, since only the order of the channels differs before the MLP projection. The following configs are affected by this bug (same as in Bug 2):
scenetok_va-wan_shift4_dl3dv_finetuned
scenetok_va-wan_shift8_dl3dv_finetuned

Note
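To see why temporal_downsample only permutes channels within each token, here is a minimal numpy sketch of the same rearrange pattern (patchify is a hypothetical stand-in for the einops Rearrange above, with a toy tensor in place of real latents):

```python
import numpy as np

def patchify(x, ph, pw, t):
    # x: (C, H, W) with C = t * c, mimicking the pattern
    # "(t c) (hh ph) (ww pw) -> (hh ww) (t ph pw c)"
    C, H, W = x.shape
    c = C // t
    hh, ww = H // ph, W // pw
    x = x.reshape(t, c, hh, ph, ww, pw)
    x = x.transpose(2, 4, 0, 3, 5, 1)  # -> hh, ww, t, ph, pw, c
    return x.reshape(hh * ww, t * ph * pw * c)

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
tok_correct = patchify(x, ph=2, pw=2, t=2)  # intended grouping
tok_trained = patchify(x, ph=2, pw=2, t=1)  # the trained model's grouping

# Same token count and dimension; each token holds the same set of
# values, just in a different channel order before the MLP projection.
assert tok_correct.shape == tok_trained.shape
assert np.array_equal(np.sort(tok_correct, axis=1),
                      np.sort(tok_trained, axis=1))
```

Because the permutation is fixed, the subsequent nn.Linear can absorb it, which is why the checkpoints trained with the wrong value still render correctly.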
In all future experiments, we will address these bugs. For the already trained checkpoints, we added parameters that reproduce the buggy behavior intentionally at inference time. Make sure to disable them when training your own model from scratch.
```yaml
dataset.scale_focal_by_256: true          # for Bug 1
dataset.scale_context_focal_by_256: true  # for Bug 2
model.force_incorrect: true               # for Bug 3
```