
Failure in Fine-tuning VGGT for Feed-forward Novel View Synthesis #458

Description

@RuiHuang9

I have made extensive attempts but consistently failed to reproduce the fine-tuning of VGGT for the novel view synthesis (NVS) task, as the training loss does not decrease. Below are the specific details of my reproduction process. I hope you can help identify the issues:

Data Input:
1. For each scene in the DL3DV dataset, I randomly sample 5 views together with their camera parameters as one data sample. The five views keep the original dataset order, with the first four as reference views and the last as the target view.
2. The camera-to-world (c2w) matrices are converted from the OpenGL to the OpenCV coordinate convention by negating the second and third columns of the rotation (I also ran experiments without this conversion).
3. Normalization approaches tried:
a. No normalization: within each 5-view sample, the first view's c2w is set to the identity and the remaining c2ws are transformed into its frame.
b. Normalizing the entire scene with your provided normalize_camera_extrinsics_and_points_batch() function, then setting the first view's c2w to the identity within each sample and adjusting the others accordingly.
c. Using the same function to normalize only the 5 views within a sample.
4. After normalization, the target view's camera is converted to Plücker coordinates.
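For reference, here is a minimal sketch of steps 2 and 4 as I understand them. The function names (`opengl_to_opencv`, `plucker_ray_map`) and the 6-channel direction-plus-moment layout are my own assumptions, not code from the VGGT repo:

```python
import torch
import torch.nn.functional as F

def opengl_to_opencv(c2w):
    # Flip the camera's y and z axes (negate columns 1 and 2 of the rotation),
    # converting OpenGL (right, up, back) to OpenCV (right, down, forward).
    c2w = c2w.clone()
    c2w[..., :3, 1:3] *= -1
    return c2w

def plucker_ray_map(c2w, K, H, W):
    # Per-pixel Plücker embedding (6, H, W): unit ray direction d and moment o x d.
    device = c2w.device
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32) + 0.5,
        torch.arange(W, device=device, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Unproject pixel centers to camera-space directions, then rotate to world.
    dirs_cam = torch.stack([(xs - cx) / fx, (ys - cy) / fy, torch.ones_like(xs)], dim=-1)
    dirs_world = F.normalize(dirs_cam @ c2w[:3, :3].T, dim=-1)
    origin = c2w[:3, 3].expand_as(dirs_world)        # camera center, same for all pixels
    moment = torch.cross(origin, dirs_world, dim=-1)
    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)  # (6, H, W)
```

Note that with the first view's c2w fixed to the identity, the target view's Plücker map is relative to that reference frame, so the embedding is invariant to the global pose of the scene.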

VGGT Model Modifications:
1. The Plücker coordinates of the target view are embedded with either a custom convolutional layer or the PatchEmbed class from your vggt.layer, and the resulting tokens are concatenated with the patch tokens DINO extracts from the input images.
2. The original DPT heads are removed entirely and replaced with a modified version of the 3D point head that outputs RGB values.
3. The modified head was tested both without an activation and with a sigmoid activation.
4. The loss function follows LVSM: L2 loss + 0.5 × perceptual loss (using LVSM's perceptual loss implementation).
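To make the setup concrete, here is a sketch of the target-view token path and the loss. `PluckerPatchEmbed`, `concat_target_tokens`, and the `perceptual_fn` callable are hypothetical stand-ins; in particular, LVSM's actual perceptual loss is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PluckerPatchEmbed(nn.Module):
    """Patchifies the 6-channel Plücker map of the target view into tokens
    via a strided convolution (the 'custom convolutional layer' option)."""
    def __init__(self, patch_size=14, embed_dim=1024):
        super().__init__()
        self.proj = nn.Conv2d(6, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, plucker):                   # (B, 6, H, W)
        tokens = self.proj(plucker)               # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D)

def concat_target_tokens(image_tokens, plucker_tokens):
    # Append target-view ray tokens to the reference-view patch tokens
    # along the sequence dimension before the aggregator.
    return torch.cat([image_tokens, plucker_tokens], dim=1)

def nvs_loss(pred, target, perceptual_fn, w_perc=0.5):
    # L2 photometric term plus weighted perceptual term, following LVSM.
    # `perceptual_fn` is an assumed callable (e.g. a frozen VGG feature distance).
    return F.mse_loss(pred, target) + w_perc * perceptual_fn(pred, target)
```

One design question worth checking: if the head has no activation, the L2 target range matters, and with a sigmoid the RGB targets should be in [0, 1] so the head can actually reach them.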

Training Setup:
1. Hardware: single GPU (NVIDIA RTX 4090D, 48 GB), CUDA 11.8, PyTorch 2.9.0, WSL2 with Ubuntu 22.04.
2. Hyperparameters: gradient-accumulation × batch-size combinations of 1×1, 4×4, and 8×8; learning rate 1e-4 (1e-3 caused a sharp increase in loss).
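For clarity, this is the gradient-accumulation loop I am assuming the settings above refer to (e.g. "4×4" = 4 accumulation steps × per-step batch size 4, i.e. an effective batch of 16); the helper name is hypothetical:

```python
import torch

def train_steps(model, loader, loss_fn, optimizer, accum_steps=4):
    # One optimizer update per `accum_steps` micro-batches; dividing the loss
    # keeps the gradient magnitude comparable to a single large batch.
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, target) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), target) / accum_steps
        loss.backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```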

Observations:
1. Training and testing on a single sample (5 views) can achieve overfitting.
2. Training and testing on a single scene (60+ samples, 300+ views, randomly ordered) can also overfit after a few hundred iterations.
3. On the full dataset, with disjoint training and validation subsets, the training loss plateaus after thousands of iterations:
a. Training outputs remain in a narrow range (min ≥ 0.4, max ≤ 0.6).
b. Validation outputs degrade to a uniform gray image.
c. The perceptual loss oscillates around 0.6 and does not decrease.
