Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset - Official Codebase
This work has been accepted by ACM MM 2025.
This guide will walk you through setting up the environment and data to run our code.
- Create a new conda environment (Python 3.10.10 is recommended); example commands are sketched after this list.
- Install `torch` 2.1 and other Python packages. We recommend using `uv` for faster installation.

  ```bash
  pip install -U pip
  pip install uv
  uv pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 # Example torch installation
  uv pip install -r requirements.txt
  ```
- Install Java for the METEOR evaluation package (used for Scan2Cap).
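The environment setup above can be run roughly as follows. This is a minimal sketch: the environment name `mvscanqa` and the conda-forge OpenJDK package are illustrative choices, not prescribed by this repository.

```bash
# Create and activate the conda environment (name is illustrative)
conda create -n mvscanqa python=3.10.10 -y
conda activate mvscanqa

# Java runtime for the METEOR evaluation package (used for Scan2Cap).
# Installing OpenJDK from conda-forge is one option (an assumption); any JRE should work.
conda install -c conda-forge openjdk -y
```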
- Download Data & Checkpoints: Download the necessary components from the links below.

  | Component | Link | Description |
  |---|---|---|
  | Compiled Data "SVC" | Download | Our pre-processed datasets, features and annotations. |
  | ScanNet 2D Views | Download | Original 2D views from ScanNet. |
  | Pre-Trained LEGO Checkpoint | Download | Our pre-trained model checkpoints. |
  | Mask3D Detection Results | Download | Needed for inference on dense captioning tasks. |
  | LEO's Point Clouds | Download | Only needed if you run data preparation from scratch. |
- Organize Files: Unzip the downloaded files and arrange them according to the following directory structure. You will also need to update the `SVC_PATH` variable in `fuyu_utils.py` to point to your data directory.

  ```
  <REPO_PARENT>/
  |--<SVC_PATH>/                       # Your main data directory
  |  |--frames_square/                 # Unzipped ScanNet 2D Views
  |  |--scannet_data/                  # Unzipped from SVC's scannet_data.zip
  |  |--save_mask/                     # Unzipped Mask3D detection results
  |  |--pcd_with_global_alignment/     # Unzipped LEO's point clouds
  |  |--...                            # Other files from SVC data
  |--<REPO_PATH>/                      # This repository (MVScanQA), cloned
  |  |--finetune_fuyu.sh
  |  |--...
  ```
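A rough sketch of unpacking the downloads into this layout; the archive names below are placeholders, so substitute the actual file names you downloaded:

```bash
# Placeholder archive names -- replace with the files you actually downloaded.
SVC_PATH=/path/to/your/svc_data
mkdir -p "$SVC_PATH"

unzip frames_square.zip             -d "$SVC_PATH"   # ScanNet 2D views
unzip scannet_data.zip              -d "$SVC_PATH"   # from the SVC package
unzip save_mask.zip                 -d "$SVC_PATH"   # Mask3D detection results
unzip pcd_with_global_alignment.zip -d "$SVC_PATH"   # LEO's point clouds (optional)

# Finally, edit fuyu_utils.py so that its SVC_PATH variable points to this directory.
```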
Note: Some scripts download models from Hugging Face. If you are in a region with restricted access, you may need to set `HF_ENDPOINT` or `ALL_PROXY`.
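For example (the mirror endpoint and proxy address below are placeholders; use whatever endpoint or proxy applies to your setup):

```bash
# Use an alternative Hugging Face endpoint (placeholder URL)...
export HF_ENDPOINT=https://hf-mirror.com
# ...or route downloads through a proxy (placeholder address)
export ALL_PROXY=socks5://127.0.0.1:1080
```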
Please refer to DATA PREPARATION for detailed data preparation steps. You only need to run these steps if you want to regenerate the data from scratch.
Note: Training typically requires GPUs with 40GB of VRAM and 150-300GB of system RAM (for 4-8 GPUs). Results are saved to `<REPO_PARENT>/kuri3d-output`. Please log in to `wandb` to track metrics, or disable it with `wandb disabled`.
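The wandb part of that note comes down to one of the two standard `wandb` CLI commands:

```bash
wandb login      # track metrics (prompts for an API key)
# or
wandb disabled   # turn wandb off for this environment
```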
We have pre-extracted and included the 3D object features in the "SVC" data package. You only need to run this step if you want to regenerate the features from scratch.
- Download the pre-trained 3D detector from Vote2Cap-DETR or our compiled data.
- Compile and install PointNet++:

  ```bash
  cd lib/pointnet2
  python setup.py install
  ```
- Run the script to extract 3D object features:

  ```bash
  ./pre-extract-pnpp.sh
  ```
- 1st Stage (Optional): Pre-train a 3D feature adapter with the 2D LVLM backbone frozen. We found this stage has only a minor impact on final performance, so feel free to skip it to save time. We also provide the pre-trained 3D feature adapter in the "SVC" data package.

  ```bash
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  ./finetune_fuyu_1st_stage.sh
  ```
- 2nd Stage: Pre-train the full model on the complete TripAlign dataset using LoRA.

  ```bash
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  ./finetune_fuyu.sh
  ```
We found finetuning to be beneficial for MV-ScanQA and SQA3D. For other tasks, we recommend using the pre-trained model directly.
```bash
# On MV-ScanQA
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./finetune_fuyu_mvscanqa.sh --checkpoint_path <path_to_pretrained_checkpoint>

# On SQA3D
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./finetune_fuyu_downstream.sh --checkpoint_path <path_to_pretrained_checkpoint>
```
We also provide a finetuned checkpoint for MV-ScanQA here.
Here are the results reproduced by running the cleaned scripts in this repository.
| Checkpoint | Dataset | Metric | Result |
|---|---|---|---|
| best-pretrained-reproduced | ScanQA (val) | EM | 28.3 |
| | ScanQA (test with object) | EM | [N/A due to eval.ai outage] |
| | ScanQA (test without object) | EM | [N/A due to eval.ai outage] |
| | Scan2Cap (on ScanRefer) | [email protected] | 83.9 |
| | Scan2Cap (on ScanRefer) | [email protected] | 78.0 |
| | Scan2Cap (on Nr3D) | [email protected] | 62.8 |
| best-pretrained-reproduced + best-scanqa-mv_em | MV-ScanQA | EM | 33.7 |
Please refer to Inference for the detailed inference scripts.
Once LEGO is trained, you can run inference on downstream tasks. Below are example commands. Please change the dataset options in the shell scripts as needed.
```bash
# ScanQA (validation)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa

# ScanQA (test)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa --add_scanqa_test

# Scan2Cap (ScanRefer)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scan2cap

# Scan2Cap (Nr3D)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_nr3d --add_nr3d_val

# MV-ScanQA: a small additional LoRA is added during finetuning, so both the pre-trained LoRA and the finetuned LoRA must be specified
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa_mv --multiple_input_images "2x2" --base_model <path_to_pretrained_checkpoint>
```
Note: For ScanQA test-set performance, you need to submit the generated result files to the official Eval.ai platform. Run this script to convert the prediction files into a submission-ready format:
```bash
python prepare_scanqa_submission.py --prediction <path_to_prediction_json_file>
```
- Upload pre-trained checkpoints; Upload scene-view-object IoSA ratios.
- Upload pre-trained 3D detector; Upload 1st stage pre-trained 3D feature adapter.
- Fix file locations
- Add view selection code and docs; correct file locations.
- Add gradient checkpointing for pre-training and finetuning, for low-memory GPUs like RTX 3090.
- Update requirements.txt with correct `accelerate`+`transformers`+`peft` versions.
- Add sample selection code for TripAlign.
- Test cleaned scripts to reproduce reported performances.
- Update inference for each dataset.
- Update bibtex.
We would like to thank facebookresearch/votenet and ch3cook-fdu/Vote2Cap-DETR for the 3D object detector code and pre-trained weights.
If you find this codebase useful, please consider citing our work:
```bibtex
@inproceedings{mo2025mvscanqa,
  title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
  author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025},
}
```
This code repository and the datasets are licensed under the CC-BY-4.0 license.
Copyright (c) 2025 Wentao Mo.