Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset - Official Codebase
This work has been accepted by ACM MM 2025.
This guide will walk you through setting up the environment and data to run our code.
- Create a new conda environment (Python 3.10.10 is recommended); example commands are sketched after this list.
- Install `torch` 2.1 and other Python packages. We recommend using `uv` for faster installation.

  ```bash
  pip install -U pip
  pip install uv
  uv pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 # Example torch installation
  uv pip install -r requirements.txt
  ```
- Install Java for the METEOR evaluation package (used for Scan2Cap).
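The environment setup above can be run roughly as follows. This is a minimal sketch: the environment name `mvscanqa` and the conda-forge OpenJDK package are illustrative choices, not prescribed by this repository.

```bash
# Create and activate the conda environment (name is illustrative)
conda create -n mvscanqa python=3.10.10 -y
conda activate mvscanqa

# Java runtime for the METEOR evaluation package (used for Scan2Cap).
# Installing OpenJDK from conda-forge is one option (an assumption); any JRE should work.
conda install -c conda-forge openjdk -y
```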
- Download Data & Checkpoints: Download the necessary components from the links below.

  | Component | Link | Description |
  |---|---|---|
  | Compiled Data "SVC" | Download | Our pre-processed datasets, features and annotations. |
  | ScanNet 2D Views | Download | Original 2D views from ScanNet. |
  | Pre-Trained LEGO Checkpoint | Download | Our pre-trained model checkpoints. |
  | Mask3D Detection Results | Download | Needed for inference on dense captioning tasks. |
  | LEO's Point Clouds | Download | Only needed if you run data preparation from scratch. |
- Organize Files: Unzip the downloaded files and arrange them according to the following directory structure. You will also need to update the `SVC_PATH` variable in `fuyu_utils.py` to point to your data directory.

  ```
  <REPO_PARENT>/
  |--<SVC_PATH>/                       # Your main data directory
  |  |--frames_square/                 # Unzipped ScanNet 2D Views
  |  |--scannet_data/                  # Unzipped from SVC's scannet_data.zip
  |  |--save_mask/                     # Unzipped Mask3D detection results
  |  |--pcd_with_global_alignment/     # Unzipped LEO's point clouds
  |  |--...                            # Other files from SVC data
  |--<REPO_PATH>/                      # This repository (MVScanQA), cloned
  |  |--finetune_fuyu.sh
  |  |--...
  ```
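A rough sketch of unpacking the downloads into this layout; the archive names below are placeholders, so substitute the actual file names you downloaded:

```bash
# Placeholder archive names -- replace with the files you actually downloaded.
SVC_PATH=/path/to/your/svc_data
mkdir -p "$SVC_PATH"

unzip frames_square.zip             -d "$SVC_PATH"   # ScanNet 2D views
unzip scannet_data.zip              -d "$SVC_PATH"   # from the SVC package
unzip save_mask.zip                 -d "$SVC_PATH"   # Mask3D detection results
unzip pcd_with_global_alignment.zip -d "$SVC_PATH"   # LEO's point clouds (optional)

# Finally, edit fuyu_utils.py so that its SVC_PATH variable points to this directory.
```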
Note: Some scripts download models from Hugging Face. If you are in a region with restricted access, you may need to set `HF_ENDPOINT` or `ALL_PROXY`.
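For example (the mirror endpoint and proxy address below are placeholders; use whatever endpoint or proxy applies to your setup):

```bash
# Use an alternative Hugging Face endpoint (placeholder URL)...
export HF_ENDPOINT=https://hf-mirror.com
# ...or route downloads through a proxy (placeholder address)
export ALL_PROXY=socks5://127.0.0.1:1080
```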
Please refer to DATA PREPARATION for detailed data preparation steps. You only need to run these steps if you want to regenerate the data from scratch.
Note: Training typically requires GPUs with 40GB of VRAM and 150-300GB of system RAM (for 4-8 GPUs). Results are saved to `<REPO_PARENT>/kuri3d-output`. Please log in to `wandb` to track metrics, or disable it with `wandb disabled`.
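The wandb part of that note comes down to one of the two standard `wandb` CLI commands:

```bash
wandb login      # track metrics (prompts for an API key)
# or
wandb disabled   # turn wandb off for this environment
```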
We have pre-extracted and included the 3D object features in the "SVC" data package. You only need to run this step if you want to regenerate the features from scratch.
- Download the pre-trained 3D detector from Vote2Cap-DETR or our compiled data.
- Compile and install PointNet++:

  ```bash
  cd lib/pointnet2
  python setup.py install
  ```
- Run the script to extract 3D object features:

  ```bash
  ./pre-extract-pnpp.sh
  ```
- 1st Stage (Optional): Pre-train a 3D feature adapter with the 2D LVLM backbone frozen. We found this stage has only a minor impact on final performance, so feel free to skip it to save time. We also provide the pre-trained 3D feature adapter in the "SVC" data package.

  ```bash
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  ./finetune_fuyu_1st_stage.sh
  ```
- 2nd Stage: Pre-train the full model on the complete TripAlign dataset using LoRA.

  ```bash
  export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
  ./finetune_fuyu.sh
  ```
We found finetuning to be beneficial for MV-ScanQA and SQA3D. For other tasks, we recommend using the pre-trained model directly.
```bash
# On MV-ScanQA
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./finetune_fuyu_mvscanqa.sh --checkpoint_path <path_to_pretrained_checkpoint>

# On SQA3D
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
./finetune_fuyu_downstream.sh --checkpoint_path <path_to_pretrained_checkpoint>
```
We also provide a finetuned checkpoint for MV-ScanQA here.
Here are the results reproduced by running the cleaned scripts in this repository.
| Checkpoint | Dataset | Metric | Result |
|---|---|---|---|
| best-pretrained-reproduced | ScanQA (val) | EM | 28.3 |
| | ScanQA (test with object) | EM | [N/A due to eval.ai outage] |
| | ScanQA (test without object) | EM | [N/A due to eval.ai outage] |
| | Scan2Cap (on ScanRefer) | [email protected] | 83.9 |
| | Scan2Cap (on ScanRefer) | [email protected] | 78.0 |
| | Scan2Cap (on Nr3D) | [email protected] | 62.8 |
| best-pretrained-reproduced + best-scanqa-mv_em | MV-ScanQA | EM | 33.7 |
Please refer to Inference for the detailed inference scripts.
Once LEGO is trained, you can run inference on downstream tasks. Below are example commands. Please change the dataset options in the shell scripts as needed.
```bash
# ScanQA (validation)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa

# ScanQA (test)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa --add_scanqa_test

# Scan2Cap (ScanRefer)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scan2cap

# Scan2Cap (Nr3D)
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_nr3d --add_nr3d_val

# MV-ScanQA: a small additional LoRA is added during finetuning, so both the pre-trained LoRA and the finetuned LoRA must be specified
./predict_fuyu.sh --checkpoint_path <path_to_checkpoint> --add_scanqa_mv --multiple_input_images "2x2" --base_model <path_to_pretrained_checkpoint>
```
Note: For ScanQA test-set performance, you need to submit the generated result files to the official Eval.ai platform. Run this script to convert the prediction files into a submission-ready format:
```bash
python prepare_scanqa_submission.py --prediction <path_to_prediction_json_file>
```
- Upload pre-trained checkpoints; Upload scene-view-object IoSA ratios.
- Upload pre-trained 3D detector; Upload 1st stage pre-trained 3D feature adapter.
- Fix file locations
- Add view selection code and docs; correct file locations.
- Add gradient checkpointing for pre-training and finetuning, for low-memory GPUs like RTX 3090.
- Update requirements.txt with correct `accelerate`+`transformers`+`peft` versions.
- Add sample selection code for TripAlign.
- Test cleaned scripts to reproduce reported performances.
- Update inference for each dataset.
- Update bibtex.
We would like to thank facebookresearch/votenet and ch3cook-fdu/Vote2Cap-DETR for the 3D object detector code and pre-trained weights.
If you find this codebase useful, please consider citing our work:
```bibtex
@inproceedings{mo2025mvscanqa,
  title={Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset},
  author={Mo, Wentao and Chen, QingChao and Peng, Yuxin and Huang, Siyuan and Liu, Yang},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025},
}
```
This code repository and the datasets are licensed under the CC-BY-4.0 license.
Copyright (c) 2025 Wentao Mo.