Anna Gelencsér-Horváth*† · Gergely Dinya* · Péter Halász · Dorka Erős · Islam Muhammad Muqsit · Kristóf Karacs
* Equal contribution. † Corresponding author.
SceneVGGT is a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping for autonomous and assistive navigation. It supports online, real-time processing of streamed data (e.g., from an iPhone Pro). The pipeline's GPU memory usage remains under 17 GB irrespective of the length of the input sequence, and it achieves competitive point-cloud performance on the ScanNet++ benchmark. Overall, SceneVGGT provides robust semantic identification and is fast enough to support interactive assistive navigation with audio feedback.
- [2026/4/30] Paper accepted for the IEEE ICIP 2026 conference.
- [2026/2/13] Paper released on arXiv.
- [2026/2/12] Code release.
SceneVGGT enables temporally coherent 3D semantic mapping by lifting 2D instance masks into 3D and tracking instances with the VGGT tracking head. Persistent object identities combined with timestamps provide computationally efficient, temporally consistent change detection. Projecting object locations onto the floor plane supports downstream assistive navigation, including a proof-of-concept navigation module.
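To make the lifting and floor-plane projection steps more concrete, here is a minimal, illustrative sketch in pure Python. It is not the SceneVGGT implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a per-pixel depth map in metres, and a frame whose y axis is vertical, so the floor plane is spanned by x and z. The function names (`lift_mask_to_3d`, `centroid`, `project_to_floor`) are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def lift_mask_to_3d(
    mask: Sequence[Sequence[bool]],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project masked pixels with valid depth using the pinhole model."""
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            # Skip pixels outside the instance mask or without a depth reading.
            if inside and z > 0.0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def centroid(points: Sequence[Point3D]) -> Point3D:
    """Mean position of an instance's 3D points."""
    if not points:
        raise ValueError("cannot compute the centroid of an empty point set")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def project_to_floor(point: Point3D, up_axis: int = 1) -> Tuple[float, float]:
    """Drop the vertical coordinate (assumed here to be y) to get a 2D floor location."""
    kept = [c for i, c in enumerate(point) if i != up_axis]
    return (kept[0], kept[1])


# Toy 2x2 example: the right-hand column belongs to the instance.
mask = [[False, True], [False, True]]
depth = [[0.0, 2.0], [0.0, 2.0]]
instance_points = lift_mask_to_3d(mask, depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
floor_xy = project_to_floor(centroid(instance_points))
```

In a real pipeline the points would first be transformed into a common world frame using the estimated camera pose, and the floor plane would be estimated rather than assumed to be axis-aligned. Both steps are omitted here for brevity.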
- Clone SceneVGGT
    git clone git@github.com:HBVC-AI/SceneVGGT.git
    cd SceneVGGT
- Create conda environment
    conda create -n scenevggt python=3.10
    conda activate scenevggt
- Install requirements
    pip install -r requirements.txt

Please download the VGGT model from here.
Coming soon.
If you find this project helpful, please consider citing the following paper:
@misc{scenevggt,
title={SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation},
author={Anna Gelencsér-Horváth and Gergely Dinya and Dorka Boglárka Erős and Péter Halász and Islam Muhammad Muqsit and Kristóf Karacs},
year={2026},
eprint={2602.15899},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.15899},
}

