SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation

Paper

Anna Gelencsér-Horváth* · Gergely Dinya* · Péter Halász · Dorka Erős · Islam Muhammad Muqsit · Kristóf Karacs

* Equal contribution. Corresponding author.

SceneVGGT is a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping for autonomous and assistive navigation. It supports online, real-time processing of streamed data (e.g., from an iPhone Pro). The pipeline’s GPU memory usage stays under 17 GB irrespective of the input sequence length, and it achieves competitive point-cloud performance on the ScanNet++ benchmark. Overall, SceneVGGT provides robust semantic identification and is fast enough to support interactive assistive navigation with audio feedback.

News

  • [2026/4/30] Paper accepted for the IEEE ICIP 2026 conference.
  • [2026/2/13] Paper released on arXiv.
  • [2026/2/12] Code release.

Overview

SceneVGGT enables temporally coherent 3D semantic mapping by lifting 2D instance masks into 3D and tracking instances with the VGGT tracking head. Persistent object identities combined with timestamps provide computationally efficient, temporally consistent change detection, while floor-plane projection of object locations supports downstream assistive navigation, including a proof-of-concept navigation module.
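To make the mask-lifting and floor-plane projection ideas concrete, here is a minimal, generic sketch in pure Python. It is not taken from the SceneVGGT code base. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a per-pixel depth map, and a floor plane given as a unit normal with an offset. All function names are hypothetical, and the actual pipeline may obtain 3D points differently (for example, directly from VGGT predictions).

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def lift_mask_to_3d(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point3]:
    """Back-project masked pixels with positive depth into camera-frame 3D points.

    Assumes a pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    points: List[Point3] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if inside and z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def centroid(points: Sequence[Point3]) -> Point3:
    """Mean of a non-empty set of 3D points, used here as the object location."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def project_to_floor(point: Point3, normal: Point3, offset: float) -> Point3:
    """Orthogonally project a point onto the plane normal . x + offset = 0.

    The normal is assumed to have unit length.
    """
    signed_dist = sum(n * p for n, p in zip(normal, point)) + offset
    x, y, z = (p - signed_dist * n for p, n in zip(point, normal))
    return (x, y, z)


# Toy 2x2 example: two masked pixels at 2 m depth, identity-like intrinsics.
depth_map = [[2.0, 2.0], [2.0, 2.0]]
instance_mask = [[1, 0], [0, 1]]
object_points = lift_mask_to_3d(depth_map, instance_mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
object_center = centroid(object_points)
# Hypothetical floor plane y = 1.5 in the camera frame (y pointing down).
floor_location = project_to_floor(object_center, normal=(0.0, 1.0, 0.0), offset=-1.5)
```

In this toy setup, the object's projected floor location keeps its horizontal coordinates and snaps its height to the plane. A navigation module would need this kind of 2D footprint to reason about where objects are relative to the user.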

3D semantic SLAM and navigation from streaming inputs

Installation

  1. Clone SceneVGGT
git clone git@github.com:HBVC-AI/SceneVGGT.git
cd SceneVGGT
  2. Create conda environment
conda create -n scenevggt python=3.10
conda activate scenevggt
  3. Install requirements
pip install -r requirements.txt

Download Checkpoints

Please download the VGGT model from here.

Evaluation code

Coming soon.

Citation

If you find this project helpful, please consider citing the following paper:

@misc{scenevggt,
      title={SceneVGGT: VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation}, 
      author={Anna Gelencsér-Horváth and Gergely Dinya and Dorka Boglárka Erős and Péter Halász and Islam Muhammad Muqsit and Kristóf Karacs},
      year={2026},
      eprint={2602.15899},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.15899}, 
}

About

SceneVGGT is a spatio-temporal 3D scene understanding framework that combines SLAM with semantic mapping. It supports online, near-real-time processing of streamed data with fixed VRAM usage regardless of input length, making it well suited for online tasks such as autonomous and assistive navigation.
