- Ego3D-Bench: A benchmark of 8,600+ human-verified QA pairs for evaluating VLMs in ego-centric, multi-view outdoor environments.
- Ego3D-VLM: A post-training framework that builds textual cognitive maps from estimated global 3D coordinates, improving QA accuracy by an average of 12% and absolute distance estimation by 56%.
- Impact: Together, Ego3D-Bench and Ego3D-VLM move VLMs closer to human-level 3D spatial understanding in real-world settings.
Benchmark Overview: We introduce Ego3D-Bench, a benchmark designed to evaluate the spatial understanding of VLMs in ego-centric, multi-view scenarios. Images are collected from three datasets: NuScenes, Argoverse, and Waymo. Questions are designed to require cross-view reasoning. We define questions both from the ego perspective and from the perspective of objects in the scene, and categorize each question as ego-centric or object-centric accordingly. In total, there are 10 question types: 8 multiple-choice QAs and 2 exact-number QAs.
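To make the structure concrete, the sketch below shows one plausible layout for a single benchmark record and a minimal scoring loop. The field names, camera file names, and the `accuracy` helper are assumptions for illustration; the released benchmark's actual format may differ.

```python
# Hypothetical Ego3D-Bench record layout; actual field names in the released
# benchmark may differ.
sample = {
    "scene_id": "nuscenes_scene_0001",         # source dataset + scene
    "images": ["CAM_FRONT.jpg", "CAM_LEFT.jpg", "CAM_BACK.jpg"],  # multi-view set
    "perspective": "ego-centric",              # or "object-centric"
    "question_type": "multiple_choice",        # 8 such types; 2 more are exact-number
    "question": "Which object is closest to the ego vehicle?",
    "choices": ["A. pedestrian", "B. truck", "C. traffic cone", "D. sedan"],
    "answer": "B",
}

def accuracy(predict_fn, records):
    """Fraction of records where the model's answer matches the ground truth."""
    correct = sum(
        predict_fn(r["images"], r["question"], r.get("choices")) == r["answer"]
        for r in records
    )
    return correct / len(records)
```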
Ego3D-VLM is a post-training framework that enhances the 3D spatial reasoning of VLMs. Given the input images and the question, Ego3D-VLM estimates the global 3D coordinates of the objects mentioned in the prompt and builds a textual cognitive map from them. This approach yields an average improvement of 12% on multiple-choice QA and 56% on absolute distance estimation.
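As a rough illustration of the idea, the sketch below serializes estimated object coordinates into a textual cognitive map that can be prepended to the question. The function name `build_cognitive_map`, the coordinate convention (ego at origin, x right, y forward, z up), and the map wording are assumptions, not the paper's exact format.

```python
from typing import Dict, Tuple

def build_cognitive_map(objects: Dict[str, Tuple[float, float, float]]) -> str:
    """Turn estimated global 3D coordinates into a textual cognitive map.

    `objects` maps an object label to its estimated (x, y, z) position in
    meters, in an ego-centered frame (x: right, y: forward, z: up). This
    convention is assumed for illustration only.
    """
    lines = ["Cognitive map (ego at origin, coordinates in meters):"]
    for label, (x, y, z) in sorted(objects.items()):
        dist = (x**2 + y**2 + z**2) ** 0.5
        side = "right" if x >= 0 else "left"
        fb = "ahead" if y >= 0 else "behind"
        lines.append(
            f"- {label}: ({x:.1f}, {y:.1f}, {z:.1f}), "
            f"{dist:.1f} m away, {fb} and to the {side}"
        )
    return "\n".join(lines)

# The map text is prepended to the question before querying the VLM.
objs = {"truck": (4.0, 12.5, 0.0), "pedestrian": (-2.0, 6.0, 0.0)}
prompt = build_cognitive_map(objs) + "\nQuestion: Which object is closer?"
print(prompt)
```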