- Ego3D-Bench: A benchmark of 8,600+ human-verified QA pairs for evaluating VLMs in ego-centric, multi-view outdoor environments.
- Ego3D-VLM: A post-training framework that builds textual cognitive maps from estimated global 3D coordinates, improving QA accuracy by an average of 12% and absolute distance estimation by 56%.
- Impact: Together, Ego3D-Bench and Ego3D-VLM move VLMs closer to human-level 3D spatial understanding in real-world settings.
Benchmark Overview: We introduce Ego3D-Bench, a benchmark designed to evaluate the spatial understanding of VLMs in ego-centric, multi-view scenarios. Images are collected from three datasets: NuScenes, Argoverse, and Waymo. Questions are designed to require cross-view reasoning. We define questions both from the ego perspective and from the perspective of objects in the scene, and categorize each question as ego-centric or object-centric accordingly. In total, there are 10 question types: 8 multiple-choice QAs and 2 exact-number QAs.
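To make the structure concrete, the sketch below shows one plausible layout for a single benchmark record and a minimal scoring loop. The field names, camera file names, and the `accuracy` helper are assumptions for illustration; the released benchmark's actual format may differ.

```python
# Hypothetical Ego3D-Bench record layout; actual field names in the released
# benchmark may differ.
sample = {
    "scene_id": "nuscenes_scene_0001",         # source dataset + scene
    "images": ["CAM_FRONT.jpg", "CAM_LEFT.jpg", "CAM_BACK.jpg"],  # multi-view set
    "perspective": "ego-centric",              # or "object-centric"
    "question_type": "multiple_choice",        # 8 such types; 2 more are exact-number
    "question": "Which object is closest to the ego vehicle?",
    "choices": ["A. pedestrian", "B. truck", "C. traffic cone", "D. sedan"],
    "answer": "B",
}

def accuracy(predict_fn, records):
    """Fraction of records where the model's answer matches the ground truth."""
    correct = sum(
        predict_fn(r["images"], r["question"], r.get("choices")) == r["answer"]
        for r in records
    )
    return correct / len(records)
```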
Ego3D-VLM is a post-training framework that enhances the 3D spatial reasoning of VLMs. Given the input images and the question, Ego3D-VLM estimates the global 3D coordinates of the objects mentioned in the prompt and builds a textual cognitive map from them. This approach yields an average improvement of 12% on multiple-choice QA and 56% on absolute distance estimation.
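As a rough illustration of the idea, the sketch below serializes estimated object coordinates into a textual cognitive map that can be prepended to the question. The function name `build_cognitive_map`, the coordinate convention (ego at origin, x right, y forward, z up), and the map wording are assumptions, not the paper's exact format.

```python
from typing import Dict, Tuple

def build_cognitive_map(objects: Dict[str, Tuple[float, float, float]]) -> str:
    """Turn estimated global 3D coordinates into a textual cognitive map.

    `objects` maps an object label to its estimated (x, y, z) position in
    meters, in an ego-centered frame (x: right, y: forward, z: up). This
    convention is assumed for illustration only.
    """
    lines = ["Cognitive map (ego at origin, coordinates in meters):"]
    for label, (x, y, z) in sorted(objects.items()):
        dist = (x**2 + y**2 + z**2) ** 0.5
        side = "right" if x >= 0 else "left"
        fb = "ahead" if y >= 0 else "behind"
        lines.append(
            f"- {label}: ({x:.1f}, {y:.1f}, {z:.1f}), "
            f"{dist:.1f} m away, {fb} and to the {side}"
        )
    return "\n".join(lines)

# The map text is prepended to the question before querying the VLM.
objs = {"truck": (4.0, 12.5, 0.0), "pedestrian": (-2.0, 6.0, 0.0)}
prompt = build_cognitive_map(objs) + "\nQuestion: Which object is closer?"
print(prompt)
```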