XingruiWang/Spatial457

¹Johns Hopkins University    ²DEVCOM Army Research Laboratory

Project Page / Paper / Huggingface Data Card 🤗 / Code

Official implementation of the CVPR 2025 (Highlight) paper:
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models

🧠 Introduction

Spatial457 is a diagnostic benchmark designed to evaluate the 6D spatial reasoning capabilities of large multimodal models (LMMs). It systematically introduces four key capabilities (multi-object understanding, 2D localization, 3D localization, and 3D orientation) across five difficulty levels and seven question types, progressing from basic recognition to complex physical interaction.

🚀 Quick Start

For Evaluation Only

If you just want to evaluate models on Spatial457:

# Install VLMEvalKit
git clone https://github.com/open-compass/VLMEvalKit
cd VLMEvalKit
pip install -e .

# Run evaluation
python run.py --data Spatial457 --model <model_name>

The dataset (images + questions) will be automatically downloaded from Hugging Face.

For Custom Dataset Generation

If you want to generate your own images and questions:

# 1. Clone this repository
git clone https://github.com/XingruiWang/Spatial457.git
cd Spatial457

# 2. Download 3D models and assets (required!)
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip
cd ..

# 3. Install dependencies
pip install bpy==3.5.0
pip install -r requirement.txt

# 4. Generate images
bash scripts/render_images_realistic.sh

# 5. Generate questions
bash scripts/generate_questions.sh

📦 Download

Evaluation Dataset

The full evaluation dataset is hosted on Hugging Face (dataset ID RyanWW/Spatial457), and the evaluation toolkit is available through VLMEvalKit.

Generation Assets (Required for Custom Dataset Generation)

If you want to generate your own images and questions, download the 3D models and assets:

  • 3D Models & Assets (data.zip): Hugging Face - spatial457_meta_data
  • Size: ~2GB compressed, ~2.1GB uncompressed
  • Contents: CGPart 3D models, colored variants, textures, HDRI maps, Blender scenes

Quick download:

# Download 3D models and assets
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip

🔥 Run Benchmark with VLMEvalKit

Spatial457 is also supported by VLMEvalKit, which allows quick evaluation of most VLMs. To evaluate a model, run run.py in VLMEvalKit:

python run.py --data Spatial457 --model <model_name>
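
To evaluate several models in one go, the command above can be assembled programmatically. The following is a minimal sketch, not part of VLMEvalKit: the helper `build_eval_command` is a hypothetical name, only the `--data` and `--model` flags shown above are used, and the model names are placeholders that must be replaced with names VLMEvalKit actually registers.

```python
import shlex
from typing import List


def build_eval_command(model_name: str, data: str = "Spatial457") -> List[str]:
    """Return the argument list for one VLMEvalKit run (hypothetical helper)."""
    return ["python", "run.py", "--data", data, "--model", model_name]


if __name__ == "__main__":
    # Placeholder names (assumption): substitute real VLMEvalKit model names.
    for name in ["<model_a>", "<model_b>"]:
        # Print the command instead of running it, so it can be reviewed first.
        print(shlex.join(build_eval_command(name)))
```

Each printed line can be pasted into a shell inside the VLMEvalKit directory, or passed to `subprocess.run` as a list.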

🛠️ Dataset Generation

Scenes are rendered with Blender, so you can add custom objects to the dataset. You can also define your own question types and templates for your studies.

📥 Prerequisites: Download Required Data

Before generating images or questions, you need to download the required 3D models and assets:

Download data.zip

Download the data package from Hugging Face:

# Option 1: Using git clone (recommended)
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip

# Option 2: Direct download (run from inside image_generation/)
wget https://huggingface.co/datasets/RyanWW/spatial457_meta_data/resolve/main/image_generation/data.zip
unzip data.zip

What's in data.zip?

The data.zip file (~2GB compressed, ~2.1GB uncompressed) contains 2,188 files including:

data/
├── CGParts_colored/          # Colored 3D object models
│   ├── aeroplane/           # Aircraft models (various types)
│   ├── bicycle/             # Bicycle models
│   ├── bus/                 # Bus models (school, double, articulated)
│   ├── car/                 # Car models (various types)
│   └── motorbike/           # Motorbike models
├── CGPart/                   # Original CGPart dataset
│   ├── models/              # Base 3D models
│   ├── labels/              # Part annotations
│   ├── keypoints/           # Keypoint annotations
│   └── partobjs/            # Part-level objects
├── base_scene2.blend        # Base Blender scene file
├── HDRI_haven.json          # HDRI environment maps
├── colors.json              # Color definitions
├── properties_cgpart.json   # Object properties
└── save_models_1/           # Additional model data
    └── part_dict.json       # Part dictionary

Each object category contains:

  • models/: Normalized 3D models (.obj, .mtl, .binvox, .json)
  • Colored variants: 8 color versions (red, blue, yellow, green, cyan, purple, brown, gray)
  • images/: Texture files for realistic rendering
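
Before rendering, it can help to confirm that the archive was extracted completely. The sketch below checks only the top-level entries named in the directory tree above; `missing_entries` is a hypothetical helper, and the path `image_generation/data` assumes data.zip was unzipped inside image_generation/ as in the download steps.

```python
from pathlib import Path
from typing import List

# Entries taken from the directory tree listed above.
EXPECTED_ENTRIES = [
    "CGParts_colored",
    "CGPart",
    "base_scene2.blend",
    "HDRI_haven.json",
    "colors.json",
    "properties_cgpart.json",
    "save_models_1/part_dict.json",
]


def missing_entries(data_root) -> List[str]:
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Assumed location of the extracted archive.
    missing = missing_entries("image_generation/data")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("All expected entries found.")
```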

🎨 Generate Images

After downloading and extracting data.zip, you can generate images:

Installation

See image_generation/README.md for detailed setup instructions.

pip install bpy==3.5.0
pip install -r requirement.txt

Run Image Generation

bash scripts/render_images_realistic.sh

Note (Reproducing scenes): If you want to reproduce scenes from pre-generated data (i.e., using --load_scene 1), download spatial457_scenes_21k.json from the Hugging Face dataset, then set --clevr_scene_path in scripts/render_images_realistic.sh (lines 22–23) to the downloaded path:

# Download: huggingface-cli download RyanWW/Spatial457 spatial457_scenes_21k.json --repo-type dataset --local-dir output/
# In render_images_realistic.sh, ensure: --clevr_scene_path $SPATIAL457_DIR/output/spatial457_scenes_21k.json

This will generate:

  • Images: Rendered RGB images in output/ver_realistic/images/
  • Scene annotations: JSON files with 3D scene information in output/ver_realistic/scenes/
  • Scene metadata: Combined scene file output/ver_realistic/superCLEVR_scenes.json
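
After a render run, a quick consistency check can catch images that have no matching scene annotation. This is a minimal sketch under two assumptions that the text above does not confirm: images are written as .png files, and each scene JSON shares its file stem with the corresponding image. `images_without_scene` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def images_without_scene(output_dir) -> List[str]:
    """List image file names that have no same-stem JSON file in scenes/."""
    root = Path(output_dir)
    # Assumption: scene files are named <image stem>.json.
    scene_stems = {p.stem for p in (root / "scenes").glob("*.json")}
    return sorted(
        p.name for p in (root / "images").glob("*.png") if p.stem not in scene_stems
    )


if __name__ == "__main__":
    orphans = images_without_scene("output/ver_realistic")
    print(f"{len(orphans)} image(s) without a scene annotation")
```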

❓ Generate Questions

After generating images and scene annotations, generate VQA questions:

Run Question Generation

Set input_scene_file to your scene annotation JSON file:

bash scripts/generate_questions.sh

This script will generate questions for all difficulty levels (L1-L5) and question types:

  • L1: Single object identification
  • L2: Multi-object understanding
  • L3: 2D spatial reasoning
  • L4: Object occlusion and 3D pose
  • L5: 6D spatial reasoning and collision detection

Output

Generated questions will be saved in JSON format with the following structure:

{
  "image_filename": "superCLEVR_new_000001.png",
  "question": "Is the large red object in front of the yellow car?",
  "answer": "True",
  "program": [...],
  "question_index": 100001
}
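
A record with this structure can be checked and scored with a few lines of Python. The sketch below is illustrative only and is not the scoring used by VLMEvalKit: the helper names are hypothetical, the elided "program" list is replaced by an empty list in the sample, and answers are compared as case-insensitive strings, which is an assumption.

```python
import json
from typing import Dict, Iterable, List

# Field names taken from the example record above.
REQUIRED_FIELDS = ("image_filename", "question", "answer", "program", "question_index")


def missing_fields(record: dict) -> List[str]:
    """Return the required field names that are absent from a question record."""
    return [field for field in REQUIRED_FIELDS if field not in record]


def accuracy(records: Iterable[dict], predictions: Dict[int, str]) -> float:
    """Fraction of records whose answer matches the prediction for its question_index."""
    records = list(records)
    if not records:
        return 0.0
    correct = sum(
        1
        for r in records
        # A missing prediction becomes "" and therefore counts as wrong
        # for any non-empty answer.
        if str(predictions.get(r["question_index"], "")).strip().lower()
        == str(r["answer"]).strip().lower()
    )
    return correct / len(records)


if __name__ == "__main__":
    sample = json.loads(
        """
        {
          "image_filename": "superCLEVR_new_000001.png",
          "question": "Is the large red object in front of the yellow car?",
          "answer": "True",
          "program": [],
          "question_index": 100001
        }
        """
    )
    print("missing fields:", missing_fields(sample))
    # Hypothetical model output keyed by question_index.
    print("accuracy:", accuracy([sample], {100001: "true"}))
```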

Citation

@inproceedings{wang2025spatial457,
  title     = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
  author    = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.08636}
}

Content and toolkit are actively being updated. Stay tuned!

About

[CVPR'25] A vision question answering (VQA) benchmark for 6D spatial reasoning.
