1Johns Hopkins University
2DEVCOM Army Research Laboratory
Project Page / Paper / Huggingface Data Card 🤗 / Code
Official implementation of the CVPR 2025 (Highlight) paper:
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Spatial457 is a diagnostic benchmark designed to evaluate the 6D spatial reasoning capabilities of large multimodal models (LMMs). It systematically introduces four key capabilities—multi-object understanding, 2D and 3D localization, and 3D orientation—across five difficulty levels and seven question types, progressing from basic recognition to complex physical interaction.
If you just want to evaluate models on Spatial457:
# Install VLMEvalKit
git clone https://github.com/open-compass/VLMEvalKit
cd VLMEvalKit
pip install -e .
# Run evaluation
python run.py --data Spatial457 --model <model_name>

The dataset (images + questions) will be automatically downloaded from Hugging Face.
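If you prefer to score predictions yourself rather than relying on the toolkit's report, a minimal sketch of exact-match accuracy is below. The record fields (`question_index`, `answer`) follow the question format shown later in this README; the `predictions` mapping is hypothetical and stands in for whatever your model produces.

```python
def normalize(ans):
    """Lowercase and strip an answer string for exact-match comparison."""
    return str(ans).strip().lower()

def accuracy(records, predictions):
    """Exact-match accuracy.

    records: list of dicts with "question_index" and "answer" keys
    predictions: dict mapping question_index -> predicted answer string
    """
    if not records:
        return 0.0
    correct = sum(
        1
        for r in records
        if normalize(predictions.get(r["question_index"], "")) == normalize(r["answer"])
    )
    return correct / len(records)
```

Note that exact match is a simplification; free-form model outputs usually need answer extraction first.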
If you want to generate your own images and questions:
# 1. Clone this repository
git clone https://github.com/XingruiWang/Spatial457.git
cd Spatial457
# 2. Download 3D models and assets (required!)
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip
cd ..
# 3. Install dependencies
pip install bpy==3.5.0
pip install -r requirement.txt
# 4. Generate images
bash scripts/render_images_realistic.sh
# 5. Generate questions
bash scripts/generate_questions.sh

You can access the full evaluation dataset and toolkit:
- Dataset (Questions & Images): Hugging Face - Spatial457
- Code: GitHub Repository
- Paper: arXiv 2502.08636
If you want to generate your own images and questions, download the 3D models and assets:
- 3D Models & Assets (data.zip): Hugging Face - spatial457_meta_data
- Size: ~2GB compressed, ~2.1GB uncompressed
- Contents: CGPart 3D models, colored variants, textures, HDRI maps, Blender scenes
Quick download:
# Download 3D models and assets
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip

🔥 Run benchmark with VLMEvalKit.
Spatial457 is also supported by VLMEvalKit! Please try it for quick evaluation on most VLMs. Evaluation can be done by running run.py in VLMEvalKit:
python run.py --data Spatial457 --model <model_name>
We use Blender to render the scenes, so you can also add customized objects into the dataset. We also support customizing your own question types/templates for your studies.
Before generating images or questions, you need to download the required 3D models and assets:
Download the data package from Hugging Face:
# Option 1: Using git clone (recommended)
cd image_generation
git clone https://huggingface.co/datasets/RyanWW/spatial457_meta_data
cp spatial457_meta_data/image_generation/data.zip .
unzip data.zip
# Option 2: Direct download
wget https://huggingface.co/datasets/RyanWW/spatial457_meta_data/resolve/main/image_generation/data.zip
unzip data.zip

The data.zip file (~2GB compressed, ~2.1GB uncompressed) contains 2,188 files including:
data/
├── CGParts_colored/ # Colored 3D object models
│ ├── aeroplane/ # Aircraft models (various types)
│ ├── bicycle/ # Bicycle models
│ ├── bus/ # Bus models (school, double, articulated)
│ ├── car/ # Car models (various types)
│ └── motorbike/ # Motorbike models
├── CGPart/ # Original CGPart dataset
│ ├── models/ # Base 3D models
│ ├── labels/ # Part annotations
│ ├── keypoints/ # Keypoint annotations
│ └── partobjs/ # Part-level objects
├── base_scene2.blend # Base Blender scene file
├── HDRI_haven.json # HDRI environment maps
├── colors.json # Color definitions
├── properties_cgpart.json # Object properties
└── save_models_1/ # Additional model data
└── part_dict.json # Part dictionary
Each object category contains:
- models/: Normalized 3D models (.obj, .mtl, .binvox, .json)
- Colored variants: 8 color versions (red, blue, yellow, green, cyan, purple, brown, gray)
- images/: Texture files for realistic rendering
After downloading and extracting data.zip, you can generate images:
See image_generation/README.md for detailed setup instructions.
pip install bpy==3.5.0
pip install -r requirement.txt

bash scripts/render_images_realistic.sh

Note (Reproducing scenes): If you want to reproduce scenes from pre-generated data (i.e., using --load_scene 1), download spatial457_scenes_21k.json from the Hugging Face dataset, then set --clevr_scene_path in scripts/render_images_realistic.sh (lines 22–23) to the downloaded path:
# Download: huggingface-cli download RyanWW/Spatial457 spatial457_scenes_21k.json --repo-type dataset --local-dir output/
# In render_images_realistic.sh, ensure: --clevr_scene_path $SPATIAL457_DIR/output/spatial457_scenes_21k.json

This will generate:
- Images: Rendered RGB images in output/ver_realistic/images/
- Scene annotations: JSON files with 3D scene information in output/ver_realistic/scenes/
- Scene metadata: Combined scene file output/ver_realistic/superCLEVR_scenes.json
After generating images and scene annotations, generate VQA questions:
Set input_scene_file to your scene annotation JSON file:
bash scripts/generate_questions.sh

This script will generate questions for all difficulty levels (L1–L5) and question types:
- L1: Single object identification
- L2: Multi-object understanding
- L3: 2D spatial reasoning
- L4: Object occlusion and 3D pose
- L5: 6D spatial reasoning and collision detection
Generated questions will be saved in JSON format with the following structure:
{
"image_filename": "superCLEVR_new_000001.png",
"question": "Is the large red object in front of the yellow car?",
"answer": "True",
"program": [...],
"question_index": 100001
}

@inproceedings{wang2025spatial457,
title = {Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models},
author = {Wang, Xingrui and Ma, Wufei and Zhang, Tiezheng and de Melo, Celso M and Chen, Jieneng and Yuille, Alan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025},
url = {https://arxiv.org/abs/2502.08636}
}
Content and toolkit are actively being updated. Stay tuned!
