A compact diagnostic benchmark for evaluating Level-2 Visual Perspective Taking (L2 VPT) in Vision-Language Models (VLMs).
FlipSet is a controlled benchmark designed to evaluate whether VLMs can perform 180-degree mental rotation from another agent's viewpoint. The task isolates L2 VPT from complex 3D spatial reasoning by using simple 2D string transformations.
```
flipset_benchmark/
├── data/                          # Data files
│   ├── raw/                       # Raw data
│   │   ├── control_experiment/    # Control experiment raw data
│   │   │   └── control_experiment_summary.json
│   │   └── main_experiment/       # Main experiment raw data
│   │       ├── main_experiment_items.csv
│   │       └── main_experiment_summary.json
│   └── processed/                 # Processed data
│       ├── control_experiment/    # Control experiment processed data
│       │   ├── l1_accuracy.csv
│       │   └── mental_rotation_accuracy.csv
│       └── main_experiment/       # Main experiment processed data
│           ├── l2_accuracy.csv
│           └── models_L2_error_type_summary.csv
├── code/                          # Python source code (executable modules)
│   ├── data_processing/           # Data processing Python scripts
│   │   ├── generate_models_summary.py
│   │   ├── generate_control_accuracies.py
│   │   └── generate_l2_accuracy.py
│   └── visualization/             # Figure generation Python scripts
│       ├── plot_figure2_error_type_distribution.py
│       ├── plot_figure3_control_experiment.py
│       ├── plot_figure4_egocentric_vs_confusable.py
│       └── plot_figureA1_error_curves_by_condition.py
├── figures/                       # Generated figures (PDF)
│   ├── figure2_error_type_distribution.pdf
│   ├── figure3_control_experiment.pdf
│   ├── figure4_egocentric_vs_confusable.pdf
│   └── figureA1_error_curves_by_condition.pdf
├── scripts/                       # Utility scripts (batch execution tools)
│   └── generate_all_figures.sh    # Shell script to run all visualization scripts
├── paper/                         # Paper LaTeX source
│   ├── paper.tex
│   └── figures/
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── FILE_MAPPING.md                # File mapping documentation
```
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd flipset_benchmark
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To generate all figures used in the paper:

```bash
./scripts/generate_all_figures.sh
```

Or run individual scripts:
```bash
# Figure 2: Error type distribution pie chart
python3 code/visualization/plot_figure2_error_type_distribution.py

# Figure 3: Control experiment plot
python3 code/visualization/plot_figure3_control_experiment.py

# Figure 4: Egocentric vs. confusable scatter plot
python3 code/visualization/plot_figure4_egocentric_vs_confusable.py

# Figure A1: Error curves by condition
python3 code/visualization/plot_figureA1_error_curves_by_condition.py
```

Process raw data into CSV format:
```bash
# Generate models summary from item-level data
python3 code/data_processing/generate_models_summary.py

# Generate control experiment accuracies (L1 and MR)
python3 code/data_processing/generate_control_accuracies.py

# Generate L2 accuracy from main experiment summary
python3 code/data_processing/generate_l2_accuracy.py
```

The benchmark consists of two complementary experiments with different model sets:
- Main Experiment: Evaluates 103 models on Level-2 VPT (L2) tasks
- Control Experiment: Evaluates 53 models on Level-1 VPT (L1) and Mental Rotation (MR) tasks
Model Overlap: Only 14 models are evaluated in all three tasks (L1, L2, and MR):
- InternVL2_5-1B
- InternVL2_5-2B
- InternVL2_5-4B
- InternVL2_5-8B
- InternVL2_5-26B
- InternVL2_5-38B
- gemma-3-4b-it
- gemma-3-12b-it
- gemma-3-27b-it
- llava-onevision-qwen2-0.5b-ov-hf
- llava-onevision-qwen2-0.5b-si-hf
- llava-onevision-qwen2-7b-ov-chat-hf
- llava-onevision-qwen2-7b-ov-hf
- llava-onevision-qwen2-7b-si-hf
Important Note: The control experiment analysis and Figure 3 (figure3_control_experiment.pdf) are based on these 14 overlapping models that have data for all three tasks (L1, L2, and MR). This allows for direct comparison and correlation analysis across the three cognitive tasks.
Processed Data:
- `data/processed/main_experiment/models_L2_error_type_summary.csv` - Complete evaluation results for 103 models (overall accuracy, error rates, pipeline, parameter size)
- `data/processed/main_experiment/l2_accuracy.csv` - Level-2 VPT accuracy for each model
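The processed CSVs can be inspected with the standard library alone. The two-column layout sketched below (`model`, `l2_accuracy`) is an assumption about `l2_accuracy.csv`; check the actual header before relying on these names.

```python
import csv
import io

# Inline stand-in for data/processed/main_experiment/l2_accuracy.csv;
# the column names "model" and "l2_accuracy" are assumptions.
sample = (
    "model,l2_accuracy\n"
    "InternVL2_5-8B,0.12\n"
    "gemma-3-27b-it,0.09\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
# In the repository, replace io.StringIO(sample) with
# open("data/processed/main_experiment/l2_accuracy.csv").
best = max(rows, key=lambda r: float(r["l2_accuracy"]))
print(best["model"], best["l2_accuracy"])
```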
Raw Data:
- `data/raw/main_experiment/main_experiment_items.csv` - Item-level evaluation data (34,608 rows; 103 models)
- `data/raw/main_experiment/main_experiment_summary.json` - Detailed summary in JSON format (model → eval-id → score)
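Given the `model → eval-id → score` nesting of the summary JSON, per-model accuracy can be recovered with a short traversal. The sketch below uses an inline stand-in dictionary and assumes scores are 0/1 correctness flags; confirm both against the real file.

```python
import json

# Inline stand-in mirroring the model -> eval-id -> score nesting of
# main_experiment_summary.json (scores assumed to be 0/1 correctness flags).
summary = json.loads("""
{
  "gemma-3-4b-it": {"item_001": 1, "item_002": 0},
  "InternVL2_5-1B": {"item_001": 0, "item_002": 0}
}
""")

# Mean score per model = accuracy under the 0/1 assumption.
accuracy = {
    model: sum(scores.values()) / len(scores)
    for model, scores in summary.items()
}
print(accuracy)
```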
Processed Data:
- `data/processed/control_experiment/l1_accuracy.csv` - Level-1 VPT accuracy for each model
- `data/processed/control_experiment/mental_rotation_accuracy.csv` - Mental Rotation accuracy for each model
Raw Data:
- `data/raw/control_experiment/control_experiment_summary.json` - Detailed summary in JSON format, with L1 and MR tasks separated
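Since the control summary keeps L1 and MR results separated, per-task accuracies can be computed in one pass. The key names `"L1"` and `"MR"` below are assumptions about the JSON layout, not confirmed field names.

```python
# Inline stand-in for control_experiment_summary.json; the "L1"/"MR" keys
# and 0/1 scores are assumptions about the layout, not confirmed names.
control = {
    "gemma-3-12b-it": {
        "L1": {"item_001": 1, "item_002": 1},
        "MR": {"item_001": 0, "item_002": 1},
    },
}

# Per-model, per-task mean score.
per_task = {
    model: {task: sum(s.values()) / len(s) for task, s in tasks.items()}
    for model, tasks in control.items()
}
print(per_task)
```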
- 91.3% of models perform below chance level (25%)
- 75.88% of errors are egocentric (models repeat the front-facing text)
- Chain-of-Thought reasoning fails to mitigate egocentric bias
- L1 VPT shows high accuracy (92.9% mean), demonstrating that models can understand visibility
- Mental Rotation shows moderate accuracy (34.9% mean), indicating some spatial reasoning capability
- L2 VPT remains poor (8.4% mean), revealing the core challenge
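The below-chance statistic is straightforward to reproduce from any accuracy column: count the fraction of models under the 25% chance level reported above. The accuracies below are illustrative placeholders, not the benchmark's real numbers.

```python
CHANCE = 0.25  # chance level reported above

# Illustrative L2 accuracies, not real benchmark results.
accuracies = [0.08, 0.31, 0.12, 0.05]

# Fraction of models scoring strictly below chance.
below_chance = sum(a < CHANCE for a in accuracies) / len(accuracies)
print(f"{below_chance:.1%} of models below chance")
```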
All figures are generated in the `figures/` directory:

- `figure2_error_type_distribution.pdf` - Error type distribution pie chart (computed from `models_L2_error_type_summary.csv`; all 103 models)
- `figure3_control_experiment.pdf` - Control experiment results (L1, L2, and MR performance and correlations; based on the 14 overlapping models with data for all three tasks)
- `figure4_egocentric_vs_confusable.pdf` - Egocentric vs. confusable trade-off scatter plot (all 103 models)
- `figureA1_error_curves_by_condition.pdf` - Error rates across the 12 counterbalanced conditions (all 103 models)
- `code/`: Python source code modules that can be run independently
  - `code/data_processing/`: Data processing scripts (Python)
  - `code/visualization/`: Figure generation scripts (Python)
- `scripts/`: Utility scripts for batch execution and automation
  - `scripts/generate_all_figures.sh`: Shell script that runs all visualization scripts in sequence
- `generate_models_summary.py` - Generates `models_L2_error_type_summary.csv` from `main_experiment_items.csv`
- `generate_control_accuracies.py` - Generates `l1_accuracy.csv` and `mental_rotation_accuracy.csv` from `control_experiment_summary.json`
- `generate_l2_accuracy.py` - Generates `l2_accuracy.csv` from `main_experiment_summary.json`
- `plot_figure2_error_type_distribution.py` - Generates Figure 2 (error type pie chart)
- `plot_figure3_control_experiment.py` - Generates Figure 3 (control experiment results; automatically selects the 14 overlapping models with data for L1, L2, and MR)
- `plot_figure4_egocentric_vs_confusable.py` - Generates Figure 4 (egocentric vs. confusable scatter plot)
- `plot_figureA1_error_curves_by_condition.py` - Generates Figure A1 (error curves by condition)
- `generate_all_figures.sh` - Batch script that generates all figures at once (runs all visualization scripts in sequence)
If you use this benchmark, please cite:
```bibtex
@article{flipset2024,
  title={Egocentric Bias in Vision-Language Models},
  author={...},
  journal={...},
  year={2024}
}
```

[Specify your license here]
For questions or issues, please open an issue on GitHub or contact [your email].