A compact diagnostic benchmark for evaluating Level-2 Visual Perspective Taking (L2 VPT) in Vision-Language Models (VLMs).
FlipSet is a controlled benchmark designed to evaluate whether VLMs can perform 180-degree mental rotation from another agent's viewpoint. The task isolates L2 VPT from complex 3D spatial reasoning by using simple 2D string transformations.
```
flipset_benchmark/
├── data/                          # Data files
│   ├── raw/                       # Raw data
│   │   ├── control_experiment/    # Control experiment raw data
│   │   │   └── control_experiment_summary.json
│   │   └── main_experiment/       # Main experiment raw data
│   │       ├── main_experiment_items.csv
│   │       └── main_experiment_summary.json
│   └── processed/                 # Processed data
│       ├── control_experiment/    # Control experiment processed data
│       │   ├── l1_accuracy.csv
│       │   └── mental_rotation_accuracy.csv
│       └── main_experiment/       # Main experiment processed data
│           ├── l2_accuracy.csv
│           └── models_L2_error_type_summary.csv
├── code/                          # Python source code (executable modules)
│   ├── data_processing/           # Data processing Python scripts
│   │   ├── generate_models_summary.py
│   │   ├── generate_control_accuracies.py
│   │   └── generate_l2_accuracy.py
│   └── visualization/             # Figure generation Python scripts
│       ├── plot_figure2_error_type_distribution.py
│       ├── plot_figure3_control_experiment.py
│       ├── plot_figure4_egocentric_vs_confusable.py
│       └── plot_figureA1_error_curves_by_condition.py
├── figures/                       # Generated figures (PDF)
│   ├── figure2_error_type_distribution.pdf
│   ├── figure3_control_experiment.pdf
│   ├── figure4_egocentric_vs_confusable.pdf
│   └── figureA1_error_curves_by_condition.pdf
├── scripts/                       # Utility scripts (batch execution tools)
│   └── generate_all_figures.sh    # Shell script to run all visualization scripts
├── paper/                         # Paper LaTeX source
│   ├── paper.tex
│   └── figures/
├── requirements.txt               # Python dependencies
├── README.md                      # This file
└── FILE_MAPPING.md                # File mapping documentation
```
- Clone this repository:

  ```bash
  git clone <repository-url>
  cd flipset_benchmark
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To generate all figures used in the paper:

```bash
./scripts/generate_all_figures.sh
```

Or run individual scripts:
```bash
# Figure 2: Error type distribution pie chart
python3 code/visualization/plot_figure2_error_type_distribution.py

# Figure 3: Control experiment plot
python3 code/visualization/plot_figure3_control_experiment.py

# Figure 4: Egocentric vs. confusable scatter plot
python3 code/visualization/plot_figure4_egocentric_vs_confusable.py

# Figure A1: Error curves by condition
python3 code/visualization/plot_figureA1_error_curves_by_condition.py
```

Process raw data into CSV format:
```bash
# Generate models summary from item-level data
python3 code/data_processing/generate_models_summary.py

# Generate control experiment accuracies (L1 and MR)
python3 code/data_processing/generate_control_accuracies.py

# Generate L2 accuracy from main experiment summary
python3 code/data_processing/generate_l2_accuracy.py
```

The benchmark consists of two complementary experiments with different model sets:
- Main Experiment: Evaluates 103 models on Level-2 VPT (L2) tasks
- Control Experiment: Evaluates 53 models on Level-1 VPT (L1) and Mental Rotation (MR) tasks
Model Overlap: Only 14 models are evaluated in all three tasks (L1, L2, and MR):
- InternVL2_5-1B
- InternVL2_5-2B
- InternVL2_5-4B
- InternVL2_5-8B
- InternVL2_5-26B
- InternVL2_5-38B
- gemma-3-4b-it
- gemma-3-12b-it
- gemma-3-27b-it
- llava-onevision-qwen2-0.5b-ov-hf
- llava-onevision-qwen2-0.5b-si-hf
- llava-onevision-qwen2-7b-ov-chat-hf
- llava-onevision-qwen2-7b-ov-hf
- llava-onevision-qwen2-7b-si-hf
Important Note: The control experiment analysis and Figure 3 (figure3_control_experiment.pdf) are based on these 14 overlapping models that have data for all three tasks (L1, L2, and MR). This allows for direct comparison and correlation analysis across the three cognitive tasks.
Processed Data:
- `data/processed/main_experiment/models_L2_error_type_summary.csv` - Complete evaluation results for 103 models (overall accuracy, error rates, pipeline, parameter size)
- `data/processed/main_experiment/l2_accuracy.csv` - Level-2 VPT accuracy for each model
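The processed CSVs can be inspected with the standard library alone. The two-column layout sketched below (`model`, `l2_accuracy`) is an assumption about `l2_accuracy.csv`; check the actual header before relying on these names.

```python
import csv
import io

# Inline stand-in for data/processed/main_experiment/l2_accuracy.csv;
# the column names "model" and "l2_accuracy" are assumptions.
sample = (
    "model,l2_accuracy\n"
    "InternVL2_5-8B,0.12\n"
    "gemma-3-27b-it,0.09\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
# In the repository, replace io.StringIO(sample) with
# open("data/processed/main_experiment/l2_accuracy.csv").
best = max(rows, key=lambda r: float(r["l2_accuracy"]))
print(best["model"], best["l2_accuracy"])
```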
Raw Data:
- `data/raw/main_experiment/main_experiment_items.csv` - Item-level evaluation data (34,608 rows; 103 models)
- `data/raw/main_experiment/main_experiment_summary.json` - Detailed summary in JSON format (model → eval-id → score)
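Given the `model → eval-id → score` nesting of the summary JSON, per-model accuracy can be recovered with a short traversal. The sketch below uses an inline stand-in dictionary and assumes scores are 0/1 correctness flags; confirm both against the real file.

```python
import json

# Inline stand-in mirroring the model -> eval-id -> score nesting of
# main_experiment_summary.json (scores assumed to be 0/1 correctness flags).
summary = json.loads("""
{
  "gemma-3-4b-it": {"item_001": 1, "item_002": 0},
  "InternVL2_5-1B": {"item_001": 0, "item_002": 0}
}
""")

# Mean score per model = accuracy under the 0/1 assumption.
accuracy = {
    model: sum(scores.values()) / len(scores)
    for model, scores in summary.items()
}
print(accuracy)
```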
Processed Data:
- `data/processed/control_experiment/l1_accuracy.csv` - Level-1 VPT accuracy for each model
- `data/processed/control_experiment/mental_rotation_accuracy.csv` - Mental Rotation accuracy for each model
Raw Data:
- `data/raw/control_experiment/control_experiment_summary.json` - Detailed summary in JSON format, with L1 and MR tasks separated
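Since the control summary keeps L1 and MR results separated, per-task accuracies can be computed in one pass. The key names `"L1"` and `"MR"` below are assumptions about the JSON layout, not confirmed field names.

```python
# Inline stand-in for control_experiment_summary.json; the "L1"/"MR" keys
# and 0/1 scores are assumptions about the layout, not confirmed names.
control = {
    "gemma-3-12b-it": {
        "L1": {"item_001": 1, "item_002": 1},
        "MR": {"item_001": 0, "item_002": 1},
    },
}

# Per-model, per-task mean score.
per_task = {
    model: {task: sum(s.values()) / len(s) for task, s in tasks.items()}
    for model, tasks in control.items()
}
print(per_task)
```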
- 91.3% of models perform below chance level (25%)
- 75.88% of errors are egocentric (models repeat the front-facing text)
- Chain-of-Thought reasoning fails to mitigate egocentric bias
- L1 VPT shows high accuracy (92.9% mean), demonstrating that models can understand visibility
- Mental Rotation shows moderate accuracy (34.9% mean), indicating some spatial reasoning capability
- L2 VPT remains poor (8.4% mean), revealing the core challenge
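The below-chance statistic is straightforward to reproduce from any accuracy column: count the fraction of models under the 25% chance level reported above. The accuracies below are illustrative placeholders, not the benchmark's real numbers.

```python
CHANCE = 0.25  # chance level reported above

# Illustrative L2 accuracies, not real benchmark results.
accuracies = [0.08, 0.31, 0.12, 0.05]

# Fraction of models scoring strictly below chance.
below_chance = sum(a < CHANCE for a in accuracies) / len(accuracies)
print(f"{below_chance:.1%} of models below chance")
```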
All figures are generated in the `figures/` directory:

- `figure2_error_type_distribution.pdf` - Error type distribution pie chart (computed from `models_L2_error_type_summary.csv`; all 103 models)
- `figure3_control_experiment.pdf` - Control experiment results (L1, L2, and MR performance and correlations; based on the 14 overlapping models with data for all three tasks)
- `figure4_egocentric_vs_confusable.pdf` - Egocentric vs. confusable trade-off scatter plot (all 103 models)
- `figureA1_error_curves_by_condition.pdf` - Error rates across the 12 counterbalanced conditions (all 103 models)
- `code/`: Python source code modules that can be run independently
  - `code/data_processing/`: Data processing scripts (Python)
  - `code/visualization/`: Figure generation scripts (Python)
- `scripts/`: Utility scripts for batch execution and automation
  - `scripts/generate_all_figures.sh`: Shell script that runs all visualization scripts in sequence
- `generate_models_summary.py` - Generates `models_L2_error_type_summary.csv` from `main_experiment_items.csv`
- `generate_control_accuracies.py` - Generates `l1_accuracy.csv` and `mental_rotation_accuracy.csv` from `control_experiment_summary.json`
- `generate_l2_accuracy.py` - Generates `l2_accuracy.csv` from `main_experiment_summary.json`
- `plot_figure2_error_type_distribution.py` - Generates Figure 2 (error type pie chart)
- `plot_figure3_control_experiment.py` - Generates Figure 3 (control experiment results; automatically selects the 14 overlapping models with data for L1, L2, and MR)
- `plot_figure4_egocentric_vs_confusable.py` - Generates Figure 4 (egocentric vs. confusable scatter plot)
- `plot_figureA1_error_curves_by_condition.py` - Generates Figure A1 (error curves by condition)
- `generate_all_figures.sh` - Batch script that generates all figures at once (runs all visualization scripts in sequence)
If you use this benchmark, please cite:
```bibtex
@article{flipset2024,
  title={Egocentric Bias in Vision-Language Models},
  author={...},
  journal={...},
  year={2024}
}
```

[Specify your license here]
For questions or issues, please open an issue on GitHub or contact [your email].