FlipSet Benchmark

A compact diagnostic benchmark for evaluating Level-2 Visual Perspective Taking (L2 VPT) in Vision-Language Models (VLMs).

Overview

FlipSet is a controlled benchmark designed to evaluate whether VLMs can perform 180-degree mental rotation from another agent's viewpoint. The task isolates L2 VPT from complex 3D spatial reasoning by using simple 2D string transformations.
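As an illustration only (the repository does not specify the exact item or answer format), the core 180-degree flip on a 2D string can be sketched as a character-order reversal; note that a real rear view would also mirror the glyph shapes, which this sketch ignores:

```python
# Hypothetical sketch of the 180-degree viewpoint flip on a 2D string.
# FlipSet's actual item construction and answer format may differ.
def flip_view(text: str) -> str:
    """Character order as seen by an agent facing the text from the opposite side."""
    return text[::-1]

print(flip_view("FLIP"))  # PILF
```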

Project Structure

flipset_benchmark/
├── data/                           # Data files
│   ├── raw/                        # Raw data
│   │   ├── control_experiment/     # Control experiment raw data
│   │   │   └── control_experiment_summary.json
│   │   └── main_experiment/        # Main experiment raw data
│   │       ├── main_experiment_items.csv
│   │       └── main_experiment_summary.json
│   └── processed/                  # Processed data
│       ├── control_experiment/     # Control experiment processed data
│       │   ├── l1_accuracy.csv
│       │   └── mental_rotation_accuracy.csv
│       └── main_experiment/        # Main experiment processed data
│           ├── l2_accuracy.csv
│           └── models_L2_error_type_summary.csv
├── code/                           # Python source code (executable modules)
│   ├── data_processing/            # Data processing Python scripts
│   │   ├── generate_models_summary.py
│   │   ├── generate_control_accuracies.py
│   │   └── generate_l2_accuracy.py
│   └── visualization/              # Figure generation Python scripts
│       ├── plot_figure2_error_type_distribution.py
│       ├── plot_figure3_control_experiment.py
│       ├── plot_figure4_egocentric_vs_confusable.py
│       └── plot_figureA1_error_curves_by_condition.py
├── figures/                        # Generated figures (PDF)
│   ├── figure2_error_type_distribution.pdf
│   ├── figure3_control_experiment.pdf
│   ├── figure4_egocentric_vs_confusable.pdf
│   └── figureA1_error_curves_by_condition.pdf
├── scripts/                        # Utility scripts (batch execution tools)
│   └── generate_all_figures.sh     # Shell script to run all visualization scripts
├── paper/                          # Paper LaTeX source
│   ├── paper.tex
│   └── figures/
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── FILE_MAPPING.md                 # File mapping documentation

Installation

  1. Clone this repository:

     git clone <repository-url>
     cd flipset_benchmark

  2. Install dependencies:

     pip install -r requirements.txt

Quick Start

Generate All Figures

To generate all figures used in the paper:

./scripts/generate_all_figures.sh

Or run individual scripts:

# Figure 2: Error type distribution pie chart
python3 code/visualization/plot_figure2_error_type_distribution.py

# Figure 3: Control experiment plot
python3 code/visualization/plot_figure3_control_experiment.py

# Figure 4: Egocentric vs Confusable scatter plot
python3 code/visualization/plot_figure4_egocentric_vs_confusable.py

# Figure A1: Error curves by condition
python3 code/visualization/plot_figureA1_error_curves_by_condition.py

Data Processing

Process raw data into CSV format:

# Generate models summary from item-level data
python3 code/data_processing/generate_models_summary.py

# Generate control experiment accuracies (L1 and MR)
python3 code/data_processing/generate_control_accuracies.py

# Generate L2 accuracy from main experiment summary
python3 code/data_processing/generate_l2_accuracy.py

Experiment Design

Main Experiment vs Control Experiment

The benchmark consists of two complementary experiments with different model sets:

  • Main Experiment: Evaluates 103 models on Level-2 VPT (L2) tasks
  • Control Experiment: Evaluates 53 models on Level-1 VPT (L1) and Mental Rotation (MR) tasks

Model Overlap: Only 14 models are evaluated in all three tasks (L1, L2, and MR):

  1. InternVL2_5-1B
  2. InternVL2_5-2B
  3. InternVL2_5-4B
  4. InternVL2_5-8B
  5. InternVL2_5-26B
  6. InternVL2_5-38B
  7. gemma-3-4b-it
  8. gemma-3-12b-it
  9. gemma-3-27b-it
  10. llava-onevision-qwen2-0.5b-ov-hf
  11. llava-onevision-qwen2-0.5b-si-hf
  12. llava-onevision-qwen2-7b-ov-chat-hf
  13. llava-onevision-qwen2-7b-ov-hf
  14. llava-onevision-qwen2-7b-si-hf

Important Note: The control experiment analysis and Figure 3 (figure3_control_experiment.pdf) are based on these 14 overlapping models that have data for all three tasks (L1, L2, and MR). This allows for direct comparison and correlation analysis across the three cognitive tasks.
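Selecting this overlap amounts to a set intersection over model names across the three accuracy files. A minimal sketch, using toy inline stand-ins for the CSVs (a single "model" column is assumed; the repository's actual column layout may differ):

```python
import csv
import io

# Toy stand-ins for l2_accuracy.csv, l1_accuracy.csv, and mental_rotation_accuracy.csv.
l2_csv = "model,accuracy\nInternVL2_5-1B,0.08\ngemma-3-4b-it,0.05\nl2-only-model,0.10\n"
l1_csv = "model,accuracy\nInternVL2_5-1B,0.95\ngemma-3-4b-it,0.90\n"
mr_csv = "model,accuracy\nInternVL2_5-1B,0.40\ngemma-3-4b-it,0.30\n"

def model_set(text: str) -> set[str]:
    """Collect the model names listed in one accuracy CSV."""
    return {row["model"] for row in csv.DictReader(io.StringIO(text))}

# Models with data for all three tasks.
overlap = model_set(l2_csv) & model_set(l1_csv) & model_set(mr_csv)
print(sorted(overlap))  # ['InternVL2_5-1B', 'gemma-3-4b-it']
```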

Data Files

Main Experiment Data

Processed Data:

  • data/processed/main_experiment/models_L2_error_type_summary.csv - Complete evaluation results for 103 models (overall accuracy, error rates, pipeline, parameter size)
  • data/processed/main_experiment/l2_accuracy.csv - Level 2 VPT accuracy for each model

Raw Data:

  • data/raw/main_experiment/main_experiment_items.csv - Item-level evaluation data (34,608 rows, 103 models)
  • data/raw/main_experiment/main_experiment_summary.json - Detailed summary in JSON format (model → eval-id → score)
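The nested summary layout described above (model → eval-id → score) can be reduced to per-model accuracy in a few lines. The JSON below is a toy instance of that assumed structure, not real benchmark data:

```python
import json

# Toy instance of the assumed layout: model -> eval-id -> score (1 = correct, 0 = wrong).
summary = json.loads("""
{
  "model-a": {"item_001": 1, "item_002": 0, "item_003": 0},
  "model-b": {"item_001": 0, "item_002": 1, "item_003": 0}
}
""")

# Mean score per model.
accuracy = {m: sum(scores.values()) / len(scores) for m, scores in summary.items()}
print(accuracy)
```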

Control Experiment Data

Processed Data:

  • data/processed/control_experiment/l1_accuracy.csv - Level 1 VPT accuracy for each model
  • data/processed/control_experiment/mental_rotation_accuracy.csv - Mental Rotation accuracy for each model

Raw Data:

  • data/raw/control_experiment/control_experiment_summary.json - Detailed summary in JSON format with L1 and MR task separation

Key Findings

  • 91.3% of models perform below the 25% chance level
  • 75.88% of errors are egocentric (models repeat front-facing text)
  • Chain-of-Thought reasoning fails to mitigate egocentric bias
  • L1 VPT shows high accuracy (92.9% mean), demonstrating models can understand visibility
  • Mental Rotation shows moderate accuracy (34.9% mean), indicating spatial reasoning capability
  • L2 VPT remains poor (8.4% mean), revealing the core challenge
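The below-chance statistic above is a simple threshold count over per-model L2 accuracies; a sketch with illustrative values (not the benchmark's actual numbers):

```python
# Fraction of models scoring below the 25% chance level.
# Accuracy values here are illustrative placeholders.
accuracies = [0.05, 0.10, 0.02, 0.30, 0.08]

below_chance = sum(a < 0.25 for a in accuracies) / len(accuracies)
print(f"{below_chance:.1%}")  # 80.0%
```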

Figures

All figures are generated in the figures/ directory:

  • figure2_error_type_distribution.pdf - Error type distribution pie chart (calculated from models_L2_error_type_summary.csv, all 103 models)
  • figure3_control_experiment.pdf - Control experiment results (L1, L2, MR performance and correlations, based on 14 overlapping models that have data for all three tasks)
  • figure4_egocentric_vs_confusable.pdf - Egocentric-Confusable trade-off scatter plot (all 103 models)
  • figureA1_error_curves_by_condition.pdf - Error rates across 12 counter-balanced conditions (all 103 models)

Scripts

Directory Structure

  • code/: Python source code modules that can be run independently
    • code/data_processing/: Data processing scripts
    • code/visualization/: Figure generation scripts
  • scripts/: Utility scripts for batch execution and automation
    • scripts/generate_all_figures.sh: Shell script that runs all visualization scripts in sequence

Data Processing Scripts (in code/data_processing/)

  • generate_models_summary.py - Generates models_L2_error_type_summary.csv from main_experiment_items.csv
  • generate_control_accuracies.py - Generates l1_accuracy.csv and mental_rotation_accuracy.csv from control_experiment_summary.json
  • generate_l2_accuracy.py - Generates l2_accuracy.csv from main_experiment_summary.json

Visualization Scripts (in code/visualization/)

  • plot_figure2_error_type_distribution.py - Generates Figure 2 (error type pie chart)
  • plot_figure3_control_experiment.py - Generates Figure 3 (control experiment results, automatically selects 14 overlapping models with data for L1, L2, and MR)
  • plot_figure4_egocentric_vs_confusable.py - Generates Figure 4 (egocentric vs confusable scatter plot)
  • plot_figureA1_error_curves_by_condition.py - Generates Figure A1 (error curves by condition)

Utility Scripts (in scripts/)

  • generate_all_figures.sh - Batch script to generate all figures at once (runs all visualization scripts in sequence)

Citation

If you use this benchmark, please cite:

@article{flipset2024,
  title={Egocentric Bias in Vision-Language Models},
  author={...},
  journal={...},
  year={2024}
}

License

[Specify your license here]

Contact

For questions or issues, please open an issue on GitHub or contact [your email].
