High-performance evaluation pipeline for SEAM, a benchmark for Vision-Language Models.
SEAM addresses fundamental limitations of existing benchmarks by pairing distinct notation systems with semantically equivalent content across modalities. It leverages domain-specific standardized representations in:
- Chess: Board images vs. FEN strings
- Chemistry: Structural diagrams vs. SMILES strings
- Music: Staff images vs. ABC notation
- Graph Theory: Node-edge diagrams vs. adjacency matrices
SEAM presents both visual-spatial and textual-symbolic representations of the same underlying problems while maintaining semantic equivalence. The benchmark comprises 16 carefully calibrated tasks, each self-contained in both modalities, for a total of 3,200 four-way multiple-choice questions.
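To make the dual representations concrete, the snippet below lists one textual-symbolic example per domain using the standard notations named above (FEN, SMILES, ABC, adjacency matrix). The dictionary itself is purely illustrative and not part of the repository; the visual-spatial counterpart of each entry would be the corresponding rendered image (board diagram, structural drawing, staff, node-edge diagram).

```python
# Illustrative only: one example per domain in its textual-symbolic notation.
EXAMPLES = {
    # Chess: the starting position as a FEN string
    "chess": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    # Chemistry: ethanol as a SMILES string
    "chemistry": "CCO",
    # Music: a C major scale fragment in ABC notation
    "music": "X:1\nM:4/4\nK:C\nCDEF GABc|",
    # Graph theory: a 3-node path graph as an adjacency matrix
    "graph": [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]],
}
```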
- High Concurrency: Producer-worker pattern with 128 concurrent workers by default (see the worker-pool sketch after this list)
- Retry Mechanism: Automatic retry with exponential backoff for connection errors
- Progress Tracking: Real-time progress bar with ETA using tqdm
- Robust Answer Extraction: LLM-based extraction with regex fallback
- Flexible Plotting: Multiple plot types for comprehensive analysis
- Model-Specific Organization: Results saved in results/{model_name}/ directories
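The high-concurrency feature follows a producer-worker pattern. Below is a minimal, illustrative sketch of that pattern rather than the repository's implementation: a producer enqueues work items (e.g. task/mode pairs) and a pool of workers (128 by default) drains the queue. `evaluate_one` is a placeholder for a single model request plus answer extraction.

```python
import queue
import threading

NUM_WORKERS = 128  # default worker count

def evaluate_one(item):
    """Placeholder for one VLM request plus answer extraction."""
    ...

def worker(q: queue.Queue, results: list):
    while True:
        item = q.get()
        if item is None:           # sentinel: no more work for this worker
            q.task_done()
            break
        try:
            results.append(evaluate_one(item))
        finally:
            q.task_done()

def run(items):
    q, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(q, results), daemon=True)
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for item in items:             # producer: enqueue every work item
        q.put(item)
    for _ in threads:              # one sentinel per worker so all of them exit
        q.put(None)
    q.join()
    return results
```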
# Install dependencies
pip install -r requirements.txt
# Or use the setup script
./setup.sh
# Set environment variables for batch APIs (optional)
export AZURE_OPENAI_API_KEY=your_azure_key # For OpenAI batch
export AZURE_OPENAI_ENDPOINT=your_endpoint # For OpenAI batch
export ANTHROPIC_API_KEY=your_anthropic_key # For Claude batch
For models running on OpenAI-compatible API servers:
python run.py --model-name "InternVL3-14B" --model-urls "192.168.55.245:6000"
For GPT-4V and other Azure OpenAI models:
python run_batch.py --provider openai --action all --model gpt-4-vision
For Anthropic Claude models:
python run_batch.py --provider claude --model claude-3-5-sonnet-20241022
With custom options (multiple backend URLs, LLM-based answer extraction, higher concurrency):
python run.py --model-name "YourModel" \
--model-urls "host1:port1 host2:port2" \
--use-llm-extraction \
--max-concurrency 256
# OpenAI: Upload, Submit, Retrieve
python run_batch.py --provider openai --action upload --model gpt-4-vision
python run_batch.py --provider openai --action submit --model gpt-4-vision
python run_batch.py --provider openai --action retrieve --model gpt-4-vision
# Claude: Different models and modes
python run_batch.py --provider claude --model claude-3-5-haiku-20241022 --mode l # Language only
python run_batch.py --provider claude --model claude-3-5-sonnet-20241022 --mode v # Vision only
To generate the SEAM benchmark dataset manually, run the following scripts from the chess-bench/code/ directory:
cd chess-bench/code/
# Generate Chemistry tasks
python dataset_chem.py
# Generate Chess tasks
python dataset_chess.py
# Generate Graph Theory tasks
python dataset_graph.py
# Generate Music tasks
python dataset_music.py
Each script will generate task-specific data, images, and question files in the chess-bench/data/benchmark/ directory. You can also directly download the pre-generated dataset from this link and unzip it under that directory.
# Basic plots (domains and heatmap)
./plot.sh
# Advanced comparison plots
python3 plot_comparison.py --plot-type all
# Specific plot types
python3 generate_plots.py --plot-type domains
python3 plot_comparison.py --plot-type task-heatmap --models InternVL3-8B InternVL3-14B
seam-benchmark/
├── run.py                             # Main entry point for standard evaluation
├── run_batch.py                       # Batch API runner for OpenAI/Claude
├── eval_model.sh                      # Shell evaluation script
├── config.py                          # Configuration settings
├── src/
│   ├── core/
│   │   ├── eval_pipeline.py           # Core evaluation pipeline
│   │   ├── task_loader.py             # Task loading utilities
│   │   └── vlm.py                     # VLM interfaces
│   ├── utils/
│   │   ├── util.py                    # VLM completion utilities
│   │   └── openai_util.py             # OpenAI-style API utilities
│   └── visualization/
│       ├── generate_plots.py          # Basic plotting
│       ├── generate_combined_plots.py # Combined plots
│       └── generate_latex_table.py    # LaTeX table generation
├── chess-bench/                       # Dataset and generation code
│   ├── data/                          # Benchmark datasets
│   └── code/                          # Dataset generation scripts
├── results/                           # Model evaluation results
│   ├── InternVL3-8B/
│   │   ├── results.jsonl
│   │   ├── results.csv
│   │   └── stats.json
│   └── {model_name}/
└── plots/                             # Generated plots
    ├── domains.pdf
    ├── heatmap.pdf
    ├── model_comparison.pdf
    └── error_analysis.pdf
The SEAM benchmark includes 16 tasks across 4 domains (see the task-name sketch after this list):
- Chess: fork, legal, puzzle, eval
- Chemistry: carbon, hydrogen, weight, caption
- Music: notes, measures, forms, rhythm
- Graph: path_counting, path_existence, shortest_path, bfs_traversal
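The same grouping is written below as a small dictionary for quick reference. This is only an illustrative restatement of the task names above, not repository code, and the tasks_arg helper is hypothetical; the --tasks flag documented below accepts comma-separated task names.

```python
# Illustrative grouping of the 16 SEAM tasks by domain (not repository code).
SEAM_TASKS = {
    "chess":     ["fork", "legal", "puzzle", "eval"],
    "chemistry": ["carbon", "hydrogen", "weight", "caption"],
    "music":     ["notes", "measures", "forms", "rhythm"],
    "graph":     ["path_counting", "path_existence", "shortest_path", "bfs_traversal"],
}

# Hypothetical helper: build the comma-separated value expected by --tasks.
def tasks_arg(*domains: str) -> str:
    return ",".join(task for domain in domains for task in SEAM_TASKS[domain])

# tasks_arg("chess") -> "fork,legal,puzzle,eval"
```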
Each task is evaluated in 3 modalities (see the request sketch after this list):
- L (Language-only): Text-only input using standardized notations
- V (Vision-only): Image-only input with visual representations
- VL (Vision-Language): Combined text and image input
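The sketch below shows roughly how the three modes could differ at the request level, assuming an OpenAI-compatible chat API with image support. The repository's actual prompt construction may differ; `question`, `notation`, and `image_b64` are placeholders.

```python
def build_messages(mode: str, question: str, notation: str, image_b64: str):
    """Assemble chat messages for mode 'l', 'v', or 'vl' (illustrative only).

    `question` is the task prompt, `notation` the textual representation
    (e.g. a FEN or SMILES string), `image_b64` the rendered image.
    """
    image_part = {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}}

    if mode == "l":      # Language-only: question + textual notation, no image
        content = [{"type": "text", "text": f"{notation}\n\n{question}"}]
    elif mode == "v":    # Vision-only: question + image, no textual notation
        content = [image_part, {"type": "text", "text": question}]
    else:                # "vl": question + image + textual notation
        content = [image_part, {"type": "text", "text": f"{notation}\n\n{question}"}]

    return [{"role": "user", "content": content}]
```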
The pipeline uses a two-stage extraction process (sketched below):
- Regex Extraction: Fast pattern matching for common answer formats
- LLM Extraction: Fallback using GPT-4o-mini for ambiguous outputs
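Here is a minimal sketch of the two-stage idea for four-way (A-D) multiple-choice answers. The regex patterns and the extraction prompt are illustrative rather than the repository's exact ones, and `llm_extract` stands in for a call to the extraction model (e.g. GPT-4o-mini).

```python
import re

def extract_answer(model_output: str, llm_extract=None) -> str | None:
    """Return 'A'-'D', trying cheap regexes first and an LLM only as fallback."""
    patterns = [
        r"answer\s*(?:is|:)?\s*\(?([A-D])\)?",   # "The answer is (B)" / "Answer: B"
        r"^\s*\(?([A-D])\)?\s*[.):]?\s*$",       # output that is just "B" or "(B)."
    ]
    for pat in patterns:
        m = re.search(pat, model_output, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()

    if llm_extract is not None:                  # fallback: ask a small extraction LLM
        reply = llm_extract(
            "Extract the final multiple-choice answer (A, B, C, or D) "
            "from the following response. Reply with a single letter.\n\n" + model_output
        )
        m = re.search(r"[A-D]", reply.upper())
        return m.group(0) if m else None
    return None
```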
The evaluation pipeline uses two separate servers:
- Model Evaluation Server: 192.168.55.245 (for evaluating VLM models)
- Extraction Server: 192.168.55.244 (for LLM-based answer extraction)
# Model configuration
export MODEL_PORT=6000 # Port on 192.168.55.245
export EXTRACTION_PORT=6001 # Port on 192.168.55.244
# Worker configuration
export NUM_WORKERS=128
export MAX_RETRIES=3
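These variables could be consumed on the Python side roughly as follows. This is a sketch using the documented defaults, not the contents of config.py, and the /v1 suffix is an assumption based on OpenAI-compatible servers.

```python
import os

# Defaults mirror the documented values; override via the exports above.
MODEL_HOST = "192.168.55.245"
EXTRACTION_HOST = "192.168.55.244"
MODEL_PORT = int(os.environ.get("MODEL_PORT", 6000))
EXTRACTION_PORT = int(os.environ.get("EXTRACTION_PORT", 6001))
NUM_WORKERS = int(os.environ.get("NUM_WORKERS", 128))
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", 3))

# "/v1" assumes an OpenAI-compatible server; adjust if your endpoints differ.
MODEL_URL = f"http://{MODEL_HOST}:{MODEL_PORT}/v1"
EXTRACTION_URL = f"http://{EXTRACTION_HOST}:{EXTRACTION_PORT}/v1"
```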
# eval_model.sh options
--model-port PORT # Model API port (default: 6000)
--extraction-port PORT # Extraction model port (default: 6001)
--num-workers N # Number of concurrent workers (default: 128)
--use-llm-extraction # Use LLM extraction (default: false)
--tasks TASKS # Comma-separated task names (default: all)
--modes MODES # Comma-separated modes: l,v,vl (default: all)
# Plotting options
--results-dir DIR # Results directory (default: results)
--output-dir DIR # Output directory for plots (default: plots)
--models MODEL1 MODEL2 # Specific models to plot
--plot-type TYPE # Plot type: domains, heatmap, comparison, etc.
The pipeline includes configurable timeouts to handle slow requests:
- Model Request Timeout: 120 seconds (default)
  - For VLM model evaluation requests
  - Can be adjusted with the --request-timeout flag
- Extraction Timeout: 30 seconds (default)
  - For LLM answer extraction requests
  - Can be adjusted with the --extraction-timeout flag
Example timeout output:
Request timeout on attempt 1/3 (timeout=120s): The operation timed out
Retrying in 1 seconds...
The pipeline includes automatic retry logic for both timeouts and connection errors (see the sketch below):
- Exponential Backoff: Retries with delays of 1s, 2s, 4s
- Max Retries: 3 attempts (configurable)
- Error Logging: Detailed error messages with worker ID and task info
Example error output:
Worker 48 error processing task 5360 (chess_fork, mode=vl): Connection error
Connection error on attempt 1/3: Connection refused
Retrying in 1 seconds...
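A minimal sketch of this retry behavior (three attempts, 1s/2s/4s backoff, messages in the style shown above). `send_request` is a placeholder for the actual HTTP call, and the real pipeline's error handling and log format may differ.

```python
import time

MAX_RETRIES = 3

def call_with_retry(send_request, *args, **kwargs):
    """Retry a request on connection errors or timeouts with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return send_request(*args, **kwargs)
        except (ConnectionError, TimeoutError) as exc:
            if attempt == MAX_RETRIES:
                raise                      # give up after the final attempt
            delay = 2 ** (attempt - 1)     # 1s, 2s, 4s
            print(f"Connection error on attempt {attempt}/{MAX_RETRIES}: {exc}")
            print(f"Retrying in {delay} seconds...")
            time.sleep(delay)
```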
For slow models or overloaded servers, increase the timeout:
# Increase timeout to 300 seconds (5 minutes)
./eval_model.sh InternVL3-8B --request-timeout 300
# Different timeouts for model and extraction
./eval_model.sh InternVL3-8B --request-timeout 300 --extraction-timeout 60
- Concurrent Workers: Adjust --num-workers based on your system
- Batch Processing: Tasks are processed in parallel batches
- Progress Tracking: Real-time ETA helps estimate completion time
- Resource Usage: Each worker maintains its own connection pool
- Connection Errors: Check if model server is running on specified port
- Memory Issues: Reduce number of workers if OOM errors occur
- Extraction Failures: Enable --use-llm-extraction for better accuracy
- Missing Images: Ensure the chess-bench dataset is properly downloaded
# Run with detailed logging
python3 eval_pipeline.py --model InternVL3-8B --debug
# Test single task
python3 eval_pipeline.py --model InternVL3-8B --tasks fork --modes l
If you use this evaluation pipeline, please cite the original SEAM benchmark paper.