SEAM: Semantically Equivalent Across Modalities Benchmark

A high-performance pipeline for evaluating Vision-Language Models (VLMs) on the SEAM benchmark.

Overview

SEAM addresses fundamental limitations of existing benchmarks by using distinct notation systems while preserving semantic equivalence across modalities. It leverages domain-specific standardized representations in:

  • Chess: Board images vs. FEN strings
  • Chemistry: Structural diagrams vs. SMILES strings
  • Music: Staff images vs. ABC notation
  • Graph Theory: Node-edge diagrams vs. adjacency matrices

SEAM presents both visual-spatial and textual-symbolic representations while maintaining semantic equivalence. The benchmark comprises 16 carefully calibrated tasks, each designed to be self-contained in both modalities, for a total of 3,200 four-way multiple-choice questions.
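
For illustration, the textual counterparts used by the four domains look like the following (values chosen here purely as examples; they are not drawn from the dataset):

# Purely illustrative notation examples, one per domain.
chess_fen   = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"  # starting position
chem_smiles = "CCO"                                                        # ethanol
music_abc   = "C D E F G A B c"                                            # one octave of C major
graph_adj   = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]                            # triangle graph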

Features

  • High Concurrency: Producer-worker pattern with 128 concurrent workers by default
  • Retry Mechanism: Automatic retry with exponential backoff for connection errors
  • Progress Tracking: Real-time progress bar with ETA using tqdm
  • Robust Answer Extraction: LLM-based extraction with regex fallback
  • Flexible Plotting: Multiple plot types for comprehensive analysis
  • Model-Specific Organization: Results saved in results/{model_name}/ directories

Setup

# Install dependencies
pip install -r requirements.txt

# Or use the setup script
./setup.sh

# Set environment variables for batch APIs (optional)
export AZURE_OPENAI_API_KEY=your_azure_key      # For OpenAI batch
export AZURE_OPENAI_ENDPOINT=your_endpoint      # For OpenAI batch
export ANTHROPIC_API_KEY=your_anthropic_key     # For Claude batch

Quick Start

Three Evaluation Pipelines

1. Standard Evaluation (OpenAI-API Compatible Servers)

For models running on OpenAI-compatible API servers:

python run.py --model-name "InternVL3-14B" --model-urls "192.168.55.245:6000"

2. OpenAI Azure Batch API

For GPT-4V and other Azure OpenAI models:

python run_batch.py --provider openai --action all --model gpt-4-vision

3. Claude Batch API

For Anthropic Claude models:

python run_batch.py --provider claude --model claude-3-5-sonnet-20241022

Advanced Usage

Load Balancing with Multiple Servers

python run.py --model-name "YourModel" \
              --model-urls "host1:port1 host2:port2" \
              --use-llm-extraction \
              --max-concurrency 256

Batch Processing Step-by-Step

# OpenAI: Upload, Submit, Retrieve
python run_batch.py --provider openai --action upload --model gpt-4-vision
python run_batch.py --provider openai --action submit --model gpt-4-vision  
python run_batch.py --provider openai --action retrieve --model gpt-4-vision

# Claude: Different models and modes
python run_batch.py --provider claude --model claude-3-5-haiku-20241022 --mode l  # Language only
python run_batch.py --provider claude --model claude-3-5-sonnet-20241022 --mode v # Vision only

Dataset Generation

To generate the SEAM benchmark dataset manually, run the following scripts from the chess-bench/code/ directory:

cd chess-bench/code/

# Generate Chemistry tasks
python dataset_chem.py

# Generate Chess tasks
python dataset_chess.py

# Generate Graph Theory tasks
python dataset_graph.py

# Generate Music tasks
python dataset_music.py

Each script generates task-specific data, images, and question files in the chess-bench/data/benchmark/ directory. You can also download the pre-generated dataset from this link and unzip it into that directory.

Generate Plots

# Basic plots (domains and heatmap)
./plot.sh

# Advanced comparison plots
python3 plot_comparison.py --plot-type all

# Specific plot types
python3 generate_plots.py --plot-type domains
python3 plot_comparison.py --plot-type task-heatmap --models InternVL3-8B InternVL3-14B

Directory Structure

seam-benchmark/
├── run.py               # Main entry point for standard evaluation
├── run_batch.py         # Batch API runner for OpenAI/Claude
├── eval_model.sh        # Shell evaluation script
├── config.py            # Configuration settings
├── src/
│   ├── core/
│   │   ├── eval_pipeline.py  # Core evaluation pipeline
│   │   ├── task_loader.py    # Task loading utilities
│   │   └── vlm.py            # VLM interfaces
│   ├── utils/
│   │   ├── util.py           # VLM completion utilities
│   │   └── openai_util.py   # OpenAI-style API utilities
│   └── visualization/
│       ├── generate_plots.py        # Basic plotting
│       ├── generate_combined_plots.py # Combined plots
│       └── generate_latex_table.py  # LaTeX table generation
├── chess-bench/         # Dataset and generation code
│   ├── data/            # Benchmark datasets
│   └── code/            # Dataset generation scripts
├── results/             # Model evaluation results
│   ├── InternVL3-8B/
│   │   ├── results.jsonl
│   │   ├── results.csv
│   │   └── stats.json
│   └── {model_name}/
└── plots/               # Generated plots
    ├── domains.pdf
    ├── heatmap.pdf
    ├── model_comparison.pdf
    └── error_analysis.pdf

Evaluation Details

Tasks and Domains

The SEAM benchmark includes 16 tasks across 4 domains:

  • Chess: fork, legal, puzzle, eval
  • Chemistry: carbon, hydrogen, weight, caption
  • Music: notes, measures, forms, rhythm
  • Graph: path_counting, path_existence, shortest_path, bfs_traversal

Modalities

Each task is evaluated in 3 modalities (a prompt-assembly sketch follows the list):

  • L (Language-only): Text-only input using standardized notations
  • V (Vision-only): Image-only input with visual representations
  • VL (Vision-Language): Combined text and image input
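
The sketch below shows how a request payload could be assembled for each mode, assuming an OpenAI-style chat API; the build_messages helper and its exact field layout are assumptions for illustration, not the pipeline's actual implementation.

import base64

def build_messages(question, notation, image_path, mode):
    """Illustrative assembly of user content for the l / v / vl modes."""
    content = []
    if mode in ("l", "vl"):
        # Language input: standardized notation (FEN, SMILES, ABC, adjacency matrix) plus the question.
        content.append({"type": "text", "text": f"{notation}\n\n{question}"})
    if mode in ("v", "vl"):
        # Vision input: the rendered image, base64-encoded for an OpenAI-style API.
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    if mode == "v":
        # Vision-only still carries the question text, just not the notation.
        content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]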

Answer Extraction

The pipeline uses a two-stage extraction process (a minimal sketch follows the list):

  1. Regex Extraction: Fast pattern matching for common answer formats
  2. LLM Extraction: Fallback using GPT-4o-mini for ambiguous outputs
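
A minimal sketch of this two-stage idea, assuming answers are one of the letters A-D; the regex patterns and the fallback hook are illustrative, not the exact ones used by the pipeline.

import re

def extract_answer(response_text, llm_extract=None):
    """Stage 1: regex over common answer formats; Stage 2: optional LLM fallback."""
    patterns = [
        r"[Aa]nswer\s*(?:is|:)?\s*\(?([ABCD])\)?",   # "Answer: B", "the answer is (C)"
        r"^\s*\(?([ABCD])\)?[.)]?\s*$",              # a bare letter on its own line
    ]
    for pat in patterns:
        m = re.search(pat, response_text, re.MULTILINE)
        if m:
            return m.group(1)
    # Fallback: hand the raw output to a small LLM (e.g. GPT-4o-mini) to name the chosen option.
    if llm_extract is not None:
        return llm_extract(response_text)
    return None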

Configuration

Server Configuration

The evaluation pipeline uses two separate servers:

  • Model Evaluation Server: 192.168.55.245 (serves the VLM under evaluation; see the connectivity sketch below)
  • Extraction Server: 192.168.55.244 (serves the LLM used for answer extraction)
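
As a quick connectivity check, the model server can be reached through the standard OpenAI Python client; the /v1 path and the placeholder API key below are assumptions about the serving setup (e.g. a vLLM-style server), not requirements of this repository.

from openai import OpenAI

# Assumption: the model server exposes an OpenAI-compatible endpoint under /v1.
client = OpenAI(base_url="http://192.168.55.245:6000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="InternVL3-14B",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)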

Environment Variables

# Model configuration
export MODEL_PORT=6000        # Port on 192.168.55.245
export EXTRACTION_PORT=6001   # Port on 192.168.55.244

# Worker configuration  
export NUM_WORKERS=128
export MAX_RETRIES=3
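
A hypothetical sketch of how these variables might be consumed on the Python side; the constant names are assumptions, and the real config.py may differ.

import os

# Defaults mirror the values shown above; override via the environment.
MODEL_PORT      = int(os.environ.get("MODEL_PORT", 6000))
EXTRACTION_PORT = int(os.environ.get("EXTRACTION_PORT", 6001))
NUM_WORKERS     = int(os.environ.get("NUM_WORKERS", 128))
MAX_RETRIES     = int(os.environ.get("MAX_RETRIES", 3))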

Command Line Options

# eval_model.sh options
--model-port PORT         # Model API port (default: 6000)
--extraction-port PORT    # Extraction model port (default: 6001)  
--num-workers N          # Number of concurrent workers (default: 128)
--use-llm-extraction     # Use LLM extraction (default: false)
--tasks TASKS            # Comma-separated task names (default: all)
--modes MODES            # Comma-separated modes: l,v,vl (default: all)

# Plotting options
--results-dir DIR        # Results directory (default: results)
--output-dir DIR         # Output directory for plots (default: plots)
--models MODEL1 MODEL2   # Specific models to plot
--plot-type TYPE         # Plot type: domains, heatmap, comparison, etc.

Handling Timeouts and Connection Errors

Timeout Configuration

The pipeline includes configurable timeouts to handle slow requests:

  1. Model Request Timeout: 120 seconds (default)
    • For VLM evaluation requests
    • Can be adjusted with the --request-timeout flag
  2. Extraction Timeout: 30 seconds (default)
    • For LLM answer extraction requests
    • Can be adjusted with the --extraction-timeout flag

Example timeout output:

Request timeout on attempt 1/3 (timeout=120s): The operation timed out
Retrying in 1 seconds...

Retry Logic

The pipeline includes automatic retry logic for both timeouts and connection errors:

  1. Exponential Backoff: Retries with delays of 1s, 2s, 4s
  2. Max Retries: 3 attempts (configurable)
  3. Error Logging: Detailed error messages with worker ID and task info

Example error output:

Worker 48 error processing task 5360 (chess_fork, mode=vl): Connection error
Connection error on attempt 1/3: Connection refused
Retrying in 1 seconds...
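
The behavior above boils down to a retry loop like the following sketch; the function is illustrative (it uses the requests library), with timeouts and delays matching the defaults described here.

import time
import requests

def post_with_retry(url, payload, timeout=120, max_retries=3):
    """Retry timeouts and connection errors with 1s, 2s, 4s backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return requests.post(url, json=payload, timeout=timeout)
        except (requests.Timeout, requests.ConnectionError) as exc:
            if attempt == max_retries:
                raise
            delay = 2 ** (attempt - 1)  # exponential backoff: 1s, 2s, 4s
            print(f"{type(exc).__name__} on attempt {attempt}/{max_retries}: {exc}")
            print(f"Retrying in {delay} seconds...")
            time.sleep(delay)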

Adjusting Timeouts

For slow models or overloaded servers, increase the timeout:

# Increase timeout to 300 seconds (5 minutes)
./eval_model.sh InternVL3-8B --request-timeout 300

# Different timeouts for model and extraction
./eval_model.sh InternVL3-8B --request-timeout 300 --extraction-timeout 60

Performance Optimization

  • Concurrent Workers: Adjust --num-workers based on your system (see the sketch below)
  • Batch Processing: Tasks are processed in parallel batches
  • Progress Tracking: Real-time ETA helps estimate completion time
  • Resource Usage: Each worker maintains its own connection pool
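
A minimal sketch of the producer-worker pattern described above: a shared queue is drained by a fixed pool of worker threads while tqdm tracks progress. The helper name and structure are illustrative, not the pipeline's code.

import queue
import threading
from tqdm import tqdm

def run_concurrently(tasks, process_fn, num_workers=128):
    """Drain a shared task queue with a fixed pool of worker threads."""
    q = queue.Queue()
    for task in tasks:
        q.put(task)

    results, lock = [], threading.Lock()
    pbar = tqdm(total=len(tasks))

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            result = process_fn(task)   # e.g. one model request, with its own retries
            with lock:
                results.append(result)
                pbar.update(1)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    pbar.close()
    return results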

Troubleshooting

Common Issues

  1. Connection Errors: Check if model server is running on specified port
  2. Memory Issues: Reduce number of workers if OOM errors occur
  3. Extraction Failures: Enable --use-llm-extraction for better accuracy
  4. Missing Images: Ensure chess-bench dataset is properly downloaded

Debug Mode

# Run with detailed logging
python3 eval_pipeline.py --model InternVL3-8B --debug

# Test single task
python3 eval_pipeline.py --model InternVL3-8B --tasks fork --modes l

Citation

If you use this evaluation pipeline, please cite the original SEAM benchmark paper.
