High-performance evaluation pipeline for SEAM, a benchmark for Vision-Language Models.
SEAM addresses fundamental limitations of existing benchmarks by pairing distinct notation systems with semantically equivalent content across modalities. It leverages domain-specific standardized representations in:
- Chess: Board images vs. FEN strings
- Chemistry: Structural diagrams vs. SMILES strings
- Music: Staff images vs. ABC notation
- Graph Theory: Node-edge diagrams vs. adjacency matrices
SEAM presents both visual-spatial and textual-symbolic representations of the same underlying problems while maintaining semantic equivalence. The benchmark comprises 16 carefully calibrated tasks, each self-contained in both modalities, for a total of 3,200 four-way multiple-choice questions.
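To make the dual representations concrete, the snippet below lists one textual-symbolic example per domain using the standard notations named above (FEN, SMILES, ABC, adjacency matrix). The dictionary itself is purely illustrative and not part of the repository; the visual-spatial counterpart of each entry would be the corresponding rendered image (board diagram, structural drawing, staff, node-edge diagram).

```python
# Illustrative only: one example per domain in its textual-symbolic notation.
EXAMPLES = {
    # Chess: the starting position as a FEN string
    "chess": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    # Chemistry: ethanol as a SMILES string
    "chemistry": "CCO",
    # Music: a C major scale fragment in ABC notation
    "music": "X:1\nM:4/4\nK:C\nCDEF GABc|",
    # Graph theory: a 3-node path graph as an adjacency matrix
    "graph": [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]],
}
```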
- High Concurrency: Producer-worker pattern with 128 concurrent workers by default (see the worker-pool sketch after this list)
- Retry Mechanism: Automatic retry with exponential backoff for connection errors
- Progress Tracking: Real-time progress bar with ETA using tqdm
- Robust Answer Extraction: LLM-based extraction with regex fallback
- Flexible Plotting: Multiple plot types for comprehensive analysis
- Model-Specific Organization: Results saved in results/{model_name}/ directories
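The high-concurrency feature follows a producer-worker pattern. Below is a minimal, illustrative sketch of that pattern rather than the repository's implementation: a producer enqueues work items (e.g. task/mode pairs) and a pool of workers (128 by default) drains the queue. `evaluate_one` is a placeholder for a single model request plus answer extraction.

```python
import queue
import threading

NUM_WORKERS = 128  # default worker count

def evaluate_one(item):
    """Placeholder for one VLM request plus answer extraction."""
    ...

def worker(q: queue.Queue, results: list):
    while True:
        item = q.get()
        if item is None:           # sentinel: no more work for this worker
            q.task_done()
            break
        try:
            results.append(evaluate_one(item))
        finally:
            q.task_done()

def run(items):
    q, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(q, results), daemon=True)
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for item in items:             # producer: enqueue every work item
        q.put(item)
    for _ in threads:              # one sentinel per worker so all of them exit
        q.put(None)
    q.join()
    return results
```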
# Install dependencies
pip install -r requirements.txt
# Or use the setup script
./setup.sh
# Set environment variables for batch APIs (optional)
export AZURE_OPENAI_API_KEY=your_azure_key # For OpenAI batch
export AZURE_OPENAI_ENDPOINT=your_endpoint # For OpenAI batch
export ANTHROPIC_API_KEY=your_anthropic_key # For Claude batch
For models running on OpenAI-compatible API servers:
python run.py --model-name "InternVL3-14B" --model-urls "192.168.55.245:6000"
For GPT-4V and other Azure OpenAI models:
python run_batch.py --provider openai --action all --model gpt-4-vision
For Anthropic Claude models:
python run_batch.py --provider claude --model claude-3-5-sonnet-20241022
With custom options (multiple backend URLs, LLM-based answer extraction, higher concurrency):
python run.py --model-name "YourModel" \
--model-urls "host1:port1 host2:port2" \
--use-llm-extraction \
--max-concurrency 256
# OpenAI: Upload, Submit, Retrieve
python run_batch.py --provider openai --action upload --model gpt-4-vision
python run_batch.py --provider openai --action submit --model gpt-4-vision
python run_batch.py --provider openai --action retrieve --model gpt-4-vision
# Claude: Different models and modes
python run_batch.py --provider claude --model claude-3-5-haiku-20241022 --mode l # Language only
python run_batch.py --provider claude --model claude-3-5-sonnet-20241022 --mode v # Vision only
To generate the SEAM benchmark dataset manually, run the following scripts from the chess-bench/code/ directory:
cd chess-bench/code/
# Generate Chemistry tasks
python dataset_chem.py
# Generate Chess tasks
python dataset_chess.py
# Generate Graph Theory tasks
python dataset_graph.py
# Generate Music tasks
python dataset_music.py
Each script will generate task-specific data, images, and question files in the chess-bench/data/benchmark/ directory. You can also directly download the pre-generated dataset from this link and unzip it under that directory.
# Basic plots (domains and heatmap)
./plot.sh
# Advanced comparison plots
python3 plot_comparison.py --plot-type all
# Specific plot types
python3 generate_plots.py --plot-type domains
python3 plot_comparison.py --plot-type task-heatmap --models InternVL3-8B InternVL3-14B
seam-benchmark/
├── run.py                             # Main entry point for standard evaluation
├── run_batch.py                       # Batch API runner for OpenAI/Claude
├── eval_model.sh                      # Shell evaluation script
├── config.py                          # Configuration settings
├── src/
│   ├── core/
│   │   ├── eval_pipeline.py           # Core evaluation pipeline
│   │   ├── task_loader.py             # Task loading utilities
│   │   └── vlm.py                     # VLM interfaces
│   ├── utils/
│   │   ├── util.py                    # VLM completion utilities
│   │   └── openai_util.py             # OpenAI-style API utilities
│   └── visualization/
│       ├── generate_plots.py          # Basic plotting
│       ├── generate_combined_plots.py # Combined plots
│       └── generate_latex_table.py    # LaTeX table generation
├── chess-bench/                       # Dataset and generation code
│   ├── data/                          # Benchmark datasets
│   └── code/                          # Dataset generation scripts
├── results/                           # Model evaluation results
│   ├── InternVL3-8B/
│   │   ├── results.jsonl
│   │   ├── results.csv
│   │   └── stats.json
│   └── {model_name}/
└── plots/                             # Generated plots
    ├── domains.pdf
    ├── heatmap.pdf
    ├── model_comparison.pdf
    └── error_analysis.pdf
The SEAM benchmark includes 16 tasks across 4 domains (see the task-name sketch after this list):
- Chess: fork, legal, puzzle, eval
- Chemistry: carbon, hydrogen, weight, caption
- Music: notes, measures, forms, rhythm
- Graph: path_counting, path_existence, shortest_path, bfs_traversal
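The same grouping is written below as a small dictionary for quick reference. This is only an illustrative restatement of the task names above, not repository code, and the tasks_arg helper is hypothetical; the --tasks flag documented below accepts comma-separated task names.

```python
# Illustrative grouping of the 16 SEAM tasks by domain (not repository code).
SEAM_TASKS = {
    "chess":     ["fork", "legal", "puzzle", "eval"],
    "chemistry": ["carbon", "hydrogen", "weight", "caption"],
    "music":     ["notes", "measures", "forms", "rhythm"],
    "graph":     ["path_counting", "path_existence", "shortest_path", "bfs_traversal"],
}

# Hypothetical helper: build the comma-separated value expected by --tasks.
def tasks_arg(*domains: str) -> str:
    return ",".join(task for domain in domains for task in SEAM_TASKS[domain])

# tasks_arg("chess") -> "fork,legal,puzzle,eval"
```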
Each task is evaluated in 3 modalities (see the request sketch after this list):
- L (Language-only): Text-only input using standardized notations
- V (Vision-only): Image-only input with visual representations
- VL (Vision-Language): Combined text and image input
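The sketch below shows roughly how the three modes could differ at the request level, assuming an OpenAI-compatible chat API with image support. The repository's actual prompt construction may differ; `question`, `notation`, and `image_b64` are placeholders.

```python
def build_messages(mode: str, question: str, notation: str, image_b64: str):
    """Assemble chat messages for mode 'l', 'v', or 'vl' (illustrative only).

    `question` is the task prompt, `notation` the textual representation
    (e.g. a FEN or SMILES string), `image_b64` the rendered image.
    """
    image_part = {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}}

    if mode == "l":      # Language-only: question + textual notation, no image
        content = [{"type": "text", "text": f"{notation}\n\n{question}"}]
    elif mode == "v":    # Vision-only: question + image, no textual notation
        content = [image_part, {"type": "text", "text": question}]
    else:                # "vl": question + image + textual notation
        content = [image_part, {"type": "text", "text": f"{notation}\n\n{question}"}]

    return [{"role": "user", "content": content}]
```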
The pipeline uses a two-stage extraction process (sketched below):
- Regex Extraction: Fast pattern matching for common answer formats
- LLM Extraction: Fallback using GPT-4o-mini for ambiguous outputs
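Here is a minimal sketch of the two-stage idea for four-way (A-D) multiple-choice answers. The regex patterns and the extraction prompt are illustrative rather than the repository's exact ones, and `llm_extract` stands in for a call to the extraction model (e.g. GPT-4o-mini).

```python
import re

def extract_answer(model_output: str, llm_extract=None) -> str | None:
    """Return 'A'-'D', trying cheap regexes first and an LLM only as fallback."""
    patterns = [
        r"answer\s*(?:is|:)?\s*\(?([A-D])\)?",   # "The answer is (B)" / "Answer: B"
        r"^\s*\(?([A-D])\)?\s*[.):]?\s*$",       # output that is just "B" or "(B)."
    ]
    for pat in patterns:
        m = re.search(pat, model_output, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()

    if llm_extract is not None:                  # fallback: ask a small extraction LLM
        reply = llm_extract(
            "Extract the final multiple-choice answer (A, B, C, or D) "
            "from the following response. Reply with a single letter.\n\n" + model_output
        )
        m = re.search(r"[A-D]", reply.upper())
        return m.group(0) if m else None
    return None
```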
The evaluation pipeline uses two separate servers:
- Model Evaluation Server: 192.168.55.245 (for evaluating VLM models)
- Extraction Server: 192.168.55.244 (for LLM-based answer extraction)
# Model configuration
export MODEL_PORT=6000 # Port on 192.168.55.245
export EXTRACTION_PORT=6001 # Port on 192.168.55.244
# Worker configuration
export NUM_WORKERS=128
export MAX_RETRIES=3
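These variables could be consumed on the Python side roughly as follows. This is a sketch using the documented defaults, not the contents of config.py, and the /v1 suffix is an assumption based on OpenAI-compatible servers.

```python
import os

# Defaults mirror the documented values; override via the exports above.
MODEL_HOST = "192.168.55.245"
EXTRACTION_HOST = "192.168.55.244"
MODEL_PORT = int(os.environ.get("MODEL_PORT", 6000))
EXTRACTION_PORT = int(os.environ.get("EXTRACTION_PORT", 6001))
NUM_WORKERS = int(os.environ.get("NUM_WORKERS", 128))
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", 3))

# "/v1" assumes an OpenAI-compatible server; adjust if your endpoints differ.
MODEL_URL = f"http://{MODEL_HOST}:{MODEL_PORT}/v1"
EXTRACTION_URL = f"http://{EXTRACTION_HOST}:{EXTRACTION_PORT}/v1"
```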
# eval_model.sh options
--model-port PORT # Model API port (default: 6000)
--extraction-port PORT # Extraction model port (default: 6001)
--num-workers N # Number of concurrent workers (default: 128)
--use-llm-extraction # Use LLM extraction (default: false)
--tasks TASKS # Comma-separated task names (default: all)
--modes MODES # Comma-separated modes: l,v,vl (default: all)
# Plotting options
--results-dir DIR # Results directory (default: results)
--output-dir DIR # Output directory for plots (default: plots)
--models MODEL1 MODEL2 # Specific models to plot
--plot-type TYPE # Plot type: domains, heatmap, comparison, etc.
The pipeline includes configurable timeouts to handle slow requests:
- Model Request Timeout: 120 seconds (default)
  - For VLM model evaluation requests
  - Can be adjusted with the --request-timeout flag
- Extraction Timeout: 30 seconds (default)
  - For LLM answer extraction requests
  - Can be adjusted with the --extraction-timeout flag
Example timeout output:
Request timeout on attempt 1/3 (timeout=120s): The operation timed out
Retrying in 1 seconds...
The pipeline includes automatic retry logic for both timeouts and connection errors (see the sketch below):
- Exponential Backoff: Retries with delays of 1s, 2s, 4s
- Max Retries: 3 attempts (configurable)
- Error Logging: Detailed error messages with worker ID and task info
Example error output:
Worker 48 error processing task 5360 (chess_fork, mode=vl): Connection error
Connection error on attempt 1/3: Connection refused
Retrying in 1 seconds...
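A minimal sketch of this retry behavior (three attempts, 1s/2s/4s backoff, messages in the style shown above). `send_request` is a placeholder for the actual HTTP call, and the real pipeline's error handling and log format may differ.

```python
import time

MAX_RETRIES = 3

def call_with_retry(send_request, *args, **kwargs):
    """Retry a request on connection errors or timeouts with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return send_request(*args, **kwargs)
        except (ConnectionError, TimeoutError) as exc:
            if attempt == MAX_RETRIES:
                raise                      # give up after the final attempt
            delay = 2 ** (attempt - 1)     # 1s, 2s, 4s
            print(f"Connection error on attempt {attempt}/{MAX_RETRIES}: {exc}")
            print(f"Retrying in {delay} seconds...")
            time.sleep(delay)
```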
For slow models or overloaded servers, increase the timeout:
# Increase timeout to 300 seconds (5 minutes)
./eval_model.sh InternVL3-8B --request-timeout 300
# Different timeouts for model and extraction
./eval_model.sh InternVL3-8B --request-timeout 300 --extraction-timeout 60
- Concurrent Workers: Adjust --num-workers based on your system
- Batch Processing: Tasks are processed in parallel batches
- Progress Tracking: Real-time ETA helps estimate completion time
- Resource Usage: Each worker maintains its own connection pool
- Connection Errors: Check if model server is running on specified port
- Memory Issues: Reduce number of workers if OOM errors occur
- Extraction Failures: Enable --use-llm-extraction for better accuracy
- Missing Images: Ensure the chess-bench dataset is properly downloaded
# Run with detailed logging
python3 eval_pipeline.py --model InternVL3-8B --debug
# Test single task
python3 eval_pipeline.py --model InternVL3-8B --tasks fork --modes l
If you use this evaluation pipeline, please cite the original SEAM benchmark paper.