This project evaluates how well different AI models perform automated transcription of historical art photographs into structured data. It includes tools for processing images, comparing model outputs, updating ground truth, and visualizing results through an interactive web interface.
A static version for viewing on GitHub is available in the '/static' directory. Installing the project locally lets you update the ground truth files to tune model assessment.
Repository: historical-photograph-AI-annotator-benchmark
The system analyzes photographs of pre-1700 artworks, processing both front and back images to extract detailed metadata (an illustrative record is sketched after this list), including:
- Artwork details (title, artist, date, inscriptions)
- Repository information
- Dimensions
- Material information
- Historical data (exhibitions, provenance, literature)
- Photographer details
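To make the extraction target concrete, here is a hedged illustration of a structured record. The actual output schema is not shown in this README, so every field name and value below is an illustrative assumption rather than the real format:

```python
# Illustrative only: the models' real output schema is not shown in this
# README, so every field name and value here is an assumption.
example_record = {
    "artwork": {
        "title": "Madonna and Child",       # hypothetical artwork
        "artist": "Unknown Florentine",
        "date": "c. 1480",
        "inscriptions": ["verso: inv. no. 1234"],
    },
    "repository": "Example Museum",
    "dimensions": "45 x 32 cm",
    "material": "tempera on panel",
    "history": {
        "exhibitions": [],
        "provenance": [],
        "literature": [],
    },
    "photographer": "Unknown",
}
```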
The benchmark compares different AI models including:
- gpt-4o
- gpt-4o-mini
- o1
- Claude 3.5 Sonnet
- Gemini 2.5 Preview
- Python 3.8+
- Node.js 18+
- npm
- API keys for:
  - OpenAI (for the GPT-4o and o1 models)
  - Anthropic (for Claude)
- Clone the repository:
git clone git@github.com:lklic/historical-photograph-AI-annotator-benchmark.git
cd historical-photograph-AI-annotator-benchmark
- Create and set up API key files:
# For OpenAI API key
echo "your-openai-key" > key.secret
# For Anthropic API key
echo "your-anthropic-key" > claudekey.secret- Install Python dependencies:
pip install anthropic openai httpx.
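The processing scripts presumably read these key files at startup. A minimal sketch of how the API clients could be built from them (the file names match the setup commands above, but the loading code itself is an assumption, not an excerpt from process_images.py):

```python
from pathlib import Path

import anthropic
from openai import OpenAI

# Read the API keys created during setup; strip the trailing newline that
# echo adds to each file.
openai_key = Path("key.secret").read_text().strip()
anthropic_key = Path("claudekey.secret").read_text().strip()

openai_client = OpenAI(api_key=openai_key)
claude_client = anthropic.Anthropic(api_key=anthropic_key)
```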
├── analysis_script.py # Main analysis script
├── process_images.py # Image processing script
├── produce_ground_truth.py # Ground truth generation script
├── prompt.txt # Prompt template for models
├── test-images.md # List of test image IDs
├── App.jsx # React frontend application
├── package.json # Node.js dependencies
├── vite.config.js # Vite configuration
└── run-benchmark.sh # Benchmark runner script
The workflow consists of three main steps:
- Generate ground truth files:
python produce_ground_truth.py
This will create ground truth annotations for the test images using Claude 3.5 Sonnet.
- Process images with different models:
python process_images.py benchmark prompt.txt
This will process all images in test-images.md with each configured model (a sketch of this loop follows the steps below).
- Run the benchmark and start the visualization server:
./run-benchmark.sh
This will:
- Set up the web application structure
- Run the analysis comparing all model outputs
- Start a local development server
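Conceptually, the benchmark step iterates over the image IDs in test-images.md and runs every configured model on each image. A hedged sketch of that loop, with run_model as a stand-in stub since the internals of process_images.py are not shown here:

```python
import json
from pathlib import Path

# Hedged sketch of the benchmark loop; the internals of process_images.py
# are not shown in this README, and run_model below is a stand-in stub.
MODELS = ["gpt-4o", "gpt-4o-mini", "o1", "claude3.5"]  # names used in the docs

def run_model(model_name: str, image_id: str, prompt: str) -> dict:
    """Placeholder for the per-model vision API call."""
    return {"model": model_name, "image_id": image_id}

def run_benchmark(prompt_file: str = "prompt.txt") -> None:
    prompt = Path(prompt_file).read_text()
    # Assumed format: test-images.md lists one image ID per line.
    image_ids = [line.strip()
                 for line in Path("test-images.md").read_text().splitlines()
                 if line.strip()]
    for model_name in MODELS:
        for image_id in image_ids:
            result = run_model(model_name, image_id, prompt)
            out = Path("results") / model_name / f"{image_id}.json"  # assumed layout
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_text(json.dumps(result, indent=2))
```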
For processing individual images:
python process_images.py single <model> <image_id> <prompt_file>
Example:
python process_images.py single claude3.5 "32044103326807!32044156028839" prompt.txt
The analysis generates the following outputs (a sketch for inspecting them follows the list):
- Individual JSON files for each image analysis
- A comprehensive analysis.json file with comparative metrics
- A web interface for visualizing results
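The schema of analysis.json is not documented here; assuming it keys a per-model summary the way the web interface presents it (accuracy and cost), a quick inspection script might look like this, where the key names are guesses:

```python
import json
from pathlib import Path

# Assumption: analysis.json holds a per-model summary; the key names below
# ("models", "accuracy", "total_cost") are illustrative guesses, not the
# documented schema.
analysis = json.loads(Path("analysis.json").read_text())
for model_name, summary in analysis.get("models", {}).items():
    accuracy = summary.get("accuracy")
    cost = summary.get("total_cost")
    print(f"{model_name}: accuracy={accuracy}, cost=${cost}")
```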
The web interface provides:
- Model comparison summary with accuracy and cost metrics
- Detailed view of individual image analyses
- Side-by-side comparison of model outputs
- Interactive image viewer with zoom capability
Models can be configured in process_images.py (a sketch of the ProcessingConfig structure follows at the end of this section):
MODEL_CONFIGS = {
'gpt-4o': ProcessingConfig(
api_type='openai',
model='gpt-4o',
input_cost_per_million=2.5,
output_cost_per_million=10.0
),
'claude3.5': ProcessingConfig(
api_type='claude',
model='claude-3-5-sonnet-20241022',
input_cost_per_million=3.0,
output_cost_per_million=15.0
)
# ... other models
}

The analysis script (analysis_script.py) includes configurable parameters for:
- Field comparison logic
- Metrics calculation
- Output formatting
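ProcessingConfig itself is not shown in this README. A minimal sketch consistent with the fields used in MODEL_CONFIGS above, plus a cost helper, could look like the following; the dataclass body and the helper are assumptions, not the actual source:

```python
from dataclasses import dataclass

# Hedged reconstruction: the real ProcessingConfig in process_images.py is
# not shown in this README; this matches the fields used in MODEL_CONFIGS.
@dataclass
class ProcessingConfig:
    api_type: str                   # 'openai' or 'claude'
    model: str                      # provider-specific model identifier
    input_cost_per_million: float   # USD per 1M input tokens
    output_cost_per_million: float  # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Estimate the USD cost of one request from token counts."""
        return (input_tokens * self.input_cost_per_million
                + output_tokens * self.output_cost_per_million) / 1_000_000

# Example: a gpt-4o call with 1,200 input and 400 output tokens.
cfg = ProcessingConfig('openai', 'gpt-4o', 2.5, 10.0)
print(f"${cfg.cost(1200, 400):.4f}")  # $0.0070
```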