A toolkit for benchmarking AI image detection APIs against various types of AI-generated images.
This project creates a benchmark dataset of real and AI-generated images, then evaluates multiple commercial AI detection APIs to measure their effectiveness. The goal is to assess the reliability of current detection tools across different generation methods, image categories, and adversarial perturbations.
Install dependencies:

```bash
uv pip install -r requirements.txt
```

Copy the example environment file and fill in your API keys:

```bash
cp .env.example .env
```

Edit `.env` with your API credentials:
- Image generation: Grok (xAI), FAL, Gemini, Replicate
- Detection APIs: TruthScan, AIorNot, Resemble, SightEngine, etc.
- LLM APIs: OpenAI (for image validation)
- Storage: S3 bucket name and region
The SQLite database (data/images.db) is auto-initialized on first run.
Real Images (base)
→ validator.py Validate base images with LLM
→ prompt_generator.py Generate T2I prompts from base images
→ generator.py Image editing (modify real images)
→ text_to_image.py Text-to-image generation from prompts
→ adversarial.py Adversarial perturbations on generated images
→ detector.py Run images through detection APIs
→ report.py Detection rate reports, heatmaps, ROC, AUC
→ report_adversarial_stats.py Adversarial effectiveness stats
→ report_deepfake_eval.py Deepfake-Eval accuracy report
- `images` - Base/real images with metadata, category, group, validation status
- `image_refs` - S3 paths for both base and generated images
- `generated_images` - Generated image records (generator, model, base_image_id)
- `detection_results` - Detection API results (detector, ai_score, raw_response)
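The four tables above can be sketched in code. This is an illustrative sketch only: the column names not listed in the descriptions above (ids, foreign keys) are assumptions, and the real schema created by storage.py may differ.

```python
import sqlite3

# Hypothetical schema implied by the table descriptions; column names
# beyond those listed in the README are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY,
    category TEXT,
    group_id INTEGER,
    validation_status TEXT
);
CREATE TABLE IF NOT EXISTS image_refs (
    id INTEGER PRIMARY KEY,
    image_id INTEGER,
    s3_path TEXT
);
CREATE TABLE IF NOT EXISTS generated_images (
    id INTEGER PRIMARY KEY,
    base_image_id INTEGER REFERENCES images(id),
    generator TEXT,
    model TEXT
);
CREATE TABLE IF NOT EXISTS detection_results (
    id INTEGER PRIMARY KEY,
    image_id INTEGER,
    detector TEXT,
    ai_score REAL,
    raw_response TEXT
);
"""

conn = sqlite3.connect(":memory:")  # the project itself uses data/images.db
conn.executescript(SCHEMA)
```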
| Category | Description | Generation Type |
|---|---|---|
| car_add_damage | Cars with AI-added damage | Image editing |
| delivery_proof | Porches with AI-added packages | Image editing |
| product | Products with AI-added defects | Image editing |
| receipts | Receipts with AI-modified amounts | Image editing |
| construction_site | Construction site images | Text-to-image |
| food | Food images | Text-to-image |
| getty_editorial | Editorial/news photos | Text-to-image |
| time_top_100 | Iconic photographs | Text-to-image |
Images are organized into groups for different benchmark runs:
- Group 2: ~1,563 base images (varying per category)
- Group 3: 120 base images (15 per category, balanced)
| Detector | API | Score Interpretation |
|---|---|---|
| aiornot | AI or Not | ai.confidence (0-1) |
| truthscan | TruthScan | 0.5 + confidence/200 if labeled AI, 0.5 - confidence/200 if labeled real |
| resemble | Resemble AI | image_metrics.score (0-1, higher=AI) |
| sightengine | SightEngine | type.ai_generated |
| realitydefender | Reality Defender | resultsSummary.metadata.finalScore/100 |
| winston | Winston AI | ai_probability (0-1) |
| illuminarty | Illuminarty | data.probability (0-1) |
| inaza | Inaza | overall_risk_score (0-1) |
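The TruthScan row is the least obvious mapping in the table, so here is the arithmetic spelled out. This is an illustrative sketch of the normalization described above, with an assumed label value of `"ai"`; the actual response parsing in detector.py may differ.

```python
def normalize_truthscan(label: str, confidence: float) -> float:
    """Map TruthScan's (label, confidence 0-100) pair onto the shared
    0-1 scale: 0.5 + confidence/200 when the label says AI,
    0.5 - confidence/200 when it says real."""
    offset = confidence / 200.0
    return 0.5 + offset if label == "ai" else 0.5 - offset

# A 90%-confidence AI verdict lands near 0.95; a 90%-confidence
# real verdict lands near 0.05, keeping 0.5 as the uncertain midpoint.
```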
- `grok-imagine-image-beta` - Grok image editing
- `gpt-image-1.5` - GPT image editing
- `gemini-3-pro-image-preview` - Gemini image editing

- `grok-imagine-image-beta` - Grok T2I
- `gpt-image-1.5` - GPT T2I
- `gemini-3-pro-image-preview` - Gemini T2I
- `qwen-image-2512` - Qwen T2I
- `seedream-v4.5-1k` - Seedream T2I
```bash
# Validate base images using LLM (reject bad/unusable images)
python validator.py pixabay car_add_damage --limit 50
python validator.py pixabay food --batch-size 4
```

```bash
# Generate T2I prompts from base images using Gemini
python prompt_generator.py batch --category food --group 3
python prompt_generator.py batch --category construction_site --group 3 --limit 20
```

```bash
# Image editing (modify real images)
python generator.py --category car_add_damage --generator grok --group 3

# Text-to-image (from generated prompts)
python text_to_image.py --category food --generator openai-t2i --group 3

# Adversarial perturbations (on existing generated images)
python adversarial.py batch --source-generator grok --group 3
```

```bash
# Detect generated images with a specific detector
python detector.py -d resemble -t generated --group 3

# Detect base images (for false positive rates)
python detector.py -d aiornot -t base --group 3

# Limit and randomize
python detector.py -d resemble -t generated --limit 100 --random
```

```bash
# Basic detection rate report
python report.py --group 3

# With adversarial comparison
python report.py --group 3 --adversarial

# With adversarial AUC and paired analysis
python report.py --group 3 --adversarial-auc --adversarial-paired

# Heatmaps and ROC curves
python report.py --group 3 --detection-heatmap --effect-size-heatmap --aggregate-roc

# LaTeX tables for paper
python report.py --group 3 --latex
```

| File | Purpose |
|---|---|
| `config.py` | Configuration and environment loading |
| `storage.py` | Database and S3 operations |
| `detector.py` | Detection API implementations |
| `report.py` | Reporting and visualization |
| `generator.py` | Image editing generation |
| `text_to_image.py` | Text-to-image generation |
| `adversarial.py` | Adversarial perturbation pipeline |
| `prompt_generator.py` | LLM prompt generation for T2I |
| `validator.py` | LLM-based image validation |
| `check_c2pa.py` | C2PA metadata checker |
| `image_utils.py` | Shared image utilities |
All detectors normalize scores to 0-1 where:
- 0.0 = Definitely real
- 0.5 = Uncertain
- 1.0 = Definitely AI-generated
Default detection threshold is 0.5.
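Given that convention, the headline metrics follow directly from thresholding the normalized scores. A minimal sketch, assuming plain lists of scores (the actual aggregation in report.py may differ):

```python
THRESHOLD = 0.5  # default detection threshold

def detection_rate(ai_scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of AI-generated images flagged as AI (true positive rate)."""
    return sum(s >= threshold for s in ai_scores) / len(ai_scores)

def false_positive_rate(real_scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of real/base images wrongly flagged as AI."""
    return sum(s >= threshold for s in real_scores) / len(real_scores)

# Scores from a hypothetical detector run:
print(detection_rate([0.9, 0.8, 0.3, 0.6]))       # 3 of 4 flagged -> 0.75
print(false_positive_rate([0.1, 0.6, 0.2, 0.0]))  # 1 of 4 flagged -> 0.25
```

Sweeping the threshold across 0-1 instead of fixing it at 0.5 is what produces the ROC curves and AUC values in the reports.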