A toolkit for benchmarking AI image detection APIs against various types of AI-generated images.
This project creates a benchmark dataset of real and AI-generated images, then evaluates multiple commercial AI detection APIs to measure their effectiveness. The goal is to assess the reliability of current detection tools across different generation methods, image categories, and adversarial perturbations.
Install dependencies:

```bash
uv pip install -r requirements.txt
```

Copy the example environment file and fill in your API keys:

```bash
cp .env.example .env
```

Edit `.env` with your API credentials:
- Image generation: Grok (xAI), FAL, Gemini, Replicate
- Detection APIs: TruthScan, AIorNot, Resemble, SightEngine, etc.
- LLM APIs: OpenAI (for image validation)
- Storage: S3 bucket name and region
The SQLite database (data/images.db) is auto-initialized on first run.
Real Images (base)
→ validator.py Validate base images with LLM
→ prompt_generator.py Generate T2I prompts from base images
→ generator.py Image editing (modify real images)
→ text_to_image.py Text-to-image generation from prompts
→ adversarial.py Adversarial perturbations on generated images
→ detector.py Run images through detection APIs
→ report.py Detection rate reports, heatmaps, ROC, AUC
→ report_adversarial_stats.py Adversarial effectiveness stats
→ report_deepfake_eval.py Deepfake-Eval accuracy report
- `images` - Base/real images with metadata, category, group, validation status
- `image_refs` - S3 paths for both base and generated images
- `generated_images` - Generated image records (generator, model, base_image_id)
- `detection_results` - Detection API results (detector, ai_score, raw_response)
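The four tables above can be sketched in code. This is an illustrative sketch only: the column names not listed in the descriptions above (ids, foreign keys) are assumptions, and the real schema created by storage.py may differ.

```python
import sqlite3

# Hypothetical schema implied by the table descriptions; column names
# beyond those listed in the README are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY,
    category TEXT,
    group_id INTEGER,
    validation_status TEXT
);
CREATE TABLE IF NOT EXISTS image_refs (
    id INTEGER PRIMARY KEY,
    image_id INTEGER,
    s3_path TEXT
);
CREATE TABLE IF NOT EXISTS generated_images (
    id INTEGER PRIMARY KEY,
    base_image_id INTEGER REFERENCES images(id),
    generator TEXT,
    model TEXT
);
CREATE TABLE IF NOT EXISTS detection_results (
    id INTEGER PRIMARY KEY,
    image_id INTEGER,
    detector TEXT,
    ai_score REAL,
    raw_response TEXT
);
"""

conn = sqlite3.connect(":memory:")  # the project itself uses data/images.db
conn.executescript(SCHEMA)
```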
| Category | Description | Generation Type |
|---|---|---|
| car_add_damage | Cars with AI-added damage | Image editing |
| delivery_proof | Porches with AI-added packages | Image editing |
| product | Products with AI-added defects | Image editing |
| receipts | Receipts with AI-modified amounts | Image editing |
| construction_site | Construction site images | Text-to-image |
| food | Food images | Text-to-image |
| getty_editorial | Editorial/news photos | Text-to-image |
| time_top_100 | Iconic photographs | Text-to-image |
Images are organized into groups for different benchmark runs:
- Group 2: ~1,563 base images (varying per category)
- Group 3: 120 base images (15 per category, balanced)
| Detector | API | Score Interpretation |
|---|---|---|
| aiornot | AI or Not | ai.confidence (0-1) |
| truthscan | TruthScan | 0.5 + confidence/200 if labeled AI, 0.5 - confidence/200 if labeled real |
| resemble | Resemble AI | image_metrics.score (0-1, higher=AI) |
| sightengine | SightEngine | type.ai_generated |
| realitydefender | Reality Defender | resultsSummary.metadata.finalScore/100 |
| winston | Winston AI | ai_probability (0-1) |
| illuminarty | Illuminarty | data.probability (0-1) |
| inaza | Inaza | overall_risk_score (0-1) |
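The TruthScan row is the least obvious mapping in the table, so here is the arithmetic spelled out. This is an illustrative sketch of the normalization described above, with an assumed label value of `"ai"`; the actual response parsing in detector.py may differ.

```python
def normalize_truthscan(label: str, confidence: float) -> float:
    """Map TruthScan's (label, confidence 0-100) pair onto the shared
    0-1 scale: 0.5 + confidence/200 when the label says AI,
    0.5 - confidence/200 when it says real."""
    offset = confidence / 200.0
    return 0.5 + offset if label == "ai" else 0.5 - offset

# A 90%-confidence AI verdict lands near 0.95; a 90%-confidence
# real verdict lands near 0.05, keeping 0.5 as the uncertain midpoint.
```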
- `grok-imagine-image-beta` - Grok image editing
- `gpt-image-1.5` - GPT image editing
- `gemini-3-pro-image-preview` - Gemini image editing

- `grok-imagine-image-beta` - Grok T2I
- `gpt-image-1.5` - GPT T2I
- `gemini-3-pro-image-preview` - Gemini T2I
- `qwen-image-2512` - Qwen T2I
- `seedream-v4.5-1k` - Seedream T2I
```bash
# Validate base images using LLM (reject bad/unusable images)
python validator.py pixabay car_add_damage --limit 50
python validator.py pixabay food --batch-size 4
```

```bash
# Generate T2I prompts from base images using Gemini
python prompt_generator.py batch --category food --group 3
python prompt_generator.py batch --category construction_site --group 3 --limit 20
```

```bash
# Image editing (modify real images)
python generator.py --category car_add_damage --generator grok --group 3

# Text-to-image (from generated prompts)
python text_to_image.py --category food --generator openai-t2i --group 3

# Adversarial perturbations (on existing generated images)
python adversarial.py batch --source-generator grok --group 3
```

```bash
# Detect generated images with a specific detector
python detector.py -d resemble -t generated --group 3

# Detect base images (for false positive rates)
python detector.py -d aiornot -t base --group 3

# Limit and randomize
python detector.py -d resemble -t generated --limit 100 --random
```

```bash
# Basic detection rate report
python report.py --group 3

# With adversarial comparison
python report.py --group 3 --adversarial

# With adversarial AUC and paired analysis
python report.py --group 3 --adversarial-auc --adversarial-paired

# Heatmaps and ROC curves
python report.py --group 3 --detection-heatmap --effect-size-heatmap --aggregate-roc

# LaTeX tables for paper
python report.py --group 3 --latex
```

| File | Purpose |
|---|---|
| `config.py` | Configuration and environment loading |
| `storage.py` | Database and S3 operations |
| `detector.py` | Detection API implementations |
| `report.py` | Reporting and visualization |
| `generator.py` | Image editing generation |
| `text_to_image.py` | Text-to-image generation |
| `adversarial.py` | Adversarial perturbation pipeline |
| `prompt_generator.py` | LLM prompt generation for T2I |
| `validator.py` | LLM-based image validation |
| `check_c2pa.py` | C2PA metadata checker |
| `image_utils.py` | Shared image utilities |
All detectors normalize scores to 0-1 where:
- 0.0 = Definitely real
- 0.5 = Uncertain
- 1.0 = Definitely AI-generated
Default detection threshold is 0.5.
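Given that convention, the headline metrics follow directly from thresholding the normalized scores. A minimal sketch, assuming plain lists of scores (the actual aggregation in report.py may differ):

```python
THRESHOLD = 0.5  # default detection threshold

def detection_rate(ai_scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of AI-generated images flagged as AI (true positive rate)."""
    return sum(s >= threshold for s in ai_scores) / len(ai_scores)

def false_positive_rate(real_scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of real/base images wrongly flagged as AI."""
    return sum(s >= threshold for s in real_scores) / len(real_scores)

# Scores from a hypothetical detector run:
print(detection_rate([0.9, 0.8, 0.3, 0.6]))       # 3 of 4 flagged -> 0.75
print(false_positive_rate([0.1, 0.6, 0.2, 0.0]))  # 1 of 4 flagged -> 0.25
```

Sweeping the threshold across 0-1 instead of fixing it at 0.5 is what produces the ROC curves and AUC values in the reports.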