Production-grade MLOps pipeline for AI-generated product image detection. EfficientNet-B0, FastAPI, DVC, MLflow, Vertex AI Pipelines, Terraform, Cloud Run, Prometheus/Grafana monitoring.

AI Product Photo Detector

Production-grade MLOps system for detecting AI-generated product photos in e-commerce

Python PyTorch FastAPI Docker DVC Terraform License


Nolan Cacheux · GitHub · LinkedIn


Architecture

Live Demo

Demo — Upload, Predict & Grad-CAM Explainability

Demo Screenshots

Streamlit Web UI (Production)

Screenshots: upload interface, prediction result, Grad-CAM details, Grad-CAM heatmap.

Presentation

Explore the full project walkthrough, architecture decisions, and results in the presentation slides:

View Presentation on Google Slides

Slides: title slide, architecture, training results, conclusion.

Overview

End-to-end machine learning system that classifies product photos as real or AI-generated using an EfficientNet-B0 model with Grad-CAM explainability. The project covers the full MLOps lifecycle: from DVC-managed data pipelines and GPU training (local, Colab, or Vertex AI) to a FastAPI serving layer with authentication, rate limiting, and Prometheus monitoring. Infrastructure is provisioned with Terraform and deployed serverlessly via Docker and GitHub Actions CI/CD.

Features

| Category | Feature | Description |
| --- | --- | --- |
| ML Model | EfficientNet-B0 + Grad-CAM | Transfer learning with ImageNet weights via timm, visual heatmap explainability |
| API | FastAPI with auth & rate limiting | Single, batch (up to 10), and explain endpoints with Pydantic v2 schemas |
| Training | 3 training modes | Local (Docker/CPU), Google Colab (free T4 GPU), Vertex AI (production T4 GPU) |
| Monitoring | Prometheus + Grafana + drift | 18+ custom metrics, auto-provisioned dashboards, real-time drift detection |
| Infrastructure | Terraform + Docker + Cloud Run | Modular IaC, full Docker Compose stack, serverless deployment |
| CI/CD | GitHub Actions (5 workflows) | Lint, test, build, deploy with quality gates (accuracy ≥ 0.85, F1 ≥ 0.80) |

Training Options

Local Training: Docker-based, for development

When to use: Development, debugging, quick iterations on CPU.

Prerequisites: Python 3.11+, Docker & Docker Compose, Make

```bash
# Download the CIFAKE dataset
make data

# Train with default config
make train

# Or run the full DVC pipeline: download → validate → train
make dvc-repro

# Start MLflow UI to view experiments
make mlflow  # → http://localhost:5000
```

Training takes ~1–2 hours on CPU. Edit configs/train_config.yaml to adjust hyperparameters.
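For orientation, configs/train_config.yaml is a plain YAML file; a sketch of a plausible shape (all field names and values below are illustrative assumptions; consult the actual file for the real schema):

```yaml
# Hypothetical layout; the real configs/train_config.yaml may differ.
model:
  name: efficientnet_b0
  pretrained: true
training:
  epochs: 15
  batch_size: 64
  learning_rate: 0.001
data:
  data_dir: data/processed
  image_size: 224
```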

Google Colab: Free T4 GPU, one-click notebook

When to use: Quick experiments with free GPU, no local setup needed.

Prerequisites: Google account

The notebook (notebooks/train_colab.ipynb) handles everything automatically:

  1. Installs dependencies and clones the repository
  2. Downloads the CIFAKE dataset from HuggingFace
  3. Trains EfficientNet-B0 with progress tracking
  4. Evaluates the model and exports the checkpoint
  5. Optionally uploads the trained model to GCS

Open in Colab → set runtime to T4 GPU → Run all cells. Training takes ~20 minutes.

Vertex AI Pipeline: Production training on GCP

When to use: Production retraining, CI/CD-triggered training, reproducible GPU runs.

Prerequisites: GCP project with Vertex AI enabled, gcloud CLI configured, GCS bucket with data

```bash
# Trigger via GitHub Actions
gh workflow run model-training.yml \
  -f epochs=15 \
  -f batch_size=64 \
  -f auto_deploy=true

# Or submit directly
python -m src.training.vertex_submit --epochs 15 --batch-size 64 --sync
```

Pipeline stages: Verify Data → Build Image → GPU Training (T4) → Evaluate → Quality Gate → Deploy

The pipeline DAG is defined using KFP (Kubeflow Pipelines SDK), which is the standard Python SDK for orchestrating workflows on Google Vertex AI Pipelines.

Training takes ~25 minutes. The quality gate blocks deployment if accuracy < 0.85 or F1 < 0.80.
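The quality gate itself reduces to a pair of threshold checks; a minimal sketch of the idea (the function name and metrics-dict shape are illustrative, not the pipeline's actual code):

```python
def passes_quality_gate(metrics: dict, min_accuracy: float = 0.85,
                        min_f1: float = 0.80) -> bool:
    """Return True only if the candidate model clears both thresholds."""
    return metrics["accuracy"] >= min_accuracy and metrics["f1"] >= min_f1

# A candidate with strong accuracy but weak F1 is still blocked
print(passes_quality_gate({"accuracy": 0.91, "f1": 0.78}))  # → False
print(passes_quality_gate({"accuracy": 0.88, "f1": 0.83}))  # → True
```

In the real pipeline this decision runs as a step between Evaluate and Deploy, so a failing candidate never reaches Cloud Run.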

Quick Start

Prerequisites:

First-time GCP setup:

```bash
gcloud auth login
gcloud config set project ai-product-detector-487013
gcloud auth application-default login  # Required for DVC
```

Installation:

```bash
git clone https://github.com/nolancacheux/AI-Product-Photo-Detector.git
cd AI-Product-Photo-Detector
make dev  # Install dependencies + pre-commit hooks
```

Download the model (required before Docker):

The trained model weights (54 MB) are stored in GCS via DVC, not in the Git repository.

```bash
# Option 1: Using DVC (requires dvc-gs: pip install dvc-gs)
dvc pull models/checkpoints/best_model.pt.dvc

# Option 2: Using gcloud CLI
gcloud storage cp gs://ai-product-detector-487013-mlops-data/dvc/files/md5/0b/b8844b5c1b11d212a306590671a645 models/checkpoints/best_model.pt
```

First time? Train the model from scratch:

If no model exists in GCS yet, you need to train it first. This downloads the CIFAKE dataset (120k images) and trains EfficientNet-B0:

```bash
make dev                # Install dependencies
dvc repro               # Run full pipeline: download data → validate → train
dvc push                # Upload the trained model to GCS for the team
```

Alternatively, run each step manually:

```bash
python scripts/download_cifake.py                           # Download dataset
python -m src.data.validate --data-dir data/processed       # Validate data
python -m src.training.train --config configs/train_config.yaml  # Train model
```

The trained model will be saved to models/checkpoints/best_model.pt.
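These three steps are what dvc.yaml chains together as a pipeline. A hedged sketch of how such a file typically looks (stage names, deps, and outs below are inferred from the commands above, not copied from the repository):

```yaml
stages:
  download:
    cmd: python scripts/download_cifake.py
    outs:
      - data/processed
  validate:
    cmd: python -m src.data.validate --data-dir data/processed
    deps:
      - data/processed
  train:
    cmd: python -m src.training.train --config configs/train_config.yaml
    deps:
      - data/processed
      - configs/train_config.yaml
    outs:
      - models/checkpoints/best_model.pt
```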

Run locally:

```bash
make serve             # Local dev server → http://localhost:8000
docker compose up -d   # Full stack (production) → ports below
```

Note: make serve runs Uvicorn on port 8000 (local development). Docker Compose exposes the API on port 8080 (container/production).

| Service | URL |
| --- | --- |
| API (Docker) | http://localhost:8080 |
| Streamlit UI | http://localhost:8501 |
| MLflow | http://localhost:5000 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 (default credentials: admin / admin) |

Test:

```bash
make test  # pytest with coverage
make lint  # ruff + mypy
```

Production Deployment

The application is deployed on Google Cloud Run (serverless). Both the API and Web UI are publicly accessible:

Note: These services may be turned off to avoid unnecessary costs. This is a university project and keeping them running permanently is not required. If a link doesn't work, the service has simply been shut down.

See the full project walkthrough in our presentation slides.

| Service | URL |
| --- | --- |
| API (Production) | https://ai-product-detector-714127049161.europe-west1.run.app |
| Web UI (Production) | https://ai-product-detector-ui-714127049161.europe-west1.run.app |
| API Documentation | https://ai-product-detector-714127049161.europe-west1.run.app/docs |
| Health Check | https://ai-product-detector-714127049161.europe-west1.run.app/health |
| Metrics (Prometheus) | https://ai-product-detector-714127049161.europe-west1.run.app/metrics |

Production Screenshots

Cloud Run Services

Cloud Run Metrics

Swagger API Documentation

API Health Check

API Reference

Endpoints

| Method | Endpoint | Description | Rate Limit |
| --- | --- | --- | --- |
| POST | `/predict` | Single image classification | 30/min |
| POST | `/predict/batch` | Batch classification (up to 10 images) | 5/min |
| POST | `/predict/explain` | Prediction + Grad-CAM heatmap | 10/min |
| GET | `/health` | Readiness probe (model status, uptime, drift) | - |
| GET | `/metrics` | Prometheus metrics (text format) | - |

Authentication

Authentication is optional in development and enforced in production via environment variables.

| Variable | Description |
| --- | --- |
| `API_KEYS` | Comma-separated list of valid API keys |
| `REQUIRE_AUTH` | Set to `true` to enforce authentication |

Pass the key via header: `X-API-Key: YOUR_KEY`
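Server-side, key checking should use a constant-time comparison so an attacker cannot learn key prefixes from response timing (the repository's auth module advertises exactly this property). A minimal sketch of the idea, with illustrative helper names:

```python
import hmac

def parse_api_keys(env_value: str) -> list[str]:
    """Split the comma-separated API_KEYS variable into individual keys."""
    return [k.strip() for k in env_value.split(",") if k.strip()]

def is_valid_key(candidate: str, valid_keys: list[str]) -> bool:
    """Compare in constant time via hmac.compare_digest to avoid timing leaks."""
    return any(hmac.compare_digest(candidate, key) for key in valid_keys)

keys = parse_api_keys("key-alpha, key-beta")
print(is_valid_key("key-beta", keys))   # → True
print(is_valid_key("key-gamma", keys))  # → False
```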

Response Format

```json
{
  "prediction": "ai_generated",
  "probability": 0.87,
  "confidence": "high",
  "inference_time_ms": 45.2,
  "model_version": "1.0.0"
}
```
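The `confidence` field is a coarse label derived from the probability. One plausible derivation, with purely illustrative thresholds (the service's actual cutoffs are not documented here):

```python
def confidence_bucket(probability: float) -> str:
    """Map a class probability to a coarse confidence label.
    Thresholds are assumptions; the API's real cutoffs may differ."""
    distance = abs(probability - 0.5)  # distance from the decision boundary
    if distance >= 0.35:
        return "high"
    if distance >= 0.15:
        return "medium"
    return "low"

print(confidence_bucket(0.87))  # → high
print(confidence_bucket(0.55))  # → low
```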
Monitoring

Monitoring Flow

/metrics (raw text) → Prometheus (collection & storage) → Grafana (dashboards & alerts)

The API exposes raw Prometheus metrics at /metrics. Prometheus scrapes this endpoint at regular intervals and stores the time-series data. Grafana connects to Prometheus as a datasource to render real-time dashboards and trigger alerts.
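To make the Prometheus → Grafana step concrete: a requests-per-second panel derives from successive scrapes of a monotonically increasing counter. A simplified, single-interval version of what PromQL's `rate()` computes:

```python
def counter_rate(prev_value: float, curr_value: float, interval_s: float) -> float:
    """Per-second rate between two scrapes of a monotonically increasing
    counter. Negative deltas (counter resets) are clamped to zero here;
    PromQL handles resets more carefully."""
    delta = curr_value - prev_value
    return max(delta, 0.0) / interval_s

# Two scrapes of http_requests_total taken 15 s apart
print(counter_rate(1200, 1260, 15.0))  # → 4.0 requests/sec
```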

Prometheus Metrics

All exposed at GET /metrics in Prometheus text format:

| Metric | Type | Description |
| --- | --- | --- |
| `aidetect_predictions_total` | Counter | Total predictions by status, class, confidence |
| `aidetect_prediction_latency_seconds` | Histogram | Per-prediction latency distribution |
| `aidetect_prediction_probability` | Histogram | Probability score distribution |
| `aidetect_batch_predictions_total` | Counter | Batch request count |
| `aidetect_batch_size` | Histogram | Images per batch request |
| `aidetect_model_loaded` | Gauge | Model load status (0/1) |
| `http_request_duration_seconds` | Histogram | HTTP latency by endpoint |
| `http_requests_total` | Counter | HTTP requests by method, endpoint, status |

Grafana Dashboards

Pre-configured and auto-provisioned via configs/grafana/provisioning/:

  • Request throughput - Requests/sec by endpoint
  • Latency percentiles - p50, p90, p99 per endpoint
  • Prediction distribution - Real vs AI-generated ratio over time
  • Model health - Load status, drift alerts, error rates

Default credentials: admin / admin

Dashboard Screenshots

Grafana Monitoring Dashboard

Prometheus Query UI

Prometheus Metrics Endpoint (Production)

Drift Detection

Real-time monitoring of prediction distribution shifts using a sliding window over the last 1,000 predictions. Tracks mean probability, confidence distribution, and class ratios. Configurable alert thresholds with status available at GET /drift.
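The mechanism described above can be sketched as a deque-backed monitor. The window size matches the README's 1,000-prediction description; the baseline mean and alert threshold are illustrative assumptions, not the project's configured values:

```python
from collections import deque

class DriftMonitor:
    """Sliding-window drift check over recent prediction probabilities."""

    def __init__(self, window: int = 1000, baseline_mean: float = 0.5,
                 alert_delta: float = 0.15):
        self.probs = deque(maxlen=window)  # oldest entries fall off automatically
        self.baseline_mean = baseline_mean
        self.alert_delta = alert_delta

    def record(self, probability: float) -> None:
        self.probs.append(probability)

    def status(self) -> dict:
        """Shape loosely mirrors what a GET /drift endpoint might return."""
        if not self.probs:
            return {"drift": False, "mean_probability": None}
        mean = sum(self.probs) / len(self.probs)
        return {"drift": abs(mean - self.baseline_mean) > self.alert_delta,
                "mean_probability": round(mean, 3)}

monitor = DriftMonitor(window=5)
for p in [0.9, 0.95, 0.88, 0.92, 0.91]:  # predictions skew heavily to one class
    monitor.record(p)
print(monitor.status())  # → {'drift': True, 'mean_probability': 0.912}
```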

Tech Stack

| Layer | Technologies |
| --- | --- |
| ML | PyTorch 2.0+, torchvision, timm (EfficientNet-B0), Grad-CAM |
| API | FastAPI, Uvicorn, Pydantic v2, slowapi |
| MLOps | DVC (pipelines + versioning), MLflow (experiment tracking), HuggingFace Datasets |
| Monitoring | Prometheus, Grafana, structlog (JSON logging), custom drift detection |
| Infrastructure | Docker, Docker Compose, Terraform (modular), Cloud Run, Artifact Registry |
| CI/CD | GitHub Actions (CI, CD, Model Training, PR Preview, Request Quota) |
| Cloud | Google Cloud Platform (Vertex AI, Cloud Run, GCS, Artifact Registry, Secret Manager) |

Cloud Infrastructure

GCS Bucket Structure

GCS Terraform State Backend

Project Structure
```text
AI-Product-Photo-Detector/
├── .github/workflows/
│   ├── ci.yml                        # Lint + type-check + test (3.11, 3.12) + security
│   ├── cd.yml                        # Build → push → deploy → smoke test
│   ├── model-training.yml            # Vertex AI GPU training pipeline
│   ├── pr-preview.yml                # PR preview deployments
│   └── request-quota.yml             # GCP quota increase requests
├── configs/
│   ├── grafana/                      # Dashboard definitions + provisioning
│   ├── prometheus/                   # Alerting rules
│   ├── inference_config.yaml         # API server configuration
│   ├── pipeline_config.yaml          # Vertex AI pipeline parameters
│   ├── prometheus.yml                # Prometheus scrape targets
│   └── train_config.yaml             # Training hyperparameters
├── docker/
│   ├── Dockerfile                    # Production API image (non-root)
│   ├── Dockerfile.training           # Vertex AI GPU training image
│   └── ui.Dockerfile                 # Streamlit UI image
├── docs/
│   ├── architecture.svg              # System architecture diagram
│   ├── ARCHITECTURE.md               # Design decisions
│   ├── CICD.md                       # CI/CD pipeline docs
│   ├── CONTRIBUTING.md               # Contribution guidelines
│   ├── COSTS.md                      # Cloud cost analysis
│   ├── DEPLOYMENT.md                 # Deployment guide
│   ├── INFRASTRUCTURE.md             # Infrastructure docs
│   ├── MONITORING.md                 # Monitoring guide
│   └── TRAINING.md                   # Training pipeline docs
├── notebooks/
│   └── train_colab.ipynb             # Colab notebook (free T4 GPU)
├── scripts/                          # Dataset download & sample data utilities
├── src/
│   ├── data/
│   │   └── validate.py               # Dataset validation & integrity checks
│   ├── inference/
│   │   ├── api.py                    # FastAPI application & routes
│   │   ├── auth.py                   # API key auth (HMAC, constant-time)
│   │   ├── explainer.py              # Grad-CAM heatmap generation
│   │   ├── predictor.py              # Model inference engine
│   │   ├── rate_limit.py             # Rate limiting configuration
│   │   ├── routes/
│   │   │   └── v1/                   # Versioned API endpoints
│   │   ├── schemas.py                # Pydantic request/response models
│   │   ├── shadow.py                 # Shadow model A/B testing
│   │   ├── state.py                  # Application state management
│   │   └── validation.py             # Image validation utilities
│   ├── monitoring/
│   │   ├── drift.py                  # Real-time drift detection
│   │   └── metrics.py                # Prometheus metric definitions
│   ├── pipelines/
│   │   ├── evaluate.py               # Model evaluation stage
│   │   └── training_pipeline.py      # End-to-end training orchestrator
│   ├── training/
│   │   ├── augmentation.py           # Data augmentation transforms
│   │   ├── dataset.py                # PyTorch Dataset implementation
│   │   ├── gcs.py                    # GCS upload/download helpers
│   │   ├── model.py                  # EfficientNet-B0 architecture
│   │   ├── train.py                  # Training loop with MLflow tracking
│   │   └── vertex_submit.py          # Vertex AI job submission CLI
│   ├── ui/
│   │   └── app.py                    # Streamlit web interface
│   └── utils/
│       ├── config.py                 # Settings management (Pydantic Settings)
│       ├── logger.py                 # Structured logging setup
│       └── model_loader.py           # Model loading utilities
├── terraform/
│   ├── environments/
│   │   ├── dev/                      # Development environment
│   │   └── prod/                     # Production environment
│   ├── modules/
│   │   ├── cloud-run/                # Cloud Run service module
│   │   ├── iam/                      # IAM bindings module
│   │   ├── monitoring/               # Monitoring module
│   │   ├── registry/                 # Artifact Registry module
│   │   └── storage/                  # GCS bucket module
│   ├── backend.tf                    # Terraform state backend (GCS)
│   └── versions.tf                   # Provider version constraints
├── tests/
│   ├── load/                         # Locust + k6 load tests
│   ├── conftest.py                   # Shared test fixtures
│   └── test_*.py                     # 28+ test modules (API, auth, model, training, ...)
├── docker-compose.yml                # Full stack: API + UI + MLflow + Prometheus + Grafana
├── dvc.yaml                          # DVC pipeline: download → validate → train
├── Makefile                          # Development commands
├── pyproject.toml                    # Dependencies & tool config
└── LICENSE                           # MIT License
```

Documentation

| Document | Description |
| --- | --- |
| Architecture | System architecture and design decisions |
| Training Guide | Training pipeline documentation (all 3 modes) |
| Deployment | Deployment guide |
| Monitoring | Monitoring and observability guide |
| CI/CD | CI/CD pipeline documentation |
| Infrastructure | Infrastructure and Terraform documentation |
| Costs | Cloud cost analysis |
| Contributing | Contribution guidelines |

License

MIT License - see LICENSE for details.


Made by Nolan Cacheux
