Production-grade MLOps system for detecting AI-generated product photos in e-commerce
Nolan Cacheux · GitHub · LinkedIn
Explore the full project walkthrough, architecture decisions, and results in the presentation slides:
View Presentation on Google Slides
An end-to-end machine learning system that classifies product photos as real or AI-generated using an EfficientNet-B0 model with Grad-CAM explainability. The project covers the full MLOps lifecycle: from DVC-managed data pipelines and GPU training (local, Colab, or Vertex AI) to a FastAPI serving layer with authentication, rate limiting, and Prometheus monitoring. Infrastructure is provisioned with Terraform, packaged with Docker, and deployed serverlessly through GitHub Actions CI/CD.
| Category | Feature | Description |
|---|---|---|
| ML Model | EfficientNet-B0 + Grad-CAM | Transfer learning with ImageNet weights via timm, visual heatmap explainability |
| API | FastAPI with auth & rate limiting | Single, batch (up to 10), and explain endpoints with Pydantic v2 schemas |
| Training | 3 training modes | Local (Docker/CPU), Google Colab (free T4 GPU), Vertex AI (production T4 GPU) |
| Monitoring | Prometheus + Grafana + drift | 18+ custom metrics, auto-provisioned dashboards, real-time drift detection |
| Infrastructure | Terraform + Docker + Cloud Run | Modular IaC, full Docker Compose stack, serverless deployment |
| CI/CD | GitHub Actions (5 workflows) | Lint, test, build, deploy with quality gates (accuracy ≥ 0.85, F1 ≥ 0.80) |
Local Training: Docker-based, for development
When to use: Development, debugging, quick iterations on CPU.
Prerequisites: Python 3.11+, Docker & Docker Compose, Make
```bash
# Download the CIFAKE dataset
make data

# Train with default config
make train

# Or run the full DVC pipeline: download → validate → train
make dvc-repro

# Start MLflow UI to view experiments
make mlflow   # → http://localhost:5000
```

Training takes ~1–2 hours on CPU. Edit `configs/train_config.yaml` to adjust hyperparameters.
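As a rough sketch, a config of this shape might live in `configs/train_config.yaml`. The field names and values below are illustrative guesses based on details mentioned elsewhere in this README (EfficientNet-B0 via timm, epochs, batch size), not the repository's actual schema:

```yaml
# Hypothetical shape of configs/train_config.yaml — keys are illustrative,
# not the repository's actual schema.
model:
  name: efficientnet_b0      # loaded via timm with ImageNet weights
training:
  epochs: 15
  batch_size: 64
  learning_rate: 0.001       # assumed default, adjust as needed
data:
  dir: data/processed
```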
Google Colab: Free T4 GPU, one-click notebook
When to use: Quick experiments with free GPU, no local setup needed.
Prerequisites: Google account
The notebook (notebooks/train_colab.ipynb) handles everything automatically:
- Installs dependencies and clones the repository
- Downloads the CIFAKE dataset from HuggingFace
- Trains EfficientNet-B0 with progress tracking
- Evaluates the model and exports the checkpoint
- Optionally uploads the trained model to GCS
Open in Colab → set runtime to T4 GPU → Run all cells. Training takes ~20 minutes.
Vertex AI Pipeline: Production training on GCP
When to use: Production retraining, CI/CD-triggered training, reproducible GPU runs.
Prerequisites: GCP project with Vertex AI enabled, gcloud CLI configured, GCS bucket with data
```bash
# Trigger via GitHub Actions
gh workflow run model-training.yml \
  -f epochs=15 \
  -f batch_size=64 \
  -f auto_deploy=true

# Or submit directly
python -m src.training.vertex_submit --epochs 15 --batch-size 64 --sync
```

Pipeline stages: Verify Data → Build Image → GPU Training (T4) → Evaluate → Quality Gate → Deploy
The pipeline DAG is defined using KFP (Kubeflow Pipelines SDK), which is the standard Python SDK for orchestrating workflows on Google Vertex AI Pipelines.
Training takes ~25 minutes. The quality gate blocks deployment if accuracy < 0.85 or F1 < 0.80.
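The quality gate boils down to a threshold check on the evaluation metrics. A minimal sketch (an illustration of the idea, not the pipeline's actual code; the metrics dict shape is an assumption):

```python
# Minimal sketch of the quality gate: deployment proceeds only if the
# evaluated model clears both release thresholds. Not the pipeline's
# actual code — the metrics dict shape is assumed.
ACCURACY_MIN = 0.85
F1_MIN = 0.80

def passes_quality_gate(metrics: dict) -> bool:
    """Return True only if accuracy and F1 both meet the release bar."""
    return metrics["accuracy"] >= ACCURACY_MIN and metrics["f1"] >= F1_MIN

print(passes_quality_gate({"accuracy": 0.91, "f1": 0.88}))  # True → deploy
print(passes_quality_gate({"accuracy": 0.91, "f1": 0.74}))  # False → blocked
```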
Prerequisites:
- Python 3.11+
- Docker & Docker Compose
- Make
- Google Cloud CLI (`gcloud`)
First-time GCP setup:
```bash
gcloud auth login
gcloud config set project ai-product-detector-487013
gcloud auth application-default login   # Required for DVC
```

Installation:
```bash
git clone https://github.com/nolancacheux/AI-Product-Photo-Detector.git
cd AI-Product-Photo-Detector
make dev   # Install dependencies + pre-commit hooks
```

Download the model (required before Docker):
The trained model weights (54 MB) are stored in GCS via DVC, not in the Git repository.
```bash
# Option 1: Using DVC (requires dvc-gs: pip install dvc-gs)
dvc pull models/checkpoints/best_model.pt.dvc

# Option 2: Using gcloud CLI
gcloud storage cp gs://ai-product-detector-487013-mlops-data/dvc/files/md5/0b/b8844b5c1b11d212a306590671a645 models/checkpoints/best_model.pt
```

First time? Train the model from scratch:
If no model exists in GCS yet, you need to train it first. This downloads the CIFAKE dataset (120k images) and trains EfficientNet-B0:
```bash
make dev    # Install dependencies
dvc repro   # Run full pipeline: download data → validate → train
dvc push    # Upload the trained model to GCS for the team
```

Alternatively, run each step manually:
```bash
python scripts/download_cifake.py                                 # Download dataset
python -m src.data.validate --data-dir data/processed             # Validate data
python -m src.training.train --config configs/train_config.yaml   # Train model
```

The trained model will be saved to `models/checkpoints/best_model.pt`.
Run locally:
```bash
make serve             # Local dev server → http://localhost:8000
docker compose up -d   # Full stack (production) → ports below
```

Note: `make serve` runs Uvicorn on port 8000 (local development). Docker Compose exposes the API on port 8080 (container/production).
| Service | URL |
|---|---|
| API (Docker) | http://localhost:8080 |
| Streamlit UI | http://localhost:8501 |
| MLflow | http://localhost:5000 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 (default credentials: admin / admin) |
Test:
```bash
make test   # pytest with coverage
make lint   # ruff + mypy
```

The application is deployed on Google Cloud Run (serverless). Both the API and Web UI are publicly accessible:
Note: These services may be turned off to avoid unnecessary costs. This is a university project and keeping them running permanently is not required. If a link doesn't work, the service has simply been shut down.
See the full project walkthrough in our presentation slides.
| Service | URL |
|---|---|
| API (Production) | https://ai-product-detector-714127049161.europe-west1.run.app |
| Web UI (Production) | https://ai-product-detector-ui-714127049161.europe-west1.run.app |
| API Documentation | https://ai-product-detector-714127049161.europe-west1.run.app/docs |
| Health Check | https://ai-product-detector-714127049161.europe-west1.run.app/health |
| Metrics (Prometheus) | https://ai-product-detector-714127049161.europe-west1.run.app/metrics |
API Reference
| Method | Endpoint | Description | Rate Limit |
|---|---|---|---|
| `POST` | `/predict` | Single image classification | 30/min |
| `POST` | `/predict/batch` | Batch classification (up to 10 images) | 5/min |
| `POST` | `/predict/explain` | Prediction + Grad-CAM heatmap | 10/min |
| `GET` | `/health` | Readiness probe (model status, uptime, drift) | - |
| `GET` | `/metrics` | Prometheus metrics (text format) | - |
Authentication is optional in development and enforced in production via environment variables.
| Variable | Description |
|---|---|
| `API_KEYS` | Comma-separated list of valid API keys |
| `REQUIRE_AUTH` | Set to `true` to enforce authentication |
Pass the key via the header: `X-API-Key: YOUR_KEY`
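For illustration, an authenticated `/predict` call could be assembled like this with only the Python standard library. The base URL and key are placeholders, and the multipart field name `file` is an assumption about the endpoint's schema:

```python
# Sketch of building an authenticated /predict request with urllib.
# BASE_URL and the API key are placeholders; the multipart field name
# "file" is an assumption, not confirmed by the API schema.
import urllib.request

BASE_URL = "http://localhost:8080"

def build_predict_request(image_bytes: bytes, api_key: str) -> urllib.request.Request:
    boundary = "----aidetect"
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="photo.jpg"\r\n'
        "Content-Type: image/jpeg\r\n\r\n"
    ).encode() + image_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{BASE_URL}/predict",
        data=body,
        headers={
            "X-API-Key": api_key,  # auth header from the table above
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )

req = build_predict_request(b"\xff\xd8fake-jpeg-bytes", "YOUR_KEY")
# Send with urllib.request.urlopen(req) once the service is running.
```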
Example response:

```json
{
  "prediction": "ai_generated",
  "probability": 0.87,
  "confidence": "high",
  "inference_time_ms": 45.2,
  "model_version": "1.0.0"
}
```

Monitoring
/metrics (raw text) → Prometheus (collection & storage) → Grafana (dashboards & alerts)
The API exposes raw Prometheus metrics at /metrics. Prometheus scrapes this endpoint at regular intervals and stores the time-series data. Grafana connects to Prometheus as a datasource to render real-time dashboards and trigger alerts.
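The scrape side of that flow amounts to a few lines of Prometheus configuration. The sketch below is in the spirit of `configs/prometheus.yml` but is not its verbatim contents; the job name and interval are assumptions:

```yaml
# Illustrative Prometheus scrape config — job name and interval are
# assumptions, not the repository's actual configs/prometheus.yml.
scrape_configs:
  - job_name: aidetect-api           # hypothetical job name
    metrics_path: /metrics           # endpoint exposed by the FastAPI app
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]  # Docker Compose API port
```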
All exposed at GET /metrics in Prometheus text format:
| Metric | Type | Description |
|---|---|---|
| `aidetect_predictions_total` | Counter | Total predictions by status, class, confidence |
| `aidetect_prediction_latency_seconds` | Histogram | Per-prediction latency distribution |
| `aidetect_prediction_probability` | Histogram | Probability score distribution |
| `aidetect_batch_predictions_total` | Counter | Batch request count |
| `aidetect_batch_size` | Histogram | Images per batch request |
| `aidetect_model_loaded` | Gauge | Model load status (0/1) |
| `http_request_duration_seconds` | Histogram | HTTP latency by endpoint |
| `http_requests_total` | Counter | HTTP requests by method, endpoint, status |
Pre-configured and auto-provisioned via configs/grafana/provisioning/:
- Request throughput - Requests/sec by endpoint
- Latency percentiles - p50, p90, p99 per endpoint
- Prediction distribution - Real vs AI-generated ratio over time
- Model health - Load status, drift alerts, error rates
Default credentials: admin / admin
Real-time monitoring of prediction distribution shifts using a sliding window over the last 1,000 predictions. Tracks mean probability, confidence distribution, and class ratios. Configurable alert thresholds with status available at GET /drift.
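The sliding-window idea can be sketched in a few lines. This is an illustration of the approach, not `src/monitoring/drift.py` itself; the baseline mean and alert threshold are made-up numbers:

```python
# Sketch of sliding-window drift detection: keep the last 1,000
# prediction probabilities and flag drift when the window mean moves
# too far from a reference baseline. Not the project's actual code.
from collections import deque

WINDOW = 1000

class DriftMonitor:
    def __init__(self, baseline_mean: float, threshold: float = 0.15):
        self.window = deque(maxlen=WINDOW)  # oldest predictions fall off
        self.baseline_mean = baseline_mean
        self.threshold = threshold

    def record(self, probability: float) -> None:
        self.window.append(probability)

    def drifting(self) -> bool:
        if not self.window:
            return False
        mean = sum(self.window) / len(self.window)
        return abs(mean - self.baseline_mean) > self.threshold

monitor = DriftMonitor(baseline_mean=0.5)
for _ in range(500):
    monitor.record(0.9)       # a burst of high "ai_generated" scores
print(monitor.drifting())     # → True: window mean 0.9 vs baseline 0.5
```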
| Layer | Technologies |
|---|---|
| ML | PyTorch 2.0+, torchvision, timm (EfficientNet-B0), Grad-CAM |
| API | FastAPI, Uvicorn, Pydantic v2, slowapi |
| MLOps | DVC (pipelines + versioning), MLflow (experiment tracking), HuggingFace Datasets |
| Monitoring | Prometheus, Grafana, structlog (JSON logging), custom drift detection |
| Infrastructure | Docker, Docker Compose, Terraform (modular), Cloud Run, Artifact Registry |
| CI/CD | GitHub Actions (CI, CD, Model Training, PR Preview, Request Quota) |
| Cloud | Google Cloud Platform (Vertex AI, Cloud Run, GCS, Artifact Registry, Secret Manager) |
Project Structure
```
AI-Product-Photo-Detector/
├── .github/workflows/
│   ├── ci.yml                    # Lint + type-check + test (3.11, 3.12) + security
│   ├── cd.yml                    # Build → push → deploy → smoke test
│   ├── model-training.yml        # Vertex AI GPU training pipeline
│   ├── pr-preview.yml            # PR preview deployments
│   └── request-quota.yml         # GCP quota increase requests
├── configs/
│   ├── grafana/                  # Dashboard definitions + provisioning
│   ├── prometheus/               # Alerting rules
│   ├── inference_config.yaml     # API server configuration
│   ├── pipeline_config.yaml      # Vertex AI pipeline parameters
│   ├── prometheus.yml            # Prometheus scrape targets
│   └── train_config.yaml         # Training hyperparameters
├── docker/
│   ├── Dockerfile                # Production API image (non-root)
│   ├── Dockerfile.training       # Vertex AI GPU training image
│   └── ui.Dockerfile             # Streamlit UI image
├── docs/
│   ├── architecture.svg          # System architecture diagram
│   ├── ARCHITECTURE.md           # Design decisions
│   ├── CICD.md                   # CI/CD pipeline docs
│   ├── CONTRIBUTING.md           # Contribution guidelines
│   ├── COSTS.md                  # Cloud cost analysis
│   ├── DEPLOYMENT.md             # Deployment guide
│   ├── INFRASTRUCTURE.md         # Infrastructure docs
│   ├── MONITORING.md             # Monitoring guide
│   └── TRAINING.md               # Training pipeline docs
├── notebooks/
│   └── train_colab.ipynb         # Colab notebook (free T4 GPU)
├── scripts/                      # Dataset download & sample data utilities
├── src/
│   ├── data/
│   │   └── validate.py           # Dataset validation & integrity checks
│   ├── inference/
│   │   ├── api.py                # FastAPI application & routes
│   │   ├── auth.py               # API key auth (HMAC, constant-time)
│   │   ├── explainer.py          # Grad-CAM heatmap generation
│   │   ├── predictor.py          # Model inference engine
│   │   ├── rate_limit.py         # Rate limiting configuration
│   │   ├── routes/
│   │   │   └── v1/               # Versioned API endpoints
│   │   ├── schemas.py            # Pydantic request/response models
│   │   ├── shadow.py             # Shadow model A/B testing
│   │   ├── state.py              # Application state management
│   │   └── validation.py         # Image validation utilities
│   ├── monitoring/
│   │   ├── drift.py              # Real-time drift detection
│   │   └── metrics.py            # Prometheus metric definitions
│   ├── pipelines/
│   │   ├── evaluate.py           # Model evaluation stage
│   │   └── training_pipeline.py  # End-to-end training orchestrator
│   ├── training/
│   │   ├── augmentation.py       # Data augmentation transforms
│   │   ├── dataset.py            # PyTorch Dataset implementation
│   │   ├── gcs.py                # GCS upload/download helpers
│   │   ├── model.py              # EfficientNet-B0 architecture
│   │   ├── train.py              # Training loop with MLflow tracking
│   │   └── vertex_submit.py      # Vertex AI job submission CLI
│   ├── ui/
│   │   └── app.py                # Streamlit web interface
│   └── utils/
│       ├── config.py             # Settings management (Pydantic Settings)
│       ├── logger.py             # Structured logging setup
│       └── model_loader.py       # Model loading utilities
├── terraform/
│   ├── environments/
│   │   ├── dev/                  # Development environment
│   │   └── prod/                 # Production environment
│   ├── modules/
│   │   ├── cloud-run/            # Cloud Run service module
│   │   ├── iam/                  # IAM bindings module
│   │   ├── monitoring/           # Monitoring module
│   │   ├── registry/             # Artifact Registry module
│   │   └── storage/              # GCS bucket module
│   ├── backend.tf                # Terraform state backend (GCS)
│   └── versions.tf               # Provider version constraints
├── tests/
│   ├── load/                     # Locust + k6 load tests
│   ├── conftest.py               # Shared test fixtures
│   └── test_*.py                 # 28+ test modules (API, auth, model, training, ...)
├── docker-compose.yml            # Full stack: API + UI + MLflow + Prometheus + Grafana
├── dvc.yaml                      # DVC pipeline: download → validate → train
├── Makefile                      # Development commands
├── pyproject.toml                # Dependencies & tool config
└── LICENSE                       # MIT License
```
| Document | Description |
|---|---|
| Architecture | System architecture and design decisions |
| Training Guide | Training pipeline documentation (all 3 modes) |
| Deployment | Deployment guide |
| Monitoring | Monitoring and observability guide |
| CI/CD | CI/CD pipeline documentation |
| Infrastructure | Infrastructure and Terraform documentation |
| Costs | Cloud cost analysis |
| Contributing | Contribution guidelines |
MIT License - see LICENSE for details.
Made by Nolan Cacheux