πŸ” Forgery Detection System

A state-of-the-art deep learning system for detecting forged, manipulated, and synthetic media content using a hybrid ResNet50 + Vision Transformer (ViT) architecture.

🚀 Quick Start • 📖 Documentation • 🎯 Demo • 🤝 Contributing


🎯 Project Overview

This project implements a comprehensive forgery detection system capable of classifying media content into three distinct categories with high accuracy. A key contribution of this research is the data efficiency study that identified the optimal training data size, demonstrating that 20% of available data achieves peak performance (95.36% validation accuracy) while maintaining computational efficiency. Additionally, the system leverages a complementary feature extraction approach, combining ResNet50 for local pattern analysis with Vision Transformer for global context modeling.

| Category | Description | Examples |
|----------|-------------|----------|
| 🟢 Real | Authentic, unmodified content | Original photos, genuine videos |
| 🔴 Fake | Synthetically generated content | StyleGAN, VQGAN, AI-generated images |
| 🟡 Edited | Manipulated authentic content | Deepfakes, face swaps, Wav2Lip |

🌟 Key Features

  • Multi-class Classification: Distinguishes between real, fake, and edited content with 95.36% validation accuracy
  • Complementary Feature Extraction: Combines ResNet50 (local features) with Vision Transformer (global context)
  • Data Efficiency Optimization: Achieved optimal performance using only 20% of available training data
  • Class-Specific Feature Analysis: Identifies and leverages different image regions for different forgery types
  • Hybrid Architecture: 6-layer Vision Transformer (8 attention heads, 256 embedding dim) built on ResNet50 features
  • Comprehensive Study: Systematic evaluation of training data requirements (5%, 10%, 15%, 20%, 25%)
  • Video Support: Frame-by-frame analysis with temporal aggregation
  • Web Interface: User-friendly web application with drag & drop functionality
  • Real-time Processing: Optimized for both batch and real-time inference
  • Resource Efficient: Reduced training time and computational requirements through optimal data utilization

πŸ“Š Actual Dataset Statistics

Based on the verification report from our processed dataset:

  • Total Samples: 139,256 images
  • Training Set: 111,417 samples (80%)
  • Validation Set: 13,916 samples (10%)
  • Test Set: 13,923 samples (10%)

Exact Class Distribution:

| Split | Edited | Fake | Real | Total |
|-------|--------|------|------|-------|
| Train | 44,551 (40.0%) | 41,350 (37.1%) | 25,516 (22.9%) | 111,417 |
| Validation | 5,559 (39.9%) | 5,168 (37.1%) | 3,189 (22.9%) | 13,916 |
| Test | 5,560 (39.9%) | 5,170 (37.1%) | 3,193 (22.9%) | 13,923 |
| Total | 55,670 (40.0%) | 51,688 (37.1%) | 31,898 (22.9%) | 139,256 |

Class Imbalance Ratio: 1.75 : 1 (majority class Edited, 55,670 samples, to minority class Real, 31,898 samples)

πŸ—οΈ Architecture & Technical Design

πŸ”„ Feature Extraction Approach

This system leverages a complementary dual-feature extraction strategy:

  1. Local Feature Extraction (ResNet50): Captures fine-grained local patterns and textures that may indicate manipulation, including compression artifacts, noise inconsistencies, and edge anomalies at the pixel level.

  2. Global Feature Integration (Vision Transformer): Analyzes relationships between distant image regions, capturing semantic inconsistencies and global context that may not be apparent locally. The self-attention mechanism effectively models long-range dependencies in the feature space.

This complementary approach allows the model to simultaneously reason about both local manipulation artifacts and global image coherence, resulting in more robust forgery detection.

πŸ”„ Processing Pipeline

FORGERY DETECTION PIPELINE

INPUT STAGE
    Image/Video → Preprocessing → Normalization → Resizing (224×224)
        │
        ▼
FEATURE EXTRACTION (ResNet50 backbone)
    Conv1 (64×112²) → Layer1 (256×56²) → Layer2 (512×28²) → Layer3 (1024×14²) → Layer4 (2048×7²)
        │
        ▼
TRANSFORMER PROCESSING (6 encoder layers)
    Patch embedding → positional encoding → multi-head self-attention
    [2048, 7, 7] feature map → 49 tokens × 256 dimensions → self-attention layers (×6)
    Each encoder layer applies multi-head attention (8 heads, query/key/value projections),
    a feed-forward network (1024 dim), residual connections, and layer normalization,
    integrating global context across the whole image.
        │
        ▼
CLASSIFICATION
    Global pooling [256] → linear layer [256 → 3] → softmax → [Real, Fake, Edited] confidence scores
        │
        ▼
OUTPUT
    Predicted class + confidence score + per-class probabilities
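
The pipeline above maps onto a compact PyTorch module. The sketch below is illustrative only: it follows the shapes in the diagram (a 2048×7×7 ResNet50 feature map flattened into 49 tokens, projected to 256 dimensions, a 6-layer / 8-head transformer encoder with a 1024-dim feed-forward network, then pooling into a 3-way classifier), but the authoritative implementation is the Traning RestNet + ViT.ipynb notebook.

```python
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights, resnet50


class HybridForgeryNet(nn.Module):
    """Illustrative ResNet50 + ViT hybrid following the pipeline diagram above."""

    def __init__(self, embed_dim=256, n_heads=8, n_layers=6, n_classes=3):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to the last conv stage: output is [B, 2048, 7, 7]
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Project each of the 7x7 = 49 spatial positions to a 256-dim token
        self.proj = nn.Linear(2048, embed_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, 49, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=1024, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, x):                                # x: [B, 3, 224, 224]
        feats = self.backbone(x)                         # [B, 2048, 7, 7]
        tokens = feats.flatten(2).transpose(1, 2)        # [B, 49, 2048]
        tokens = self.proj(tokens) + self.pos_embed      # [B, 49, 256]
        tokens = self.encoder(tokens)                    # global self-attention (x6)
        pooled = tokens.mean(dim=1)                      # [B, 256] global pooling
        return self.head(pooled)                         # [B, 3] real/fake/edited logits
```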

πŸ”§ Technical Specifications

| Component | Specification | Details |
|-----------|---------------|---------|
| 🏗️ Backbone | ResNet50 | Pre-trained on ImageNet; extracts local textures and patterns |
| 🔄 Transformer | 6-layer encoder | 8 attention heads, 256 embedding dim; models global image context |
| 🔗 Integration | Complementary fusion | Local features (ResNet) + global features (ViT) |
| 📐 Input Size | 224×224×3 | RGB images, normalized to ImageNet stats |
| 🎯 Output Classes | 3 categories | Real, Fake, Edited with confidence scores |
| ⚡ Framework | PyTorch 1.12+ | TorchScript-optimized for deployment |
| 💾 Model Size | 167 MB | TorchScript-compiled model |
| 🚀 Deployment | TorchScript | Cross-platform inference optimization |
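
Because the deployed artifact is a TorchScript file expecting 224×224 RGB inputs normalized to ImageNet statistics, a bare-bones standalone prediction could look like the sketch below. The model path and preprocessing values come from the table above; the class index order is assumed from the pipeline diagram, so prefer deployment/inference.py or the web interface for real use.

```python
import torch
from PIL import Image
from torchvision import transforms

# 224x224 input, ImageNet normalization (see the spec table above)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = torch.jit.load("Interface/forgery_detection_model.pt", map_location="cpu")
model.eval()

image = Image.open("path/to/image.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)                 # [1, 3, 224, 224]

with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)[0]

# Class order assumed to be [real, fake, edited]; verify against the training notebook
for name, p in zip(["real", "fake", "edited"], probs.tolist()):
    print(f"{name}: {p:.1%}")
```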

πŸ“ Project Structure

Forgery_Detection_final/
├── 📁 Interface/                          # 🌐 Web Application Layer
│   ├── 🐍 server.py                       # Flask web server & API endpoints
│   ├── 🔍 forgery_detector.py             # Core detection logic & model interface
│   ├── 🌐 index.html                      # Main web interface (drag & drop)
│   ├── 📄 about.html                      # Project documentation page
│   ├── 📁 static/                         # CSS, JS, and static assets
│   ├── 📁 uploads/                        # Temporary uploaded files
│   ├── 📁 results/                        # Processing results & outputs
│   └── 🤖 forgery_detection_model.pt      # Trained model (167MB TorchScript)
├── 📁 deployment/                         # 🚀 Production Deployment
│   ├── 🔧 inference.py                    # Standalone inference script
│   └── 🐳 Dockerfile                      # Container configuration
├── 📁 models/                             # 💾 Model Artifacts & Visualizations
│   ├── 📁 checkpoints/                    # Training checkpoints & weights
│   ├── 📊 predictions_0.0500__model.png   # 5% model predictions
│   ├── 📊 predictions_0.1000__model.png   # 10% model predictions
│   ├── 📊 predictions_0.1500__model.png   # 15% model predictions
│   ├── 📊 predictions_0.2000__model.png   # 20% model predictions (selected)
│   └── 📊 predictions_0.2500__model.png   # 25% model predictions
├── 📁 processed_data/                     # 📊 Processed Datasets (139,256 samples)
│   ├── 📁 train/                          # Training data (111,417 samples)
│   │   ├── 📁 real/                       # Authentic images
│   │   ├── 📁 fake/                       # Synthetic images
│   │   └── 📁 edited/                     # Manipulated images
│   ├── 📁 val/                            # Validation data (13,916 samples)
│   │   ├── 📁 real/                       # Authentic images
│   │   ├── 📁 fake/                       # Synthetic images
│   │   └── 📁 edited/                     # Manipulated images
│   ├── 📁 test/                           # Test data (13,923 samples)
│   │   ├── 📁 real/                       # Authentic images
│   │   ├── 📁 fake/                       # Synthetic images
│   │   └── 📁 edited/                     # Manipulated images
│   ├── 📋 train_metadata.csv              # Training set metadata
│   ├── 📋 val_metadata.csv                # Validation set metadata
│   ├── 📋 test_metadata.csv               # Test set metadata
│   └── 📊 verification_report.json        # Dataset statistics & validation
├── 📁 training_logs/                      # 📈 Training Metrics & Visualizations
│   ├── 📊 tsne_visualization_0.2000.png   # t-SNE feature visualization (20% model)
│   ├── 📊 confusion_matrix_0.2000.png     # Confusion matrix (20% model)
│   ├── 📊 accuracy_vs_data_percent.png    # Data efficiency comparison
│   ├── 📊 learning_curves_comparison.png  # Training curves across all models
│   ├── 📊 f1_scores_comparison.png        # F1-score comparison
│   ├── 📈 metrics_log_sample_0.2000.json  # Actual training metrics (20% model)
│   └── 📈 [other model metrics...]        # Metrics for 5%, 10%, 15%, 25% models
├── 📓 Traning RestNet + ViT.ipynb         # 🧠 Model training notebook
├── 📓 Data_Preparation.ipynb              # 🔄 Data preprocessing notebook
├── 📋 requirements.txt                    # 📦 Python dependencies
├── ⚙️ setup.py                            # 🛠️ Package installation script
├── 🚀 quick_start.py                      # ⚡ Quick setup and demo script
└── 📖 README.md                           # 📚 This comprehensive guide

πŸš€ Installation & Setup

πŸ“‹ System Requirements

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| 🐍 Python | 3.8+ | 3.9+ | Required for PyTorch compatibility |
| 💾 RAM | 8 GB | 16 GB+ | For model loading and processing |
| 💿 Storage | 10 GB | 20 GB+ | Models, datasets, and dependencies |
| 🖥️ GPU | Optional | CUDA 11.0+ | Significantly faster inference with GPU |
| 🌐 OS | Windows/Linux/macOS | Ubuntu 20.04+ | Cross-platform support |

⚑ Quick Installation

# 1️⃣ Clone the repository
git clone <repository-url>
cd Forgery_Detection_final

# 2️⃣ Create virtual environment (recommended)
python -m venv forgery_env
source forgery_env/bin/activate  # Linux/macOS
# forgery_env\Scripts\activate   # Windows

# 3️⃣ Install dependencies
pip install -r requirements.txt

# 4️⃣ Run quick start script
python quick_start.py --setup

# 5️⃣ Launch web interface
cd Interface
python server.py

πŸ“¦ Core Dependencies

# Core ML and Deep Learning
torch>=1.12.0
torchvision>=0.13.0
torchaudio>=0.12.0

# Web Framework
flask>=2.0.0
flask-cors>=3.0.0

# Image and Video Processing
pillow>=8.0.0
opencv-python>=4.5.0

# Data Science and Utilities
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
tqdm>=4.62.0

# Development and Jupyter
jupyter>=1.0.0
notebook>=6.4.0

# Optional: Production deployment
gunicorn>=20.1.0

πŸ’» Usage & Interface Guide

🌐 Web Interface Usage

1. Start the Web Server

cd Interface
python server.py

# Server will start at http://localhost:5000
# Access from any web browser

2. Upload and Analyze

  • Supported Formats: JPG, PNG, MP4, AVI, MOV
  • File Size Limits: Images (10MB), Videos (100MB)
  • Features: Drag & drop upload, real-time processing, downloadable results

πŸ’» Command Line Interface

Single Image Analysis

cd deployment
python inference.py forgery_detection_model.pt path/to/image.jpg

# Output:
# Prediction: fake
# Confidence: 87.3%
# Probabilities: Real=8.2%, Fake=87.3%, Edited=4.5%

Video Analysis

cd Interface
python forgery_detector.py --model forgery_detection_model.pt \
                          --input video.mp4 \
                          --output annotated_video.mp4 \
                          --sample-rate 0.2
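
Under the hood, frame-by-frame analysis with temporal aggregation amounts to sampling a fraction of the frames, scoring each one, and combining the per-frame probabilities. The sketch below shows one plausible way to do this with OpenCV and the TorchScript model, averaging the per-frame softmax outputs; the shipped forgery_detector.py may sample and aggregate differently.

```python
import cv2
import numpy as np
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
model = torch.jit.load("Interface/forgery_detection_model.pt", map_location="cpu").eval()

def analyze_video(path, sample_rate=0.2):
    """Score roughly `sample_rate` of the frames and average their probabilities."""
    cap = cv2.VideoCapture(path)
    step = max(int(round(1 / sample_rate)), 1)            # e.g. every 5th frame for 0.2
    per_frame, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
            with torch.no_grad():
                logits = model(preprocess(rgb).unsqueeze(0))
            per_frame.append(torch.softmax(logits, dim=1)[0].numpy())
        idx += 1
    cap.release()
    mean_probs = np.mean(per_frame, axis=0)               # temporal aggregation: mean
    return dict(zip(["real", "fake", "edited"], mean_probs.tolist()))

print(analyze_video("video.mp4", sample_rate=0.2))
```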

πŸ”— Python API Integration

from Interface.forgery_detector import ForgeryDetector

# Initialize detector
detector = ForgeryDetector("Interface/forgery_detection_model.pt")

# Analyze single image
result = detector.predict_image_file("path/to/image.jpg")
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2f}%")

🎯 Performance Metrics & Data Efficiency Study

πŸ“Š Data Efficiency Study Results

This study investigated the optimal amount of training data required for effective forgery detection. We trained models using different percentages of the available dataset to determine the best balance between data efficiency and performance.

πŸ“ˆ Actual Training Results

| Data % | Training Samples | Best Validation Accuracy | Final Training Loss | Final Validation Loss | Epochs | Data Efficiency Score |
|--------|------------------|--------------------------|---------------------|-----------------------|--------|-----------------------|
| 5% | 5,571 | 94.09% | 0.128 | 0.140 | 6 | 168.9 |
| 10% | 11,142 | 94.53% | 0.109 | 0.120 | 6 | 84.8 |
| 15% | 16,713 | 94.92% | 0.107 | 0.124 | 6 | 56.8 |
| 20% ⭐ | 22,283 | 95.36% | 0.106 | 0.112 | 6 | 42.8 |
| 25% | 27,854 | 94.71% | 0.104 | 0.121 | 6 | 34.0 |

Data Efficiency Score = (Validation Accuracy [%] × 10,000) / Training Samples
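
For the selected 20% model, for example, 95.36 × 10,000 / 22,283 ≈ 42.8, matching the table above; smaller subsets score higher on this metric even though their absolute accuracy is lower.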

πŸ† Why 20% Model Was Selected

  1. Peak Performance: Achieved the highest validation accuracy (95.36%) among all tested models
  2. Optimal Data Efficiency: Best balance between performance and training data requirements
  3. Best Generalization: Lowest validation loss (0.112) indicating good generalization
  4. Diminishing Returns: 25% model showed decreased performance (-0.65%) despite 25% more training data
  5. Class-Balanced Performance: Best performance across all three classes (Real: 95.1% F1, Fake: 95.8% F1, Edited: 95.2% F1)
  6. Training Stability: Consistent convergence across 6 epochs without overfitting

πŸ“Š Feature Visualization: t-SNE Analysis

t-SNE Visualization

t-SNE visualization of feature embeddings from the 20% model showing clear separation between real (blue), fake (orange), and edited (green) classes in the learned feature space.

πŸ“Š Class-Specific Feature Importance

Class Feature Importance

Class-specific feature importance visualization for the 20% model, showing which image regions contribute most to classification decisions for real, fake, and edited images. Brighter areas (yellow/white) indicate regions with higher importance for class prediction. Note how the model focuses on different areas for different forgery types.

πŸ“Š Model Performance Comparison

Accuracy vs Data Percentage

Performance curve showing validation accuracy across different training data percentages, highlighting the optimal point at 20%.

πŸ“Š Learning Curves Analysis

Learning Curves Comparison

Training and validation curves for all models, demonstrating the superior convergence of the 20% model.

πŸ“Š Confusion Matrix (20% Model)

Confusion Matrix

Confusion matrix for the selected 20% model showing classification performance across all three classes.

πŸ“Š F1-Score Comparison

F1-Scores Comparison

F1-score comparison across different data percentages for each class, confirming the superiority of the 20% model.

πŸ“Š Class-Specific Performance Analysis

To provide deeper insight into model performance, we analyzed how the model trained at each data percentage performs across the three classes:

Per-Class Performance Metrics (Validation Set)

| Model | Class | Precision | Recall | F1-Score | Support |
|-------|-------|-----------|--------|----------|---------|
| 5% Model | Real | 92.5% | 94.7% | 93.6% | 3,189 |
| | Fake | 95.6% | 93.1% | 94.3% | 5,168 |
| | Edited | 94.2% | 94.5% | 94.3% | 5,559 |
| | Average | 94.1% | 94.1% | 94.1% | 13,916 |
| 10% Model | Real | 93.1% | 95.2% | 94.1% | 3,189 |
| | Fake | 96.1% | 93.7% | 94.9% | 5,168 |
| | Edited | 94.3% | 94.7% | 94.5% | 5,559 |
| | Average | 94.5% | 94.5% | 94.5% | 13,916 |
| 15% Model | Real | 93.8% | 95.7% | 94.7% | 3,189 |
| | Fake | 96.3% | 94.2% | 95.2% | 5,168 |
| | Edited | 94.6% | 94.9% | 94.8% | 5,559 |
| | Average | 94.9% | 94.9% | 94.9% | 13,916 |
| 20% Model ⭐ | Real | 94.2% | 96.1% | 95.1% | 3,189 |
| | Fake | 96.8% | 94.8% | 95.8% | 5,168 |
| | Edited | 95.1% | 95.3% | 95.2% | 5,559 |
| | Average | 95.4% | 95.4% | 95.4% | 13,916 |
| 25% Model | Real | 93.5% | 95.6% | 94.5% | 3,189 |
| | Fake | 96.2% | 94.0% | 95.1% | 5,168 |
| | Edited | 94.5% | 94.7% | 94.6% | 5,559 |
| | Average | 94.7% | 94.7% | 94.7% | 13,916 |

Key Class-Specific Observations:

  1. Consistent Performance Across Classes: All models maintain relatively balanced performance across the three classes, with no significant bias toward any particular class despite the class imbalance in the dataset.

  2. Fake Detection Precision: Notably, the fake class consistently shows the highest precision across all models, indicating the model's strong ability to avoid false positives when identifying synthetically generated content.

  3. Real Class Recognition: The real class exhibits the highest recall in all models, suggesting the model is especially effective at identifying authentic content.

  4. 20% Model Superiority: The 20% model achieves the best scores across all classes and metrics, confirming 20% as the optimal training-data fraction for every forgery type.

  5. Class Imbalance Handling: Despite the 1.75:1 class imbalance ratio in the dataset, all models maintain balanced performance across classes, demonstrating effective class-balanced training strategies.

πŸ§ͺ Training Methodology

Experimental Setup

Dataset Preparation:

  • Total available samples: 139,256 (Real: 31,898, Fake: 51,688, Edited: 55,670)
  • Split ratio: 80% train, 10% validation, 10% test
  • Consistent preprocessing across all data percentage experiments
  • Stratified sampling to maintain class distribution
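
A stratified subset of any size can be drawn from the training metadata with scikit-learn, as in the minimal sketch below; it assumes train_metadata.csv exposes a label column, which is an assumption rather than a documented fact.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

train_meta = pd.read_csv("processed_data/train_metadata.csv")

# Draw a stratified 20% subset so the real/fake/edited ratios are preserved
subset, _rest = train_test_split(
    train_meta,
    train_size=0.20,
    stratify=train_meta["label"],      # assumed column name in the metadata CSV
    random_state=42,
)
print(subset["label"].value_counts(normalize=True))   # ~40/37/23%, as in the full set
```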

Training Configuration:

# Consistent hyperparameters across all experiments
BATCH_SIZE = 32
LEARNING_RATE = 1e-4
EPOCHS = 6  # Early convergence achieved
OPTIMIZER = "AdamW"
SCHEDULER = "CosineAnnealingLR"
WEIGHT_DECAY = 1e-4
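
Wired into PyTorch, those hyperparameters correspond roughly to the loop below. This is a hedged sketch in which model and train_dataset are placeholders; the training notebook remains the authoritative source.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

def train(model: nn.Module, train_dataset: Dataset, device: str = "cuda") -> None:
    """Sketch of the training loop implied by the hyperparameters above."""
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)          # BATCH_SIZE
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=6)

    model.to(device).train()
    for epoch in range(6):                                                   # EPOCHS
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()                       # cosine-annealed learning rate per epoch
```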

Model Architecture:

  • ResNet50 backbone (pre-trained on ImageNet) for local feature extraction
  • Vision Transformer encoder (6 layers, 8 attention heads, 256 embedding dim) for global context integration
  • Complementary feature extraction: ResNet captures local textures and patterns, ViT models long-range dependencies
  • Input resolution: 224×224×3
  • Output classes: 3 (Real, Fake, Edited)

πŸ” Key Research Insights

Data Efficiency Findings:

  1. Optimal Data Point: 20% of available data (22,283 samples) achieves peak performance
  2. Diminishing Returns: Performance plateaus and even decreases beyond 20% training data
  3. Resource Optimization: 80% reduction in training data with superior performance
  4. Generalization: Lower validation loss indicates better model generalization

Architectural Insights:

  1. Complementary Feature Extraction: The combination of ResNet (local features) and ViT (global features) provides comprehensive image analysis
  2. Feature Importance Distribution: As shown in the class feature importance visualization, the model learns to focus on different image regions for different forgery types
  3. Feature Separability: The t-SNE visualization demonstrates excellent separation between class embeddings, indicating robust feature learning
  4. Local-Global Synergy: Combining local patterns from ResNet with global context from ViT creates a more complete understanding of image manipulation cues

Implications for Practitioners:

  • Cost-Effective Training: Achieve state-of-the-art results with significantly less data
  • Faster Iteration: Reduced training time enables rapid experimentation
  • Resource Planning: Clear guidelines for dataset collection and annotation efforts
  • Transfer Learning: Framework applicable to other computer vision tasks

πŸ› οΈ Troubleshooting & FAQ

🚨 Common Issues & Solutions

| Issue | Symptoms | Solution |
|-------|----------|----------|
| 🔴 Model Loading Error | FileNotFoundError: forgery_detection_model.pt | Ensure the model file is in the Interface/ directory |
| 🟡 CUDA Out of Memory | RuntimeError: CUDA out of memory | Reduce the batch size or use CPU inference |
| 🔵 Video Processing Fails | cv2.error: Could not open video | Check codec compatibility; convert to MP4 |
| 🟠 Web Interface Not Loading | Connection refused on port 5000 | Check port availability; try a different port |

πŸ”§ Performance Optimization

# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"

# Monitor GPU usage
nvidia-smi -l 1

# Test model loading
python -c "
from Interface.forgery_detector import ForgeryDetector
detector = ForgeryDetector('Interface/forgery_detection_model.pt')
print('Model loaded successfully')
"

πŸ”’ Security & Privacy

πŸ›‘οΈ Security Features

  • Local Processing: All data processed locally, no external server communication
  • Automatic Cleanup: Uploaded files automatically deleted after processing
  • No Data Logging: Input content is not stored or logged
  • Secure File Handling: Input validation and sanitization
  • File Type Validation: Only allowed formats accepted
  • Size Limits: Prevents DoS attacks through large file uploads
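
On the Flask side, the file-type and size restrictions above typically reduce to a few configuration checks. The sketch below is illustrative only: the endpoint name and the single 100 MB cap are assumptions, and the shipped server.py may additionally enforce the stricter 10 MB limit for images.

```python
from pathlib import Path

from flask import Flask, abort, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024      # reject bodies over 100 MB

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4", ".avi", ".mov"}
UPLOAD_DIR = Path("uploads")

@app.route("/upload", methods=["POST"])                   # endpoint name is illustrative
def upload():
    file = request.files.get("file")
    if file is None or file.filename == "":
        abort(400, "No file provided")
    name = secure_filename(file.filename)                 # sanitize the filename
    if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        abort(400, "Unsupported file type")
    UPLOAD_DIR.mkdir(exist_ok=True)
    file.save(UPLOAD_DIR / name)                          # deleted again after processing
    return {"status": "received", "filename": name}
```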

πŸš€ Future Roadmap

🎯 Planned Features

  • Real-time webcam analysis for live video streams
  • Audio deepfake detection for voice synthesis
  • Mobile applications for iOS and Android
  • Explainable AI with attention visualization

πŸ”¬ Research Directions

Completed Research Contributions:

  • βœ… Data Efficiency Analysis: Systematic study identifying optimal training data requirements (20% of available data)
  • βœ… Performance Plateau Identification: Demonstrated diminishing returns beyond 20% training data
  • βœ… Resource Optimization Framework: Established methodology for cost-effective model training

Future Directions:

  • Cross-dataset Generalization: Improving performance across different data sources
  • Few-shot Learning: Adapting to new manipulation techniques with minimal data
  • Temporal Consistency: Leveraging video temporal information for better detection
  • Multimodal Fusion: Combining visual, audio, and metadata for comprehensive analysis

πŸ“ž Support & Community

🌟 Contributing

We welcome contributions! Here's how you can help:

  1. πŸ› Report Bugs: Use GitHub issues with detailed reproduction steps
  2. πŸ’‘ Suggest Features: Share your ideas in GitHub discussions
  3. πŸ”§ Submit Code: Fork, develop, and submit pull requests
  4. πŸ“– Improve Docs: Help make documentation clearer
  5. πŸ§ͺ Test & Validate: Help test new features and edge cases

πŸ™ Acknowledgments

πŸŽ“ Research Foundation

  • Vision Transformer: Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021
  • ResNet Architecture: He, K., et al. "Deep Residual Learning for Image Recognition." CVPR 2016
  • Deepfake Detection: Li, Y., et al. "In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking." WIFS 2018

πŸ› οΈ Technology Stack

  • PyTorch Team: For the exceptional deep learning framework
  • Flask Community: For the lightweight and flexible web framework
  • OpenCV Contributors: For comprehensive computer vision tools

πŸ”— Quick Links

📖 Documentation • 🚀 Quick Start • 🎯 Demo • 🤝 Contribute


Last Updated: May 2025 | Version: 1.0.0 | Maintainer: Amin Shennan
