A state-of-the-art deep learning system for detecting forged, manipulated, and synthetic media content using a hybrid ResNet50 + Vision Transformer (ViT) architecture.
Quick Start • Documentation • Demo • Contributing
This project implements a comprehensive forgery detection system capable of classifying media content into three distinct categories with high accuracy. A key contribution of this research is the data efficiency study that identified the optimal training data size, demonstrating that 20% of available data achieves peak performance (95.36% validation accuracy) while maintaining computational efficiency. Additionally, the system leverages a complementary feature extraction approach, combining ResNet50 for local pattern analysis with Vision Transformer for global context modeling.
| Category | Description | Examples |
|---|---|---|
| Real | Authentic, unmodified content | Original photos, genuine videos |
| Fake | Synthetically generated content | StyleGAN, VQGAN, AI-generated images |
| Edited | Manipulated authentic content | Deepfakes, face swaps, Wav2Lip |
- Multi-class Classification: Distinguishes between real, fake, and edited content with 95.36% validation accuracy
- Complementary Feature Extraction: Combines ResNet50 (local features) with Vision Transformer (global context)
- Data Efficiency Optimization: Achieved optimal performance using only 20% of available training data
- Class-Specific Feature Analysis: Identifies and leverages different image regions for different forgery types
- Hybrid Architecture: 6-layer Vision Transformer (8 attention heads, 256 embedding dim) built on ResNet50 features
- Comprehensive Study: Systematic evaluation of training data requirements (5%, 10%, 15%, 20%, 25%)
- Video Support: Frame-by-frame analysis with temporal aggregation (see the sketch after this list)
- Web Interface: User-friendly web application with drag & drop functionality
- Real-time Processing: Optimized for both batch and real-time inference
- Resource Efficient: Reduced training time and computational requirements through optimal data utilization
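For video inputs, the detector samples frames and aggregates per-frame predictions into a single verdict. Below is a minimal sketch of that idea, assuming a hypothetical `predict_frame` callable that returns per-class probabilities; the repository's `forgery_detector.py` is the authoritative implementation and may aggregate differently.

```python
# Sketch of frame-by-frame analysis with temporal aggregation (assumption:
# `predict_frame(frame)` returns [p_real, p_fake, p_edited] for one frame).
import cv2
import numpy as np

def predict_video(path: str, predict_frame, sample_rate: float = 0.2):
    cap = cv2.VideoCapture(path)
    step = max(int(round(1 / sample_rate)), 1)   # sample_rate=0.2 -> every 5th frame
    probs, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            probs.append(predict_frame(frame))   # per-frame class probabilities
        idx += 1
    cap.release()
    mean_probs = np.mean(probs, axis=0)          # temporal aggregation by averaging
    classes = ["real", "fake", "edited"]
    return classes[int(np.argmax(mean_probs))], mean_probs
```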
Based on the verification report from our processed dataset:
- Total Samples: 139,256 images
- Training Set: 111,417 samples (80%)
- Validation Set: 13,916 samples (10%)
- Test Set: 13,923 samples (10%)
Exact Class Distribution:
| Split | Edited | Fake | Real | Total |
|---|---|---|---|---|
| Train | 44,551 (40.0%) | 41,350 (37.1%) | 25,516 (22.9%) | 111,417 |
| Validation | 5,559 (39.9%) | 5,168 (37.1%) | 3,189 (22.9%) | 13,916 |
| Test | 5,560 (39.9%) | 5,170 (37.1%) | 3,193 (22.9%) | 13,923 |
| Total | 55,670 (40.0%) | 51,688 (37.1%) | 31,898 (22.9%) | 139,256 |
Class Imbalance Ratio: 1.75 (Majority:Minority)
This system leverages a complementary dual-feature extraction strategy:
1. Local Feature Extraction (ResNet50): Captures fine-grained local patterns and textures that may indicate manipulation, including compression artifacts, noise inconsistencies, and edge anomalies at the pixel level.
2. Global Feature Integration (Vision Transformer): Analyzes relationships between distant image regions, capturing semantic inconsistencies and global context that may not be apparent locally. The self-attention mechanism effectively models long-range dependencies in the feature space.
This complementary approach allows the model to simultaneously reason about both local manipulation artifacts and global image coherence, resulting in more robust forgery detection.
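As a concrete illustration, a hybrid model of this shape could be written in PyTorch roughly as follows. The dimensions match the specification in this README (ResNet50 backbone, 6 encoder layers, 8 attention heads, 256-dim embeddings, 1024-dim feed-forward, 3 classes); details such as the pooling choice and the learned positional encoding are assumptions rather than the exact training code.

```python
# Minimal sketch of the hybrid ResNet50 + Vision Transformer classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridForgeryNet(nn.Module):
    def __init__(self, num_classes=3, embed_dim=256, depth=6, heads=8, ffn_dim=1024):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # -> [B, 2048, 7, 7]
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)           # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, 49, embed_dim))          # learned positional encoding
        layer = nn.TransformerEncoderLayer(embed_dim, heads, ffn_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)              # 6 self-attention layers
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                          # x: [B, 3, 224, 224]
        feats = self.proj(self.backbone(x))        # [B, 256, 7, 7]
        tokens = feats.flatten(2).transpose(1, 2)  # [B, 49, 256] token sequence
        tokens = self.encoder(tokens + self.pos)
        return self.head(tokens.mean(dim=1))       # global pooling -> class logits
```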
```
                        FORGERY DETECTION PIPELINE

INPUT STAGE
  Image/Video → Preprocessing → Normalization → Resizing (224×224)
       │
       ▼
FEATURE EXTRACTION (ResNet50 Backbone)
  Conv1 (64×112²) → Layer1 (256×56²) → Layer2 (512×28²) → Layer3 (1024×14²) → Layer4 (2048×7²)
       │
       ▼
TRANSFORMER PROCESSING
  Patch Embedding → Positional Encoding → Multi-Head Attention
  [2048, 7, 7] → [256, 49] → Self-Attention Layers (×6)
    Layer 1: Query, Key, Value Matrices
    Layer 2: Multi-Head Attention (8 heads)
    Layer 3: Feed-Forward Network (1024 dim)
    Layer 4: Residual Connections
    Layer 5: Layer Normalization
    Layer 6: Global Context Integration
       │
       ▼
CLASSIFICATION
  Global Pooling → Linear Layer → Softmax → [Real, Fake, Edited]
  [256] → [256, 3] → [3] → Confidence Scores
       │
       ▼
OUTPUT
  Prediction Class + Confidence Score + Per-Class Probabilities
```
| Component | Specification | Details |
|---|---|---|
| Backbone | ResNet50 | Pre-trained on ImageNet, extracts local textures and patterns |
| Transformer | 6-layer encoder | 8 attention heads, 256 embedding dim, models global image context |
| Integration | Complementary fusion | Local features (ResNet) + Global features (ViT) |
| Input Size | 224×224×3 | RGB images, normalized to ImageNet stats |
| Output Classes | 3 categories | Real, Fake, Edited with confidence scores |
| Framework | PyTorch 1.12+ | TorchScript optimized for deployment |
| Model Size | 167MB | TorchScript compiled model |
| Deployment | TorchScript | Cross-platform inference optimization |
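For reference, loading the compiled TorchScript model and running a single prediction might look like the sketch below. The 224×224 resize and ImageNet normalization follow the table above, while the class-index order and file paths are assumptions; see `Interface/forgery_detector.py` for the actual logic.

```python
# Sketch of TorchScript inference (assumptions: class order, file paths).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = torch.jit.load("Interface/forgery_detection_model.pt", map_location="cpu").eval()
classes = ["real", "fake", "edited"]            # assumed index order

with torch.no_grad():
    x = preprocess(Image.open("sample.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]
    probs = torch.softmax(model(x), dim=1)[0]
print(classes[int(probs.argmax())], [f"{p:.1%}" for p in probs.tolist()])
```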
```
Forgery_Detection_final/
├── Interface/                          # Web Application Layer
│   ├── server.py                       # Flask web server & API endpoints
│   ├── forgery_detector.py             # Core detection logic & model interface
│   ├── index.html                      # Main web interface (drag & drop)
│   ├── about.html                      # Project documentation page
│   ├── static/                         # CSS, JS, and static assets
│   ├── uploads/                        # Temporary uploaded files
│   ├── results/                        # Processing results & outputs
│   └── forgery_detection_model.pt      # Trained model (167MB TorchScript)
├── deployment/                         # Production Deployment
│   ├── inference.py                    # Standalone inference script
│   └── Dockerfile                      # Container configuration
├── models/                             # Model Artifacts & Visualizations
│   ├── checkpoints/                    # Training checkpoints & weights
│   ├── predictions_0.0500__model.png   # 5% model predictions
│   ├── predictions_0.1000__model.png   # 10% model predictions
│   ├── predictions_0.1500__model.png   # 15% model predictions
│   ├── predictions_0.2000__model.png   # 20% model predictions (selected)
│   └── predictions_0.2500__model.png   # 25% model predictions
├── processed_data/                     # Processed Datasets (139,256 samples)
│   ├── train/                          # Training data (111,417 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── val/                            # Validation data (13,916 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── test/                           # Test data (13,923 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── train_metadata.csv              # Training set metadata
│   ├── val_metadata.csv                # Validation set metadata
│   ├── test_metadata.csv               # Test set metadata
│   └── verification_report.json        # Dataset statistics & validation
├── training_logs/                      # Training Metrics & Visualizations
│   ├── tsne_visualization_0.2000.png   # t-SNE feature visualization (20% model)
│   ├── confusion_matrix_0.2000.png     # Confusion matrix (20% model)
│   ├── accuracy_vs_data_percent.png    # Data efficiency comparison
│   ├── learning_curves_comparison.png  # Training curves across all models
│   ├── f1_scores_comparison.png        # F1-score comparison
│   ├── metrics_log_sample_0.2000.json  # Actual training metrics (20% model)
│   └── [other model metrics...]        # Metrics for 5%, 10%, 15%, 25% models
├── Traning RestNet + ViT.ipynb         # Model training notebook
├── Data_Preparation.ipynb              # Data preprocessing notebook
├── requirements.txt                    # Python dependencies
├── setup.py                            # Package installation script
├── quick_start.py                      # Quick setup and demo script
└── README.md                           # This comprehensive guide
```
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.8+ | 3.9+ | Required for PyTorch compatibility |
| RAM | 8GB | 16GB+ | For model loading and processing |
| Storage | 10GB | 20GB+ | Models, datasets, and dependencies |
| GPU | Optional | CUDA 11.0+ | Significantly faster inference with GPU |
| OS | Windows/Linux/macOS | Linux Ubuntu 20.04+ | Cross-platform support |
# 1. Clone the repository
git clone <repository-url>
cd Forgery_Detection_final

# 2. Create virtual environment (recommended)
python -m venv forgery_env
source forgery_env/bin/activate  # Linux/macOS
# forgery_env\Scripts\activate   # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run quick start script
python quick_start.py --setup

# 5. Launch web interface
cd Interface
python server.py

# Core ML and Deep Learning
torch>=1.12.0
torchvision>=0.13.0
torchaudio>=0.12.0
# Web Framework
flask>=2.0.0
flask-cors>=3.0.0
# Image and Video Processing
pillow>=8.0.0
opencv-python>=4.5.0
# Data Science and Utilities
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
tqdm>=4.62.0
# Development and Jupyter
jupyter>=1.0.0
notebook>=6.4.0
# Optional: Production deployment
gunicorn>=20.1.0

1. Start the Web Server
cd Interface
python server.py
# Server will start at http://localhost:5000
# Access from any web browser

2. Upload and Analyze
- Supported Formats: JPG, PNG, MP4, AVI, MOV
- File Size Limits: Images (10MB), Videos (100MB)
- Features: Drag & drop upload, real-time processing, downloadable results
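If you prefer calling the server programmatically rather than through the browser, something along these lines should work; the endpoint path and response fields here are hypothetical, so check the routes defined in `Interface/server.py`.

```python
# Hypothetical API call to the local Flask server (endpoint name is an assumption).
import requests

with open("sample.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/upload", files={"file": f})
print(resp.status_code, resp.json())  # expected: prediction, confidence, probabilities
```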
Single Image Analysis
cd deployment
python inference.py forgery_detection_model.pt path/to/image.jpg
# Output:
# Prediction: fake
# Confidence: 87.3%
# Probabilities: Real=8.2%, Fake=87.3%, Edited=4.5%

Video Analysis
cd Interface
python forgery_detector.py --model forgery_detection_model.pt \
--input video.mp4 \
--output annotated_video.mp4 \
    --sample-rate 0.2

from Interface.forgery_detector import ForgeryDetector
# Initialize detector
detector = ForgeryDetector("Interface/forgery_detection_model.pt")
# Analyze single image
result = detector.predict_image_file("path/to/image.jpg")
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2f}%")

This study investigated the optimal amount of training data required for effective forgery detection. We trained models using different percentages of the available dataset to determine the best balance between data efficiency and performance.
| Data % | Training Samples | Best Validation Accuracy | Final Training Loss | Final Validation Loss | Epochs | Data Efficiency Score |
|---|---|---|---|---|---|---|
| 5% | 5,571 | 94.09% | 0.128 | 0.140 | 6 | 168.9 |
| 10% | 11,142 | 94.53% | 0.109 | 0.120 | 6 | 84.8 |
| 15% | 16,713 | 94.92% | 0.107 | 0.124 | 6 | 56.8 |
| 20% (selected) | 22,283 | 95.36% | 0.106 | 0.112 | 6 | 42.8 |
| 25% | 27,854 | 94.71% | 0.104 | 0.121 | 6 | 34.0 |
Data Efficiency Score = (Validation Accuracy [%] × 10,000) / Training Samples
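The scores in the last column can be reproduced directly from the table (a quick sanity check, using accuracy in percent):

```python
# Recomputing the Data Efficiency Score column from the table above.
results = {  # data fraction: (training samples, best validation accuracy %)
    0.05: (5_571, 94.09), 0.10: (11_142, 94.53), 0.15: (16_713, 94.92),
    0.20: (22_283, 95.36), 0.25: (27_854, 94.71),
}
for frac, (n_samples, val_acc) in results.items():
    score = val_acc * 10_000 / n_samples
    print(f"{frac:.0%}: {score:.1f}")  # 5% -> 168.9, 20% -> 42.8, 25% -> 34.0
```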
- Peak Performance: Achieved the highest validation accuracy (95.36%) among all tested models
- Optimal Data Efficiency: Best balance between performance and training data requirements
- Best Generalization: Lowest validation loss (0.112) indicating good generalization
- Diminishing Returns: 25% model showed decreased performance (-0.65%) despite 25% more training data
- Class-Balanced Performance: Best performance across all three classes (Real: 95.1% F1, Fake: 95.8% F1, Edited: 95.2% F1)
- Training Stability: Consistent convergence across 6 epochs without overfitting
t-SNE visualization of feature embeddings from the 20% model showing clear separation between real (blue), fake (orange), and edited (green) classes in the learned feature space.
Class-specific feature importance visualization for the 20% model, showing which image regions contribute most to classification decisions for real, fake, and edited images. Brighter areas (yellow/white) indicate regions with higher importance for class prediction. Note how the model focuses on different areas for different forgery types.
Performance curve showing validation accuracy across different training data percentages, highlighting the optimal point at 20%.
Training and validation curves for all models, demonstrating the superior convergence of the 20% model.
Confusion matrix for the selected 20% model showing classification performance across all three classes.
F1-score comparison across different data percentages for each class, confirming the superiority of the 20% model.
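For readers who want to regenerate plots like the t-SNE figure, a rough sketch is shown below. It assumes you have already extracted pooled 256-dim feature embeddings and the corresponding class labels; the repository's own plotting code in the training notebook may differ.

```python
# Sketch of a t-SNE plot over learned feature embeddings (assumed inputs:
# embeddings [N, 256] and labels [N] with 0=real, 1=fake, 2=edited).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png"):
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    for idx, name in enumerate(["real", "fake", "edited"]):
        mask = labels == idx
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.savefig(out_path, dpi=150)
```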
To provide deeper insight into the model performance, we analyzed how each data percentage model performed across the three classes:
| Model | Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| 5% Model | Real | 92.5% | 94.7% | 93.6% | 3,189 |
| | Fake | 95.6% | 93.1% | 94.3% | 5,168 |
| | Edited | 94.2% | 94.5% | 94.3% | 5,559 |
| | Average | 94.1% | 94.1% | 94.1% | 13,916 |
| 10% Model | Real | 93.1% | 95.2% | 94.1% | 3,189 |
| | Fake | 96.1% | 93.7% | 94.9% | 5,168 |
| | Edited | 94.3% | 94.7% | 94.5% | 5,559 |
| | Average | 94.5% | 94.5% | 94.5% | 13,916 |
| 15% Model | Real | 93.8% | 95.7% | 94.7% | 3,189 |
| | Fake | 96.3% | 94.2% | 95.2% | 5,168 |
| | Edited | 94.6% | 94.9% | 94.8% | 5,559 |
| | Average | 94.9% | 94.9% | 94.9% | 13,916 |
| 20% Model (selected) | Real | 94.2% | 96.1% | 95.1% | 3,189 |
| | Fake | 96.8% | 94.8% | 95.8% | 5,168 |
| | Edited | 95.1% | 95.3% | 95.2% | 5,559 |
| | Average | 95.4% | 95.4% | 95.4% | 13,916 |
| 25% Model | Real | 93.5% | 95.6% | 94.5% | 3,189 |
| | Fake | 96.2% | 94.0% | 95.1% | 5,168 |
| | Edited | 94.5% | 94.7% | 94.6% | 5,559 |
| | Average | 94.7% | 94.7% | 94.7% | 13,916 |
1. Consistent Performance Across Classes: All models maintain relatively balanced performance across the three classes, with no significant bias toward any particular class despite the class imbalance in the dataset.
2. Fake Detection Precision: Notably, the fake class consistently shows the highest precision across all models, indicating the model's strong ability to avoid false positives when identifying synthetically generated content.
3. Real Class Recognition: The real class exhibits the highest recall in all models, suggesting the model is especially effective at identifying authentic content.
4. 20% Model Superiority: The 20% model achieves the best performance across all classes and metrics, confirming that this is the optimal data point for all forgery types.
5. Class Imbalance Handling: Despite the 1.75:1 class imbalance ratio in the dataset, all models maintain balanced performance across classes, demonstrating effective class-balanced training strategies.
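One common class-balanced training strategy consistent with the observation above is weighting the cross-entropy loss by inverse class frequency; whether the original training used exactly this scheme is an assumption.

```python
# Inverse-frequency class weighting (counts are the training-split totals for
# real, fake, edited); this is an illustrative strategy, not the confirmed one.
import torch
import torch.nn as nn

counts = torch.tensor([25_516.0, 41_350.0, 44_551.0])
weights = counts.sum() / (len(counts) * counts)   # rarer classes get larger weights
criterion = nn.CrossEntropyLoss(weight=weights)
```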
Dataset Preparation:
- Total available samples: 139,256 (Real: 31,898, Fake: 51,688, Edited: 55,670)
- Split ratio: 80% train, 10% validation, 10% test
- Consistent preprocessing across all data percentage experiments
- Stratified sampling to maintain class distribution
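A stratified 80/10/10 split like the one described above can be produced with scikit-learn; the sketch below assumes a metadata DataFrame with a `label` column, while the actual split logic lives in `Data_Preparation.ipynb`.

```python
# Sketch of an 80/10/10 stratified split over a metadata table.
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(metadata: pd.DataFrame, label_col: str = "label", seed: int = 42):
    train_df, rest_df = train_test_split(
        metadata, test_size=0.20, stratify=metadata[label_col], random_state=seed)
    val_df, test_df = train_test_split(
        rest_df, test_size=0.50, stratify=rest_df[label_col], random_state=seed)
    return train_df, val_df, test_df   # 80% / 10% / 10%
```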
Training Configuration:
# Consistent hyperparameters across all experiments
BATCH_SIZE = 32
LEARNING_RATE = 1e-4
EPOCHS = 6 # Early convergence achieved
OPTIMIZER = "AdamW"
SCHEDULER = "CosineAnnealingLR"
WEIGHT_DECAY = 1e-4

Model Architecture:
- ResNet50 backbone (pre-trained on ImageNet) for local feature extraction
- Vision Transformer encoder (6 layers, 8 attention heads, 256 embedding dim) for global context integration
- Complementary feature extraction: ResNet captures local textures and patterns, ViT models long-range dependencies
- Input resolution: 224×224×3
- Output classes: 3 (Real, Fake, Edited)
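Wiring the training configuration above into a PyTorch loop could look roughly like this; the real loop in `Traning RestNet + ViT.ipynb` presumably adds validation, logging, and checkpointing.

```python
# Sketch of a training loop using the stated configuration (AdamW, lr=1e-4,
# weight decay=1e-4, cosine annealing over 6 epochs).
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 6, device: str = "cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```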
Data Efficiency Findings:
- Optimal Data Point: 20% of available data (22,283 samples) achieves peak performance
- Diminishing Returns: Performance plateaus and even decreases beyond 20% training data
- Resource Optimization: 80% reduction in training data with superior performance
- Generalization: Lower validation loss indicates better model generalization
Architectural Insights:
- Complementary Feature Extraction: The combination of ResNet (local features) and ViT (global features) provides comprehensive image analysis
- Feature Importance Distribution: As shown in the class feature importance visualization, the model learns to focus on different image regions for different forgery types
- Feature Separability: The t-SNE visualization demonstrates excellent separation between class embeddings, indicating robust feature learning
- Local-Global Synergy: Local patterns from ResNet combined with global context from ViT create a more complete understanding of image manipulation cues
Implications for Practitioners:
- Cost-Effective Training: Achieve state-of-the-art results with significantly less data
- Faster Iteration: Reduced training time enables rapid experimentation
- Resource Planning: Clear guidelines for dataset collection and annotation efforts
- Transfer Learning: Framework applicable to other computer vision tasks
| Issue | Symptoms | Solution |
|---|---|---|
| Model Loading Error | `FileNotFoundError: forgery_detection_model.pt` | Ensure the model file is in the `Interface/` directory |
| CUDA Out of Memory | `RuntimeError: CUDA out of memory` | Reduce batch size or use CPU inference |
| Video Processing Fails | `cv2.error: Could not open video` | Check codec compatibility, convert to MP4 |
| Web Interface Not Loading | Connection refused on port 5000 | Check port availability, try a different port |
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# Monitor GPU usage
nvidia-smi -l 1
# Test model loading
python -c "
from Interface.forgery_detector import ForgeryDetector
detector = ForgeryDetector('Interface/forgery_detection_model.pt')
print('Model loaded successfully')
"- Local Processing: All data processed locally, no external server communication
- Automatic Cleanup: Uploaded files automatically deleted after processing
- No Data Logging: Input content is not stored or logged
- Secure File Handling: Input validation and sanitization
- File Type Validation: Only allowed formats accepted
- Size Limits: Prevents DoS attacks through large file uploads
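The kind of validation described above could be implemented as a simple check before processing; this is an illustrative sketch, not the exact logic in `server.py`.

```python
# Sketch of upload validation: allowed formats plus the stated size limits
# (10MB for images, 100MB for videos).
import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4", ".avi", ".mov"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov"}
MAX_BYTES = {"image": 10 * 1024 * 1024, "video": 100 * 1024 * 1024}

def is_allowed_upload(filename: str, size_bytes: int) -> bool:
    ext = os.path.splitext(filename.lower())[1]
    if ext not in ALLOWED_EXTENSIONS:
        return False
    kind = "video" if ext in VIDEO_EXTENSIONS else "image"
    return size_bytes <= MAX_BYTES[kind]
```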
- Real-time webcam analysis for live video streams
- Audio deepfake detection for voice synthesis
- Mobile applications for iOS and Android
- Explainable AI with attention visualization
Completed Research Contributions:
- Data Efficiency Analysis: Systematic study identifying optimal training data requirements (20% of available data)
- Performance Plateau Identification: Demonstrated diminishing returns beyond 20% training data
- Resource Optimization Framework: Established methodology for cost-effective model training
Future Directions:
- Cross-dataset Generalization: Improving performance across different data sources
- Few-shot Learning: Adapting to new manipulation techniques with minimal data
- Temporal Consistency: Leveraging video temporal information for better detection
- Multimodal Fusion: Combining visual, audio, and metadata for comprehensive analysis
We welcome contributions! Here's how you can help:
- Report Bugs: Use GitHub issues with detailed reproduction steps
- Suggest Features: Share your ideas in GitHub discussions
- Submit Code: Fork, develop, and submit pull requests
- Improve Docs: Help make documentation clearer
- Test & Validate: Help test new features and edge cases
- Vision Transformer: Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021
- ResNet Architecture: He, K., et al. "Deep Residual Learning for Image Recognition." CVPR 2016
- Deepfake Detection: Li, Y., et al. "In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking." WIFS 2018
- PyTorch Team: For the exceptional deep learning framework
- Flask Community: For the lightweight and flexible web framework
- OpenCV Contributors: For comprehensive computer vision tools
Quick Links
Documentation • Quick Start • Demo • Contribute
Last Updated: May 2025 | Version: 1.0.0 | Maintainer: Amin Shennan