A state-of-the-art deep learning system for detecting forged, manipulated, and synthetic media content using a hybrid ResNet50 + Vision Transformer (ViT) architecture.
Quick Start • Documentation • Demo • Contributing
This project implements a comprehensive forgery detection system capable of classifying media content into three distinct categories with high accuracy. A key contribution of this research is the data efficiency study that identified the optimal training data size, demonstrating that 20% of available data achieves peak performance (95.36% validation accuracy) while maintaining computational efficiency. Additionally, the system leverages a complementary feature extraction approach, combining ResNet50 for local pattern analysis with Vision Transformer for global context modeling.
| Category | Description | Examples |
|---|---|---|
| Real | Authentic, unmodified content | Original photos, genuine videos |
| Fake | Synthetically generated content | StyleGAN, VQGAN, AI-generated images |
| Edited | Manipulated authentic content | Deepfakes, face swaps, Wav2Lip |
- Multi-class Classification: Distinguishes between real, fake, and edited content with 95.36% validation accuracy
- Complementary Feature Extraction: Combines ResNet50 (local features) with Vision Transformer (global context)
- Data Efficiency Optimization: Achieved optimal performance using only 20% of available training data
- Class-Specific Feature Analysis: Identifies and leverages different image regions for different forgery types
- Hybrid Architecture: 6-layer Vision Transformer (8 attention heads, 256 embedding dim) built on ResNet50 features
- Comprehensive Study: Systematic evaluation of training data requirements (5%, 10%, 15%, 20%, 25%)
- Video Support: Frame-by-frame analysis with temporal aggregation (see the sketch after this list)
- Web Interface: User-friendly web application with drag & drop functionality
- Real-time Processing: Optimized for both batch and real-time inference
- Resource Efficient: Reduced training time and computational requirements through optimal data utilization
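For video inputs, the detector samples frames and aggregates per-frame predictions into a single verdict. Below is a minimal sketch of that idea, assuming a hypothetical `predict_frame` callable that returns per-class probabilities; the repository's `forgery_detector.py` is the authoritative implementation and may aggregate differently.

```python
# Sketch of frame-by-frame analysis with temporal aggregation (assumption:
# `predict_frame(frame)` returns [p_real, p_fake, p_edited] for one frame).
import cv2
import numpy as np

def predict_video(path: str, predict_frame, sample_rate: float = 0.2):
    cap = cv2.VideoCapture(path)
    step = max(int(round(1 / sample_rate)), 1)   # sample_rate=0.2 -> every 5th frame
    probs, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            probs.append(predict_frame(frame))   # per-frame class probabilities
        idx += 1
    cap.release()
    mean_probs = np.mean(probs, axis=0)          # temporal aggregation by averaging
    classes = ["real", "fake", "edited"]
    return classes[int(np.argmax(mean_probs))], mean_probs
```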
Based on the verification report from our processed dataset:
- Total Samples: 139,256 images
- Training Set: 111,417 samples (80%)
- Validation Set: 13,916 samples (10%)
- Test Set: 13,923 samples (10%)
Exact Class Distribution:
| Split | Edited | Fake | Real | Total |
|---|---|---|---|---|
| Train | 44,551 (40.0%) | 41,350 (37.1%) | 25,516 (22.9%) | 111,417 |
| Validation | 5,559 (39.9%) | 5,168 (37.1%) | 3,189 (22.9%) | 13,916 |
| Test | 5,560 (39.9%) | 5,170 (37.1%) | 3,193 (22.9%) | 13,923 |
| Total | 55,670 (40.0%) | 51,688 (37.1%) | 31,898 (22.9%) | 139,256 |
Class Imbalance Ratio: 1.75 (Majority:Minority)
This system leverages a complementary dual-feature extraction strategy:
1. Local Feature Extraction (ResNet50): Captures fine-grained local patterns and textures that may indicate manipulation, including compression artifacts, noise inconsistencies, and edge anomalies at the pixel level.
2. Global Feature Integration (Vision Transformer): Analyzes relationships between distant image regions, capturing semantic inconsistencies and global context that may not be apparent locally. The self-attention mechanism effectively models long-range dependencies in the feature space.
This complementary approach allows the model to simultaneously reason about both local manipulation artifacts and global image coherence, resulting in more robust forgery detection.
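As a concrete illustration, a hybrid model of this shape could be written in PyTorch roughly as follows. The dimensions match the specification in this README (ResNet50 backbone, 6 encoder layers, 8 attention heads, 256-dim embeddings, 1024-dim feed-forward, 3 classes); details such as the pooling choice and the learned positional encoding are assumptions rather than the exact training code.

```python
# Minimal sketch of the hybrid ResNet50 + Vision Transformer classifier.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridForgeryNet(nn.Module):
    def __init__(self, num_classes=3, embed_dim=256, depth=6, heads=8, ffn_dim=1024):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # -> [B, 2048, 7, 7]
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)           # patch embedding
        self.pos = nn.Parameter(torch.zeros(1, 49, embed_dim))          # learned positional encoding
        layer = nn.TransformerEncoderLayer(embed_dim, heads, ffn_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)              # 6 self-attention layers
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                          # x: [B, 3, 224, 224]
        feats = self.proj(self.backbone(x))        # [B, 256, 7, 7]
        tokens = feats.flatten(2).transpose(1, 2)  # [B, 49, 256] token sequence
        tokens = self.encoder(tokens + self.pos)
        return self.head(tokens.mean(dim=1))       # global pooling -> class logits
```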
```
                        FORGERY DETECTION PIPELINE

INPUT STAGE
  Image/Video → Preprocessing → Normalization → Resizing (224×224)
       │
       ▼
FEATURE EXTRACTION (ResNet50 Backbone)
  Conv1 (64×112²) → Layer1 (256×56²) → Layer2 (512×28²) → Layer3 (1024×14²) → Layer4 (2048×7²)
       │
       ▼
TRANSFORMER PROCESSING
  Patch Embedding → Positional Encoding → Multi-Head Attention
  [2048, 7, 7] → [256, 49] → Self-Attention Layers (×6)
    Layer 1: Query, Key, Value Matrices
    Layer 2: Multi-Head Attention (8 heads)
    Layer 3: Feed-Forward Network (1024 dim)
    Layer 4: Residual Connections
    Layer 5: Layer Normalization
    Layer 6: Global Context Integration
       │
       ▼
CLASSIFICATION
  Global Pooling → Linear Layer → Softmax → [Real, Fake, Edited]
  [256] → [256, 3] → [3] → Confidence Scores
       │
       ▼
OUTPUT
  Prediction Class + Confidence Score + Per-Class Probabilities
```
| Component | Specification | Details |
|---|---|---|
| Backbone | ResNet50 | Pre-trained on ImageNet, extracts local textures and patterns |
| Transformer | 6-layer encoder | 8 attention heads, 256 embedding dim, models global image context |
| Integration | Complementary fusion | Local features (ResNet) + Global features (ViT) |
| Input Size | 224×224×3 | RGB images, normalized to ImageNet stats |
| Output Classes | 3 categories | Real, Fake, Edited with confidence scores |
| Framework | PyTorch 1.12+ | TorchScript optimized for deployment |
| Model Size | 167MB | TorchScript compiled model |
| Deployment | TorchScript | Cross-platform inference optimization |
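For reference, loading the compiled TorchScript model and running a single prediction might look like the sketch below. The 224×224 resize and ImageNet normalization follow the table above, while the class-index order and file paths are assumptions; see `Interface/forgery_detector.py` for the actual logic.

```python
# Sketch of TorchScript inference (assumptions: class order, file paths).
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = torch.jit.load("Interface/forgery_detection_model.pt", map_location="cpu").eval()
classes = ["real", "fake", "edited"]            # assumed index order

with torch.no_grad():
    x = preprocess(Image.open("sample.jpg").convert("RGB")).unsqueeze(0)  # [1, 3, 224, 224]
    probs = torch.softmax(model(x), dim=1)[0]
print(classes[int(probs.argmax())], [f"{p:.1%}" for p in probs.tolist()])
```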
```
Forgery_Detection_final/
├── Interface/                          # Web Application Layer
│   ├── server.py                       # Flask web server & API endpoints
│   ├── forgery_detector.py             # Core detection logic & model interface
│   ├── index.html                      # Main web interface (drag & drop)
│   ├── about.html                      # Project documentation page
│   ├── static/                         # CSS, JS, and static assets
│   ├── uploads/                        # Temporary uploaded files
│   ├── results/                        # Processing results & outputs
│   └── forgery_detection_model.pt      # Trained model (167MB TorchScript)
├── deployment/                         # Production Deployment
│   ├── inference.py                    # Standalone inference script
│   └── Dockerfile                      # Container configuration
├── models/                             # Model Artifacts & Visualizations
│   ├── checkpoints/                    # Training checkpoints & weights
│   ├── predictions_0.0500__model.png   # 5% model predictions
│   ├── predictions_0.1000__model.png   # 10% model predictions
│   ├── predictions_0.1500__model.png   # 15% model predictions
│   ├── predictions_0.2000__model.png   # 20% model predictions (selected)
│   └── predictions_0.2500__model.png   # 25% model predictions
├── processed_data/                     # Processed Datasets (139,256 samples)
│   ├── train/                          # Training data (111,417 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── val/                            # Validation data (13,916 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── test/                           # Test data (13,923 samples)
│   │   ├── real/                       # Authentic images
│   │   ├── fake/                       # Synthetic images
│   │   └── edited/                     # Manipulated images
│   ├── train_metadata.csv              # Training set metadata
│   ├── val_metadata.csv                # Validation set metadata
│   ├── test_metadata.csv               # Test set metadata
│   └── verification_report.json        # Dataset statistics & validation
├── training_logs/                      # Training Metrics & Visualizations
│   ├── tsne_visualization_0.2000.png   # t-SNE feature visualization (20% model)
│   ├── confusion_matrix_0.2000.png     # Confusion matrix (20% model)
│   ├── accuracy_vs_data_percent.png    # Data efficiency comparison
│   ├── learning_curves_comparison.png  # Training curves across all models
│   ├── f1_scores_comparison.png        # F1-score comparison
│   ├── metrics_log_sample_0.2000.json  # Actual training metrics (20% model)
│   └── [other model metrics...]        # Metrics for 5%, 10%, 15%, 25% models
├── Traning RestNet + ViT.ipynb         # Model training notebook
├── Data_Preparation.ipynb              # Data preprocessing notebook
├── requirements.txt                    # Python dependencies
├── setup.py                            # Package installation script
├── quick_start.py                      # Quick setup and demo script
└── README.md                           # This comprehensive guide
```
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.8+ | 3.9+ | Required for PyTorch compatibility |
| RAM | 8GB | 16GB+ | For model loading and processing |
| Storage | 10GB | 20GB+ | Models, datasets, and dependencies |
| GPU | Optional | CUDA 11.0+ | Significantly faster inference with GPU |
| OS | Windows/Linux/macOS | Linux Ubuntu 20.04+ | Cross-platform support |
# 1. Clone the repository
git clone <repository-url>
cd Forgery_Detection_final

# 2. Create virtual environment (recommended)
python -m venv forgery_env
source forgery_env/bin/activate  # Linux/macOS
# forgery_env\Scripts\activate   # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run quick start script
python quick_start.py --setup

# 5. Launch web interface
cd Interface
python server.py

# Core ML and Deep Learning
torch>=1.12.0
torchvision>=0.13.0
torchaudio>=0.12.0
# Web Framework
flask>=2.0.0
flask-cors>=3.0.0
# Image and Video Processing
pillow>=8.0.0
opencv-python>=4.5.0
# Data Science and Utilities
numpy>=1.21.0
pandas>=1.3.0
matplotlib>=3.5.0
scikit-learn>=1.0.0
tqdm>=4.62.0
# Development and Jupyter
jupyter>=1.0.0
notebook>=6.4.0
# Optional: Production deployment
gunicorn>=20.1.0

1. Start the Web Server
cd Interface
python server.py
# Server will start at http://localhost:5000
# Access from any web browser

2. Upload and Analyze
- Supported Formats: JPG, PNG, MP4, AVI, MOV
- File Size Limits: Images (10MB), Videos (100MB)
- Features: Drag & drop upload, real-time processing, downloadable results
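If you prefer calling the server programmatically rather than through the browser, something along these lines should work; the endpoint path and response fields here are hypothetical, so check the routes defined in `Interface/server.py`.

```python
# Hypothetical API call to the local Flask server (endpoint name is an assumption).
import requests

with open("sample.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/upload", files={"file": f})
print(resp.status_code, resp.json())  # expected: prediction, confidence, probabilities
```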
Single Image Analysis
cd deployment
python inference.py forgery_detection_model.pt path/to/image.jpg
# Output:
# Prediction: fake
# Confidence: 87.3%
# Probabilities: Real=8.2%, Fake=87.3%, Edited=4.5%

Video Analysis
cd Interface
python forgery_detector.py --model forgery_detection_model.pt \
--input video.mp4 \
--output annotated_video.mp4 \
    --sample-rate 0.2

from Interface.forgery_detector import ForgeryDetector
# Initialize detector
detector = ForgeryDetector("Interface/forgery_detection_model.pt")
# Analyze single image
result = detector.predict_image_file("path/to/image.jpg")
print(f"Prediction: {result['prediction']}")
print(f"Confidence: {result['confidence']:.2f}%")

This study investigated the optimal amount of training data required for effective forgery detection. We trained models using different percentages of the available dataset to determine the best balance between data efficiency and performance.
| Data % | Training Samples | Best Validation Accuracy | Final Training Loss | Final Validation Loss | Epochs | Data Efficiency Score |
|---|---|---|---|---|---|---|
| 5% | 5,571 | 94.09% | 0.128 | 0.140 | 6 | 168.9 |
| 10% | 11,142 | 94.53% | 0.109 | 0.120 | 6 | 84.8 |
| 15% | 16,713 | 94.92% | 0.107 | 0.124 | 6 | 56.8 |
| 20% (selected) | 22,283 | 95.36% | 0.106 | 0.112 | 6 | 42.8 |
| 25% | 27,854 | 94.71% | 0.104 | 0.121 | 6 | 34.0 |
Data Efficiency Score = (Validation Accuracy [%] × 10,000) / Training Samples
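The scores in the last column can be reproduced directly from the table (a quick sanity check, using accuracy in percent):

```python
# Recomputing the Data Efficiency Score column from the table above.
results = {  # data fraction: (training samples, best validation accuracy %)
    0.05: (5_571, 94.09), 0.10: (11_142, 94.53), 0.15: (16_713, 94.92),
    0.20: (22_283, 95.36), 0.25: (27_854, 94.71),
}
for frac, (n_samples, val_acc) in results.items():
    score = val_acc * 10_000 / n_samples
    print(f"{frac:.0%}: {score:.1f}")  # 5% -> 168.9, 20% -> 42.8, 25% -> 34.0
```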
- Peak Performance: Achieved the highest validation accuracy (95.36%) among all tested models
- Optimal Data Efficiency: Best balance between performance and training data requirements
- Best Generalization: Lowest validation loss (0.112) indicating good generalization
- Diminishing Returns: 25% model showed decreased performance (-0.65%) despite 25% more training data
- Class-Balanced Performance: Best performance across all three classes (Real: 95.1% F1, Fake: 95.8% F1, Edited: 95.2% F1)
- Training Stability: Consistent convergence across 6 epochs without overfitting
t-SNE visualization of feature embeddings from the 20% model showing clear separation between real (blue), fake (orange), and edited (green) classes in the learned feature space.
Class-specific feature importance visualization for the 20% model, showing which image regions contribute most to classification decisions for real, fake, and edited images. Brighter areas (yellow/white) indicate regions with higher importance for class prediction. Note how the model focuses on different areas for different forgery types.
Performance curve showing validation accuracy across different training data percentages, highlighting the optimal point at 20%.
Training and validation curves for all models, demonstrating the superior convergence of the 20% model.
Confusion matrix for the selected 20% model showing classification performance across all three classes.
F1-score comparison across different data percentages for each class, confirming the superiority of the 20% model.
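For readers who want to regenerate plots like the t-SNE figure, a rough sketch is shown below. It assumes you have already extracted pooled 256-dim feature embeddings and the corresponding class labels; the repository's own plotting code in the training notebook may differ.

```python
# Sketch of a t-SNE plot over learned feature embeddings (assumed inputs:
# embeddings [N, 256] and labels [N] with 0=real, 1=fake, 2=edited).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png"):
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    for idx, name in enumerate(["real", "fake", "edited"]):
        mask = labels == idx
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.savefig(out_path, dpi=150)
```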
To provide deeper insight into the model performance, we analyzed how each data percentage model performed across the three classes:
| Model | Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|---|
| 5% Model | Real | 92.5% | 94.7% | 93.6% | 3,189 |
| | Fake | 95.6% | 93.1% | 94.3% | 5,168 |
| | Edited | 94.2% | 94.5% | 94.3% | 5,559 |
| | Average | 94.1% | 94.1% | 94.1% | 13,916 |
| 10% Model | Real | 93.1% | 95.2% | 94.1% | 3,189 |
| | Fake | 96.1% | 93.7% | 94.9% | 5,168 |
| | Edited | 94.3% | 94.7% | 94.5% | 5,559 |
| | Average | 94.5% | 94.5% | 94.5% | 13,916 |
| 15% Model | Real | 93.8% | 95.7% | 94.7% | 3,189 |
| | Fake | 96.3% | 94.2% | 95.2% | 5,168 |
| | Edited | 94.6% | 94.9% | 94.8% | 5,559 |
| | Average | 94.9% | 94.9% | 94.9% | 13,916 |
| 20% Model (selected) | Real | 94.2% | 96.1% | 95.1% | 3,189 |
| | Fake | 96.8% | 94.8% | 95.8% | 5,168 |
| | Edited | 95.1% | 95.3% | 95.2% | 5,559 |
| | Average | 95.4% | 95.4% | 95.4% | 13,916 |
| 25% Model | Real | 93.5% | 95.6% | 94.5% | 3,189 |
| | Fake | 96.2% | 94.0% | 95.1% | 5,168 |
| | Edited | 94.5% | 94.7% | 94.6% | 5,559 |
| | Average | 94.7% | 94.7% | 94.7% | 13,916 |
1. Consistent Performance Across Classes: All models maintain relatively balanced performance across the three classes, with no significant bias toward any particular class despite the class imbalance in the dataset.
2. Fake Detection Precision: Notably, the fake class consistently shows the highest precision across all models, indicating the model's strong ability to avoid false positives when identifying synthetically generated content.
3. Real Class Recognition: The real class exhibits the highest recall in all models, suggesting the model is especially effective at identifying authentic content.
4. 20% Model Superiority: The 20% model achieves the best performance across all classes and metrics, confirming that this is the optimal data point for all forgery types.
5. Class Imbalance Handling: Despite the 1.75:1 class imbalance ratio in the dataset, all models maintain balanced performance across classes, demonstrating effective class-balanced training strategies.
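One common class-balanced training strategy consistent with the observation above is weighting the cross-entropy loss by inverse class frequency; whether the original training used exactly this scheme is an assumption.

```python
# Inverse-frequency class weighting (counts are the training-split totals for
# real, fake, edited); this is an illustrative strategy, not the confirmed one.
import torch
import torch.nn as nn

counts = torch.tensor([25_516.0, 41_350.0, 44_551.0])
weights = counts.sum() / (len(counts) * counts)   # rarer classes get larger weights
criterion = nn.CrossEntropyLoss(weight=weights)
```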
Dataset Preparation:
- Total available samples: 139,256 (Real: 31,898, Fake: 51,688, Edited: 55,670)
- Split ratio: 80% train, 10% validation, 10% test
- Consistent preprocessing across all data percentage experiments
- Stratified sampling to maintain class distribution
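A stratified 80/10/10 split like the one described above can be produced with scikit-learn; the sketch below assumes a metadata DataFrame with a `label` column, while the actual split logic lives in `Data_Preparation.ipynb`.

```python
# Sketch of an 80/10/10 stratified split over a metadata table.
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split(metadata: pd.DataFrame, label_col: str = "label", seed: int = 42):
    train_df, rest_df = train_test_split(
        metadata, test_size=0.20, stratify=metadata[label_col], random_state=seed)
    val_df, test_df = train_test_split(
        rest_df, test_size=0.50, stratify=rest_df[label_col], random_state=seed)
    return train_df, val_df, test_df   # 80% / 10% / 10%
```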
Training Configuration:
# Consistent hyperparameters across all experiments
BATCH_SIZE = 32
LEARNING_RATE = 1e-4
EPOCHS = 6 # Early convergence achieved
OPTIMIZER = "AdamW"
SCHEDULER = "CosineAnnealingLR"
WEIGHT_DECAY = 1e-4

Model Architecture:
- ResNet50 backbone (pre-trained on ImageNet) for local feature extraction
- Vision Transformer encoder (6 layers, 8 attention heads, 256 embedding dim) for global context integration
- Complementary feature extraction: ResNet captures local textures and patterns, ViT models long-range dependencies
- Input resolution: 224×224×3
- Output classes: 3 (Real, Fake, Edited)
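Wiring the training configuration above into a PyTorch loop could look roughly like this; the real loop in `Traning RestNet + ViT.ipynb` presumably adds validation, logging, and checkpointing.

```python
# Sketch of a training loop using the stated configuration (AdamW, lr=1e-4,
# weight decay=1e-4, cosine annealing over 6 epochs).
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, epochs: int = 6, device: str = "cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```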
Data Efficiency Findings:
- Optimal Data Point: 20% of available data (22,283 samples) achieves peak performance
- Diminishing Returns: Performance plateaus and even decreases beyond 20% training data
- Resource Optimization: 80% reduction in training data with superior performance
- Generalization: Lower validation loss indicates better model generalization
Architectural Insights:
- Complementary Feature Extraction: The combination of ResNet (local features) and ViT (global features) provides comprehensive image analysis
- Feature Importance Distribution: As shown in the class feature importance visualization, the model learns to focus on different image regions for different forgery types
- Feature Separability: The t-SNE visualization demonstrates excellent separation between class embeddings, indicating robust feature learning
- Local-Global Synergy: Local patterns from ResNet combined with global context from ViT create a more complete understanding of image manipulation cues
Implications for Practitioners:
- Cost-Effective Training: Achieve state-of-the-art results with significantly less data
- Faster Iteration: Reduced training time enables rapid experimentation
- Resource Planning: Clear guidelines for dataset collection and annotation efforts
- Transfer Learning: Framework applicable to other computer vision tasks
| Issue | Symptoms | Solution |
|---|---|---|
| Model Loading Error | `FileNotFoundError: forgery_detection_model.pt` | Ensure the model file is in the `Interface/` directory |
| CUDA Out of Memory | `RuntimeError: CUDA out of memory` | Reduce batch size or use CPU inference |
| Video Processing Fails | `cv2.error: Could not open video` | Check codec compatibility, convert to MP4 |
| Web Interface Not Loading | Connection refused on port 5000 | Check port availability, try a different port |
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# Monitor GPU usage
nvidia-smi -l 1
# Test model loading
python -c "
from Interface.forgery_detector import ForgeryDetector
detector = ForgeryDetector('Interface/forgery_detection_model.pt')
print('Model loaded successfully')
"- Local Processing: All data processed locally, no external server communication
- Automatic Cleanup: Uploaded files automatically deleted after processing
- No Data Logging: Input content is not stored or logged
- Secure File Handling: Input validation and sanitization
- File Type Validation: Only allowed formats accepted
- Size Limits: Prevents DoS attacks through large file uploads
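The kind of validation described above could be implemented as a simple check before processing; this is an illustrative sketch, not the exact logic in `server.py`.

```python
# Sketch of upload validation: allowed formats plus the stated size limits
# (10MB for images, 100MB for videos).
import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4", ".avi", ".mov"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov"}
MAX_BYTES = {"image": 10 * 1024 * 1024, "video": 100 * 1024 * 1024}

def is_allowed_upload(filename: str, size_bytes: int) -> bool:
    ext = os.path.splitext(filename.lower())[1]
    if ext not in ALLOWED_EXTENSIONS:
        return False
    kind = "video" if ext in VIDEO_EXTENSIONS else "image"
    return size_bytes <= MAX_BYTES[kind]
```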
- Real-time webcam analysis for live video streams
- Audio deepfake detection for voice synthesis
- Mobile applications for iOS and Android
- Explainable AI with attention visualization
Completed Research Contributions:
- Data Efficiency Analysis: Systematic study identifying optimal training data requirements (20% of available data)
- Performance Plateau Identification: Demonstrated diminishing returns beyond 20% training data
- Resource Optimization Framework: Established methodology for cost-effective model training
Future Directions:
- Cross-dataset Generalization: Improving performance across different data sources
- Few-shot Learning: Adapting to new manipulation techniques with minimal data
- Temporal Consistency: Leveraging video temporal information for better detection
- Multimodal Fusion: Combining visual, audio, and metadata for comprehensive analysis
We welcome contributions! Here's how you can help:
- Report Bugs: Use GitHub issues with detailed reproduction steps
- Suggest Features: Share your ideas in GitHub discussions
- Submit Code: Fork, develop, and submit pull requests
- Improve Docs: Help make documentation clearer
- Test & Validate: Help test new features and edge cases
- Vision Transformer: Dosovitskiy, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021
- ResNet Architecture: He, K., et al. "Deep Residual Learning for Image Recognition." CVPR 2016
- Deepfake Detection: Li, Y., et al. "In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking." WIFS 2018
- PyTorch Team: For the exceptional deep learning framework
- Flask Community: For the lightweight and flexible web framework
- OpenCV Contributors: For comprehensive computer vision tools
Quick Links
Documentation • Quick Start • Demo • Contribute
Last Updated: May 2025 | Version: 1.0.0 | Maintainer: Amin Shennan