Skip to content

Naeem1144/spam-email-detection-system

Repository files navigation

πŸ›‘οΈ Advanced Spam Email Detection System

A comprehensive, machine learning-powered spam email detection system with advanced features, ensemble modeling, and explainable AI.

πŸš€ Features

Core Capabilities

  • Multi-Model Ensemble: Combines Bernoulli Naive Bayes, Logistic Regression, and SVM for robust predictions
  • Advanced Text Processing: Intelligent preprocessing with URL/email extraction and spam-specific feature engineering
  • Confidence Scoring: Provides prediction confidence and detailed explanations
  • Real-time Analysis: Fast processing with comprehensive feature extraction
  • Explainable AI: Human-readable explanations for each prediction

Technical Features

  • Robust Error Handling: Comprehensive logging and graceful failure recovery
  • Feature Engineering: 15+ spam-specific features including:
    • Text statistics (word count, character analysis)
    • URL and domain reputation checking
    • Spam keyword detection (urgency, money, suspicious, promotional)
    • HTML content analysis
    • Pattern recognition (repeated characters, excessive punctuation)
  • Performance Monitoring: Processing time tracking and model performance logging
  • Modern UI: Beautiful, responsive Streamlit interface with detailed analytics

πŸ“Š Model Performance

  • Accuracy: ~96-97% on test data
  • Precision: High precision to minimize false positives
  • Recall: Optimized to catch most spam emails
  • Processing Time: <1 second per email

πŸ› οΈ Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/Naeem1144/spam-email-detection-system
    cd spam-email-detection-system
  2. Install dependencies:

    pip install -r requirements.txt
  3. Download NLTK data (automatic on first run):

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
  4. Ensure model file exists:

    • The system expects Bernoulli_model_for_email.pkl in the project directory
    • This file should contain a trained scikit-learn Pipeline

πŸš€ Usage

Web Application (Recommended)

Run the Streamlit web application:

streamlit run spam_detector.py

The application will open in your browser with:

  • Email Analysis: Paste email content for instant spam detection
  • Feature Visualization: Real-time feature extraction and analysis
  • Detailed Explanations: AI-powered explanations for each prediction
  • Model Insights: Individual model predictions and confidence scores

Programmatic Usage

from spam_detector import SpamDetectorEnsemble

# Initialize the detector
detector = SpamDetectorEnsemble()

# Analyze an email
email_text = "Your email content here..."
result = detector.predict(email_text)

# Access results
print(f"Is Spam: {result.is_spam}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Spam Probability: {result.spam_probability:.2%}")
print("Explanations:")
for explanation in result.explanation:
    print(f"  - {explanation}")

πŸ“ˆ System Architecture

Component Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Input Email   │───▢│  Text Processor  │───▢│  Feature Vector β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Explanations  │◀───│ Ensemble Models  │◀───│   Features +    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  Cleaned Text   β”‚
                                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Classes and Methods

AdvancedTextProcessor

  • Purpose: Advanced text preprocessing and feature extraction
  • Key Methods:
    • extract_features(): Extracts 15+ spam-specific features
    • clean_text(): Advanced text cleaning for ML models
    • _check_suspicious_domains(): URL reputation analysis

SpamDetectorEnsemble

  • Purpose: Main prediction engine with ensemble modeling
  • Key Methods:
    • predict(): Complete email analysis with explanations
    • load_models(): Model initialization and ensemble creation
    • _generate_explanation(): Human-readable prediction explanations

PredictionResult (DataClass)

  • Purpose: Structured container for prediction results
  • Attributes: confidence, spam_probability, features_used, explanations, etc.

πŸ”§ Configuration

Model Configuration

The system can be configured by modifying class parameters:

# Custom model path
detector = SpamDetectorEnsemble(model_path="path/to/your/model.pkl")

# Custom spam keywords
processor = AdvancedTextProcessor()
processor.spam_keywords['custom'] = ['your', 'keywords', 'here']

Logging Configuration

Logs are written to spam_detector.log and console. Adjust logging level:

import logging
logging.getLogger('spam_detector').setLevel(logging.DEBUG)

πŸ§ͺ Testing

Manual Testing

Use the provided test cases:

# Test cases
test_spam = "URGENT! Win $1,000,000 NOW! Click here immediately!"
test_legitimate = "Hi John, let's schedule our meeting for tomorrow at 2 PM."

detector = SpamDetectorEnsemble()
result_spam = detector.predict(test_spam)
result_legit = detector.predict(test_legitimate)

Feature Validation

Verify feature extraction:

processor = AdvancedTextProcessor()
features = processor.extract_features(email_text)
print(f"Features extracted: {len(features)}")
print(f"Spam keywords found: {sum(features[k] for k in features if 'keywords' in k)}")

πŸ“ Improvements Over Previous Version

Code Quality

  • βœ… Proper Documentation: Comprehensive docstrings and type hints
  • βœ… Error Handling: Robust exception handling and logging
  • βœ… Code Structure: Object-oriented design with clear separation of concerns
  • βœ… Input Validation: Comprehensive input sanitization and validation

Feature Engineering

  • βœ… Advanced Preprocessing: URL/email extraction, HTML handling
  • βœ… Spam-Specific Features: 15+ engineered features for better detection
  • βœ… Domain Reputation: Suspicious domain detection
  • βœ… Pattern Recognition: Advanced text pattern analysis

Model Enhancements

  • βœ… Ensemble Voting: Multiple models for robust predictions
  • βœ… Confidence Scoring: Probabilistic outputs with confidence measures
  • βœ… Explainable AI: Detailed explanations for each prediction
  • βœ… Performance Tracking: Processing time and model performance monitoring

User Experience

  • βœ… Modern UI: Beautiful, responsive Streamlit interface
  • βœ… Real-time Features: Live feature extraction and visualization
  • βœ… Detailed Analytics: Comprehensive analysis dashboard
  • βœ… Performance Metrics: Processing time and accuracy information

πŸ”’ Security Considerations

  • Local Processing: All analysis performed locally, no data sent to external services
  • Input Sanitization: Comprehensive input validation and cleaning
  • Error Handling: Graceful failure modes to prevent system compromise
  • Logging: Audit trail for all predictions and system events

πŸ› Troubleshooting

Common Issues

  1. Model File Not Found:

    FileNotFoundError: Bernoulli_model_for_email.pkl
    

    Solution: Ensure the trained model file is in the project directory

  2. NLTK Data Missing:

    LookupError: Resource punkt not found
    

    Solution: Run nltk.download('punkt') and nltk.download('stopwords')

  3. Memory Issues:

    • Reduce batch processing size
    • Consider model quantization for large deployments

Performance Optimization

  • CPU Usage: The system is optimized for single-core performance
  • Memory Usage: ~100MB RAM for typical usage
  • Disk Usage: ~10MB for models and dependencies

πŸ“ž Support

For issues, feature requests, or contributions:

  1. Check the troubleshooting section
  2. Review the logs in spam_detector.log
  3. Create an issue with detailed error information

πŸ“„ License

This project is available under the MIT License. See LICENSE file for details.


Version: 2.0
Last Updated: 2024
Compatibility: Python 3.8+, All major operating systems

About

This Machine Learning Model can classify Phishing and Safe emails

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors