🛡️ Advanced Spam Email Detection System

A comprehensive, machine learning-powered spam email detection system with advanced features, ensemble modeling, and explainable AI.

🚀 Features

Core Capabilities

Multi-Model Ensemble: Combines Bernoulli Naive Bayes, Logistic Regression, and SVM for robust predictions
Advanced Text Processing: Intelligent preprocessing with URL/email extraction and spam-specific feature engineering
Confidence Scoring: Provides prediction confidence and detailed explanations
Real-time Analysis: Fast processing with comprehensive feature extraction
Explainable AI: Human-readable explanations for each prediction

Technical Features

Robust Error Handling: Comprehensive logging and graceful failure recovery
Feature Engineering: 15+ spam-specific features including:
- Text statistics (word count, character analysis)
- URL and domain reputation checking
- Spam keyword detection (urgency, money, suspicious, promotional)
- HTML content analysis
- Pattern recognition (repeated characters, excessive punctuation)
Performance Monitoring: Processing time tracking and model performance logging
Modern UI: Beautiful, responsive Streamlit interface with detailed analytics

📊 Model Performance

Accuracy: ~96-97% on test data
Precision: High precision to minimize false positives
Recall: Optimized to catch most spam emails
Processing Time: <1 second per email

🛠️ Installation

Prerequisites

Python 3.8 or higher
pip package manager

Setup Instructions

Clone the repository:

git clone https://github.com/Naeem1144/spam-email-detection-system
cd spam-email-detection-system

Install dependencies:
```
pip install -r requirements.txt
```

Download NLTK data (automatic on first run):

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Ensure model file exists:
- The system expects Bernoulli_model_for_email.pkl in the project directory
- This file should contain a trained scikit-learn Pipeline

🚀 Usage

Web Application (Recommended)

Run the Streamlit web application:

streamlit run spam_detector.py

The application will open in your browser with:

Email Analysis: Paste email content for instant spam detection
Feature Visualization: Real-time feature extraction and analysis
Detailed Explanations: AI-powered explanations for each prediction
Model Insights: Individual model predictions and confidence scores

Programmatic Usage

from spam_detector import SpamDetectorEnsemble

# Initialize the detector
detector = SpamDetectorEnsemble()

# Analyze an email
email_text = "Your email content here..."
result = detector.predict(email_text)

# Access results
print(f"Is Spam: {result.is_spam}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Spam Probability: {result.spam_probability:.2%}")
print("Explanations:")
for explanation in result.explanation:
    print(f"  - {explanation}")

📈 System Architecture

Component Overview

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Input Email   │───▶│  Text Processor  │───▶│  Feature Vector │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Explanations  │◀───│ Ensemble Models  │◀───│   Features +    │
└─────────────────┘    └──────────────────┘    │  Cleaned Text   │
                                                └─────────────────┘

Classes and Methods

`AdvancedTextProcessor`

Purpose: Advanced text preprocessing and feature extraction
Key Methods:
- extract_features(): Extracts 15+ spam-specific features
- clean_text(): Advanced text cleaning for ML models
- _check_suspicious_domains(): URL reputation analysis

`SpamDetectorEnsemble`

Purpose: Main prediction engine with ensemble modeling
Key Methods:
- predict(): Complete email analysis with explanations
- load_models(): Model initialization and ensemble creation
- _generate_explanation(): Human-readable prediction explanations

`PredictionResult` (DataClass)

Purpose: Structured container for prediction results
Attributes: confidence, spam_probability, features_used, explanations, etc.

🔧 Configuration

Model Configuration

The system can be configured by modifying class parameters:

# Custom model path
detector = SpamDetectorEnsemble(model_path="path/to/your/model.pkl")

# Custom spam keywords
processor = AdvancedTextProcessor()
processor.spam_keywords['custom'] = ['your', 'keywords', 'here']

Logging Configuration

Logs are written to spam_detector.log and console. Adjust logging level:

import logging
logging.getLogger('spam_detector').setLevel(logging.DEBUG)

🧪 Testing

Manual Testing

Use the provided test cases:

# Test cases
test_spam = "URGENT! Win $1,000,000 NOW! Click here immediately!"
test_legitimate = "Hi John, let's schedule our meeting for tomorrow at 2 PM."

detector = SpamDetectorEnsemble()
result_spam = detector.predict(test_spam)
result_legit = detector.predict(test_legitimate)

Feature Validation

Verify feature extraction:

processor = AdvancedTextProcessor()
features = processor.extract_features(email_text)
print(f"Features extracted: {len(features)}")
print(f"Spam keywords found: {sum(features[k] for k in features if 'keywords' in k)}")

📝 Improvements Over Previous Version

Code Quality

✅ Proper Documentation: Comprehensive docstrings and type hints
✅ Error Handling: Robust exception handling and logging
✅ Code Structure: Object-oriented design with clear separation of concerns
✅ Input Validation: Comprehensive input sanitization and validation

Feature Engineering

✅ Advanced Preprocessing: URL/email extraction, HTML handling
✅ Spam-Specific Features: 15+ engineered features for better detection
✅ Domain Reputation: Suspicious domain detection
✅ Pattern Recognition: Advanced text pattern analysis

Model Enhancements

✅ Ensemble Voting: Multiple models for robust predictions
✅ Confidence Scoring: Probabilistic outputs with confidence measures
✅ Explainable AI: Detailed explanations for each prediction
✅ Performance Tracking: Processing time and model performance monitoring

User Experience

✅ Modern UI: Beautiful, responsive Streamlit interface
✅ Real-time Features: Live feature extraction and visualization
✅ Detailed Analytics: Comprehensive analysis dashboard
✅ Performance Metrics: Processing time and accuracy information

🔒 Security Considerations

Local Processing: All analysis performed locally, no data sent to external services
Input Sanitization: Comprehensive input validation and cleaning
Error Handling: Graceful failure modes to prevent system compromise
Logging: Audit trail for all predictions and system events

🐛 Troubleshooting

Common Issues

Model File Not Found:
```
FileNotFoundError: Bernoulli_model_for_email.pkl
```
Solution: Ensure the trained model file is in the project directory
NLTK Data Missing:
```
LookupError: Resource punkt not found
```
Solution: Run nltk.download('punkt') and nltk.download('stopwords')
Memory Issues:
- Reduce batch processing size
- Consider model quantization for large deployments

Performance Optimization

CPU Usage: The system is optimized for single-core performance
Memory Usage: ~100MB RAM for typical usage
Disk Usage: ~10MB for models and dependencies

📞 Support

For issues, feature requests, or contributions:

Check the troubleshooting section
Review the logs in spam_detector.log
Create an issue with detailed error information

📄 License

This project is available under the MIT License. See LICENSE file for details.

Version: 2.0
Last Updated: 2024
Compatibility: Python 3.8+, All major operating systems

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
Bernoulli_model_for_email.pkl		Bernoulli_model_for_email.pkl
Email_classification.ipynb		Email_classification.ipynb
IMPROVEMENTS_SUMMARY.md		IMPROVEMENTS_SUMMARY.md
README.md		README.md
USAGE_GUIDE.md		USAGE_GUIDE.md
requirements.txt		requirements.txt
spam_detector.py		spam_detector.py
test_spam_detector.py		test_spam_detector.py

Folders and files

Latest commit

History

Repository files navigation

🛡️ Advanced Spam Email Detection System

🚀 Features

Core Capabilities

Technical Features

📊 Model Performance

🛠️ Installation

Prerequisites

Setup Instructions

🚀 Usage

Web Application (Recommended)

Programmatic Usage

📈 System Architecture

Component Overview

Classes and Methods

AdvancedTextProcessor

SpamDetectorEnsemble

PredictionResult (DataClass)

🔧 Configuration

Model Configuration

Logging Configuration

🧪 Testing

Manual Testing

Feature Validation

📝 Improvements Over Previous Version

Code Quality

Feature Engineering

Model Enhancements

User Experience

🔒 Security Considerations

🐛 Troubleshooting

Common Issues

Performance Optimization

📞 Support

📄 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`AdvancedTextProcessor`

`SpamDetectorEnsemble`

`PredictionResult` (DataClass)

Packages