A comprehensive, machine learning-powered spam email detection system with advanced features, ensemble modeling, and explainable AI.
- Multi-Model Ensemble: Combines Bernoulli Naive Bayes, Logistic Regression, and SVM for robust predictions
- Advanced Text Processing: Intelligent preprocessing with URL/email extraction and spam-specific feature engineering
- Confidence Scoring: Provides prediction confidence and detailed explanations
- Real-time Analysis: Fast processing with comprehensive feature extraction
- Explainable AI: Human-readable explanations for each prediction
- Robust Error Handling: Comprehensive logging and graceful failure recovery
- Feature Engineering: 15+ spam-specific features including:
  - Text statistics (word count, character analysis)
  - URL and domain reputation checking
  - Spam keyword detection (urgency, money, suspicious, promotional)
  - HTML content analysis
  - Pattern recognition (repeated characters, excessive punctuation)
- Performance Monitoring: Processing time tracking and model performance logging
- Modern UI: Beautiful, responsive Streamlit interface with detailed analytics
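As a rough illustration of the text-statistics, keyword, and pattern features listed above, the sketch below computes a handful of them with the standard library only. The function name `extract_basic_features`, the keyword lists, and the feature names are assumptions made for this example; the project's own `extract_features()` may differ.

```python
import re
from typing import Dict

# Illustrative keyword groups (assumed); the project's real lists are not shown here.
SPAM_KEYWORDS = {
    "urgency": ["urgent", "immediately", "now"],
    "money": ["win", "cash", "prize"],
}

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_basic_features(text: str) -> Dict[str, float]:
    """Compute a few spam-oriented features from raw email text."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    letters = [c for c in text if c.isalpha()]
    features: Dict[str, float] = {
        "word_count": len(words),
        "char_count": len(text),
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        "url_count": len(URL_PATTERN.findall(text)),
        "has_html": float(bool(re.search(r"<[a-zA-Z][^>]*>", text))),
        # Runs of the same character three or more times, e.g. "!!!" or "000".
        "repeated_char_runs": len(re.findall(r"(.)\1{2,}", text)),
        "exclamation_count": text.count("!"),
    }
    # One count per keyword group, named "<group>_keywords".
    for group, keywords in SPAM_KEYWORDS.items():
        features[f"{group}_keywords"] = sum(lowered.count(k) for k in keywords)
    return features


features = extract_basic_features(
    "URGENT! Win $1,000,000 NOW!!! Click here immediately!"
)
print(features["urgency_keywords"], features["money_keywords"])
```

Keeping each feature as a named dictionary entry makes it easy to log or display individual values alongside a prediction.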
- Accuracy: ~96-97% on test data
- Precision: High precision to minimize false positives
- Recall: Optimized to catch most spam emails
- Processing Time: <1 second per email
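For reference, accuracy, precision, and recall are all derived from confusion-matrix counts. The sketch below shows the standard definitions; the counts in the example are made up for illustration and are not the project's results.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from true/false positive/negative counts (spam = positive)."""
    total = tp + fp + tn + fn
    return {
        # Share of all emails classified correctly.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of the emails flagged as spam, how many really were spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam emails, how many were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=5, tn=100, fn=5)
print({name: round(value, 3) for name, value in metrics.items()})
```

High precision corresponds to few legitimate emails being flagged; high recall corresponds to few spam emails slipping through.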
- Python 3.8 or higher
- pip package manager
- Clone the repository:
  git clone https://github.com/Naeem1144/spam-email-detection-system
  cd spam-email-detection-system
- Install dependencies:
  pip install -r requirements.txt
- Download NLTK data (automatic on first run):
  import nltk
  nltk.download('punkt')
  nltk.download('stopwords')
- Ensure model file exists:
  - The system expects Bernoulli_model_for_email.pkl in the project directory
  - This file should contain a trained scikit-learn Pipeline
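Before starting the app, it can help to confirm that the model file is actually in place. The following is a minimal sketch using only the standard library; `load_pipeline` is a hypothetical helper written for this example, not a function of the project.

```python
import pickle
from pathlib import Path

MODEL_FILENAME = "Bernoulli_model_for_email.pkl"


def load_pipeline(model_path: str = MODEL_FILENAME):
    """Load a pickled model, failing early with a readable message."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Model file not found: {path.resolve()}. "
            "Place the trained pipeline in the project directory."
        )
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with path.open("rb") as fh:
        return pickle.load(fh)
```

Calling `load_pipeline()` from the project directory either returns the unpickled object or raises a `FileNotFoundError` that names the full path that was checked.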
Run the Streamlit web application:
streamlit run spam_detector.py
The application will open in your browser with:
- Email Analysis: Paste email content for instant spam detection
- Feature Visualization: Real-time feature extraction and analysis
- Detailed Explanations: AI-powered explanations for each prediction
- Model Insights: Individual model predictions and confidence scores
from spam_detector import SpamDetectorEnsemble
# Initialize the detector
detector = SpamDetectorEnsemble()
# Analyze an email
email_text = "Your email content here..."
result = detector.predict(email_text)
# Access results
print(f"Is Spam: {result.is_spam}")
print(f"Confidence: {result.confidence:.2%}")
print(f"Spam Probability: {result.spam_probability:.2%}")
print("Explanations:")
for explanation in result.explanation:
    print(f" - {explanation}")
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Input Email   │───▶│  Text Processor  │───▶│  Feature Vector │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Explanations   │◀───│  Ensemble Models │◀───│   Features +    │
└─────────────────┘    └──────────────────┘    │  Cleaned Text   │
                                               └─────────────────┘
- Purpose: Advanced text preprocessing and feature extraction
- Key Methods:
  - extract_features(): Extracts 15+ spam-specific features
  - clean_text(): Advanced text cleaning for ML models
  - _check_suspicious_domains(): URL reputation analysis
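The internals of `_check_suspicious_domains()` are not shown in this document. As a hedged stand-in, the sketch below flags URLs whose host is a raw IP address, a known shortener, or ends in a listed top-level domain. The function name `find_suspicious_domains` and both blocklists are assumptions for illustration.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical blocklist entries; not the project's actual lists.
SUSPICIOUS_TLDS = {".xyz", ".top", ".click"}
URL_SHORTENERS = {"bit.ly", "tinyurl.com"}

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
IPV4_PATTERN = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def find_suspicious_domains(text: str) -> List[str]:
    """Return hosts from URLs in the text that match simple suspicion rules."""
    suspicious = []
    for url in URL_PATTERN.findall(text):
        # Drop sentence punctuation that trails the URL.
        url = url.rstrip(".,;:!?)")
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        if not host:
            continue
        is_ip = IPV4_PATTERN.fullmatch(host) is not None
        has_bad_tld = any(host.endswith(tld) for tld in SUSPICIOUS_TLDS)
        if is_ip or host in URL_SHORTENERS or has_bad_tld:
            suspicious.append(host)
    return suspicious


print(find_suspicious_domains(
    "Claim it at http://192.168.0.1/win or https://example.com/about"
))
```

A static blocklist like this is only a heuristic; it keeps all processing local, at the cost of missing domains that are not on the list.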
- Purpose: Main prediction engine with ensemble modeling
- Key Methods:
  - predict(): Complete email analysis with explanations
  - load_models(): Model initialization and ensemble creation
  - _generate_explanation(): Human-readable prediction explanations
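This document does not say how the individual model outputs are combined. One common option is soft voting, which averages each model's spam probability. The sketch below assumes that approach; the `soft_vote` function, the model names, and the probabilities are illustrative only. An SVM also needs probability calibration before it can take part in soft voting.

```python
from typing import Dict, Tuple


def soft_vote(
    spam_probabilities: Dict[str, float], threshold: float = 0.5
) -> Tuple[bool, float, float]:
    """Average per-model spam probabilities into one decision.

    Returns (is_spam, spam_probability, confidence).
    """
    if not spam_probabilities:
        raise ValueError("at least one model probability is required")
    spam_probability = sum(spam_probabilities.values()) / len(spam_probabilities)
    is_spam = spam_probability >= threshold
    # Confidence here is the averaged probability of whichever class was chosen.
    confidence = spam_probability if is_spam else 1.0 - spam_probability
    return is_spam, spam_probability, confidence


# Made-up per-model outputs for a single email.
votes = {"bernoulli_nb": 0.92, "logistic_regression": 0.85, "svm": 0.78}
is_spam, spam_probability, confidence = soft_vote(votes)
print(f"Is Spam: {is_spam}, Spam Probability: {spam_probability:.2%}")
```

Averaging probabilities lets a confident model outweigh two lukewarm ones, which hard majority voting would not allow.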
- Purpose: Structured container for prediction results
- Attributes: confidence, spam_probability, features_used, explanations, etc.
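A structured result container of this kind is often written as a dataclass. The sketch below uses the attribute names listed above plus `is_spam` from the usage example; the class name `DetectionResult`, the field types, and the `summary()` helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DetectionResult:
    """Hypothetical container mirroring the attributes listed above."""

    is_spam: bool
    confidence: float
    spam_probability: float
    # default_factory gives each instance its own dict/list.
    features_used: Dict[str, float] = field(default_factory=dict)
    explanations: List[str] = field(default_factory=list)

    def summary(self) -> str:
        label = "spam" if self.is_spam else "not spam"
        return f"{label} ({self.confidence:.0%} confidence)"


result = DetectionResult(
    is_spam=True,
    confidence=0.85,
    spam_probability=0.85,
    explanations=["Contains urgency keywords"],
)
print(result.summary())
```

Using a dataclass instead of a plain dictionary gives callers attribute access and makes a mistyped field name fail immediately.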
The system can be configured by modifying class parameters:
# Custom model path
detector = SpamDetectorEnsemble(model_path="path/to/your/model.pkl")
# Custom spam keywords
processor = AdvancedTextProcessor()
processor.spam_keywords['custom'] = ['your', 'keywords', 'here']
Logs are written to spam_detector.log and the console. Adjust the logging level:
import logging
logging.getLogger('spam_detector').setLevel(logging.DEBUG)
Use the provided test cases:
# Test cases
test_spam = "URGENT! Win $1,000,000 NOW! Click here immediately!"
test_legitimate = "Hi John, let's schedule our meeting for tomorrow at 2 PM."
detector = SpamDetectorEnsemble()
result_spam = detector.predict(test_spam)
result_legit = detector.predict(test_legitimate)
Verify feature extraction:
processor = AdvancedTextProcessor()
features = processor.extract_features(email_text)
print(f"Features extracted: {len(features)}")
print(f"Spam keywords found: {sum(features[k] for k in features if 'keywords' in k)}")
- ✅ Proper Documentation: Comprehensive docstrings and type hints
- ✅ Error Handling: Robust exception handling and logging
- ✅ Code Structure: Object-oriented design with clear separation of concerns
- ✅ Input Validation: Comprehensive input sanitization and validation
- ✅ Advanced Preprocessing: URL/email extraction, HTML handling
- ✅ Spam-Specific Features: 15+ engineered features for better detection
- ✅ Domain Reputation: Suspicious domain detection
- ✅ Pattern Recognition: Advanced text pattern analysis
- ✅ Ensemble Voting: Multiple models for robust predictions
- ✅ Confidence Scoring: Probabilistic outputs with confidence measures
- ✅ Explainable AI: Detailed explanations for each prediction
- ✅ Performance Tracking: Processing time and model performance monitoring
- ✅ Modern UI: Beautiful, responsive Streamlit interface
- ✅ Real-time Features: Live feature extraction and visualization
- ✅ Detailed Analytics: Comprehensive analysis dashboard
- ✅ Performance Metrics: Processing time and accuracy information
- Local Processing: All analysis performed locally, no data sent to external services
- Input Sanitization: Comprehensive input validation and cleaning
- Error Handling: Graceful failure modes to prevent system compromise
- Logging: Audit trail for all predictions and system events
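An audit trail written to both a file and the console can be set up with the standard `logging` module. The sketch below reuses the `spam_detector` logger name and the `spam_detector.log` file name from this document; the `configure_logging` helper and the log format are assumptions for illustration.

```python
import logging


def configure_logging(
    log_file: str = "spam_detector.log", level: int = logging.INFO
) -> logging.Logger:
    """Attach a file handler and a console handler to the 'spam_detector' logger."""
    logger = logging.getLogger("spam_detector")
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        formatter = logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"
        )
        handlers = (
            logging.FileHandler(log_file, encoding="utf-8"),
            logging.StreamHandler(),
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


logger = configure_logging()
# Log the outcome rather than the email body to keep message content out of the logs.
logger.info("prediction=%s confidence=%.2f", "spam", 0.85)
```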
- Model File Not Found:
  FileNotFoundError: Bernoulli_model_for_email.pkl
  Solution: Ensure the trained model file is in the project directory
- NLTK Data Missing:
  LookupError: Resource punkt not found
  Solution: Run nltk.download('punkt') and nltk.download('stopwords')
- Memory Issues:
  - Reduce batch processing size
  - Consider model quantization for large deployments
- CPU Usage: The system is optimized for single-core performance
- Memory Usage: ~100MB RAM for typical usage
- Disk Usage: ~10MB for models and dependencies
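To check processing time on your own hardware, a small timing wrapper can be placed around any prediction call. In the sketch below, `timed` is a hypothetical helper and `fake_predict` is a stand-in for `detector.predict`, so the example runs without the trained model.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_predict(text: str) -> bool:
    """Stand-in for detector.predict, used only for this example."""
    return "win" in text.lower()


result, elapsed = timed(fake_predict, "Win a prize now")
print(f"result={result}, elapsed={elapsed:.6f}s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations.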
For issues, feature requests, or contributions:
- Check the troubleshooting section
- Review the logs in spam_detector.log
- Create an issue with detailed error information
This project is available under the MIT License. See LICENSE file for details.
Version: 2.0
Last Updated: 2024
Compatibility: Python 3.8+, All major operating systems