📖 Usage Guide: Advanced Spam Email Detection System

This guide provides step-by-step instructions for setting up and using the Advanced Spam Email Detection System.

🚀 Quick Start

1. Installation

# Clone the repository
git clone <your-repo-url>
cd spam-email-detector

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import streamlit, nltk, sklearn; print('✅ All dependencies installed')"

2. First Run

# Start the web application
streamlit run spam_detector.py

The application will automatically:

Download required NLTK data on first run
Load the trained model
Open your browser to the application interface

🎯 Using the Web Interface

Main Features

📧 Email Analysis Panel:
- Paste your email content in the text area
- Click "🔍 Analyze Email" to get instant results
- View confidence scores and detailed explanations
📊 Feature Visualization Panel:
- Real-time feature extraction as you type
- Text statistics (word count, character analysis)
- Spam indicator metrics
- URL and domain analysis
📋 Results Display:
- Clear SPAM/SAFE classification
- Confidence percentage
- Detailed explanation with specific indicators
- Individual model predictions (when available)

Example Usage

Testing with Spam Email:

Input: "URGENT! Win $1,000,000 NOW! Click http://scam.tk"
Expected: 🚨 SPAM DETECTED with high confidence

Testing with Legitimate Email:

Input: "Hi John, let's schedule our meeting for tomorrow."
Expected: ✅ LEGITIMATE EMAIL with high confidence

💻 Programmatic Usage

Basic Usage

from spam_detector import SpamDetectorEnsemble

# Initialize the detector
detector = SpamDetectorEnsemble()

# Analyze an email
email_text = """
Subject: Meeting Tomorrow

Hi Sarah,

Just confirming our meeting tomorrow at 2 PM in the conference room.
Let me know if you need to reschedule.

Best,
John
"""

result = detector.predict(email_text)

# Print results
print(f"Spam: {result.is_spam}")
print(f"Confidence: {result.confidence:.1%}")
print(f"Probability: {result.spam_probability:.1%}")
print("\nExplanations:")
for explanation in result.explanation:
    print(f"  - {explanation}")

Advanced Usage

from spam_detector import AdvancedTextProcessor, SpamDetectorEnsemble

# Custom text processor
processor = AdvancedTextProcessor()

# Add custom spam keywords
processor.spam_keywords['custom'] = ['phishing', 'malware', 'trojan']

# Extract features manually
features = processor.extract_features(email_text)
print(f"Word count: {features['word_count']}")
print(f"URLs found: {features['url_count']}")
print(f"Spam keywords: {features['urgency_keywords'] + features['money_keywords']}")

# Clean text for analysis
cleaned_text = processor.clean_text(email_text)
print(f"Cleaned text: {cleaned_text}")

# Initialize detector with custom model path
detector = SpamDetectorEnsemble(model_path="path/to/your/model.pkl")

Batch Processing

def analyze_emails_batch(email_list):
    """Analyze multiple emails efficiently"""
    detector = SpamDetectorEnsemble()
    results = []
    
    for i, email in enumerate(email_list):
        try:
            result = detector.predict(email)
            results.append({
                'index': i,
                'is_spam': result.is_spam,
                'confidence': result.confidence,
                'spam_probability': result.spam_probability,
                'processing_time': result.processing_time
            })
        except Exception as e:
            results.append({
                'index': i,
                'error': str(e),
                'is_spam': False,
                'confidence': 0.0
            })
    
    return results

# Example usage
emails = [
    "Hi John, let's meet tomorrow",
    "URGENT! Free money! Click now!",
    "Your order #12345 has shipped"
]

results = analyze_emails_batch(emails)
for result in results:
    print(f"Email {result['index']}: {'SPAM' if result['is_spam'] else 'SAFE'}")

🧪 Testing the System

1. Unit Tests

# Run all tests
python test_spam_detector.py

# Run with verbose output
python test_spam_detector.py -v

# Run manual tests with sample data
python test_spam_detector.py --manual

2. Interactive Testing

Use the provided test samples:

# Test with obvious spam
spam_text = """
URGENT! CONGRATULATIONS! You have won $1,000,000!!!
Click here immediately: http://phishing-site.tk/claim
Verify your account NOW or lose your prize forever!
FREE MONEY! Act now! Limited time offer!
Contact winner@scam.com for details
"""

# Test with legitimate email
legit_text = """
Subject: Quarterly Meeting Agenda

Dear Team,

Please find attached the agenda for our quarterly review meeting
scheduled for next Tuesday at 10 AM in Conference Room B.

Items to discuss:
1. Q3 performance review
2. Q4 planning
3. Budget allocation

Please review the documents beforehand and come prepared with
your department updates.

Best regards,
Sarah Johnson
Project Manager
"""

detector = SpamDetectorEnsemble()
spam_result = detector.predict(spam_text)
legit_result = detector.predict(legit_text)

print("Spam Analysis:")
print(f"  Result: {'SPAM' if spam_result.is_spam else 'SAFE'}")
print(f"  Confidence: {spam_result.confidence:.1%}")

print("\nLegitimate Email Analysis:")
print(f"  Result: {'SPAM' if legit_result.is_spam else 'SAFE'}")
print(f"  Confidence: {legit_result.confidence:.1%}")

🔧 Configuration

1. Model Configuration

# Use custom model path
detector = SpamDetectorEnsemble(model_path="/path/to/custom/model.pkl")

# The model should be a scikit-learn Pipeline with:
# - TfidfVectorizer as first step
# - Classifier (BernoulliNB, etc.) as second step

2. Logging Configuration

import logging

# Set logging level
logging.getLogger('spam_detector').setLevel(logging.DEBUG)

# Add custom handler
handler = logging.FileHandler('custom_spam_logs.log')
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logging.getLogger('spam_detector').addHandler(handler)

3. Feature Engineering

processor = AdvancedTextProcessor()

# Customize spam keywords
processor.spam_keywords.update({
    'phishing': ['verify', 'suspend', 'security', 'update', 'confirm'],
    'scam': ['inheritance', 'lottery', 'beneficiary', 'transfer'],
    'tech_support': ['virus', 'infected', 'support', 'fix', 'repair']
})

# Add suspicious domains
suspicious_domains = ['.tk', '.ml', '.ga', '.cf', '.click']
# This would require modifying the _check_suspicious_domains method

📊 Understanding Results

Confidence Levels

90-100%: Very high confidence in classification
80-89%: High confidence
60-79%: Moderate confidence
40-59%: Low confidence (review recommended)
Below 40%: Very low confidence (manual review needed)

Spam Probability Interpretation

80-100%: Almost certainly spam
60-79%: Likely spam
40-59%: Uncertain (could be either)
20-39%: Likely legitimate
0-19%: Almost certainly legitimate

Common Spam Indicators

The system looks for these patterns:

Urgency words: urgent, immediate, act now, expires
Money words: free, cash, prize, win, lottery
Suspicious phrases: click here, verify account, suspend
Excessive punctuation: Multiple exclamation marks
Suspicious URLs: Short domains, suspicious TLDs
HTML content: Rich formatting in emails
Character patterns: All caps, repeated characters

🚨 Troubleshooting

Common Issues

Model file not found:
```
FileNotFoundError: Bernoulli_model_for_email.pkl
```
Solution: Ensure the model file is in the same directory as the script

NLTK data missing:

LookupError: Resource punkt not found

Solution: The system will auto-download, or run manually:

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Import errors:

ImportError: No module named 'streamlit'

Solution: Install dependencies:

pip install -r requirements.txt

Performance issues:
- For large batch processing, consider processing in chunks
- Monitor memory usage with very long emails
- Check logs for processing time warnings

Performance Tips

For better accuracy:
- Include email headers when available
- Provide complete email content
- Use consistent text encoding
For faster processing:
- Pre-clean very long emails
- Cache the detector instance for multiple predictions
- Use batch processing for multiple emails
For debugging:
- Enable debug logging
- Check feature extraction results
- Review individual model predictions

🔒 Security Considerations

All processing is done locally
No data is sent to external services
Input sanitization prevents injection attacks
Error handling prevents system compromise
Logging provides audit trail

📈 Performance Monitoring

Monitor these metrics:

Processing time: Should be <1 second per email
Memory usage: Monitor for memory leaks in batch processing
Accuracy: Compare predictions with known spam/legitimate emails
Error rates: Check logs for prediction failures

🆘 Getting Help

Check the troubleshooting section above
Review logs in spam_detector.log
Test with simple examples first
Verify all dependencies are installed
Check model file integrity

Remember: This system is a tool to assist in spam detection. Always use human judgment for critical decisions, especially in business environments.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

📖 Usage Guide: Advanced Spam Email Detection System

🚀 Quick Start

1. Installation

2. First Run

🎯 Using the Web Interface

Main Features

Example Usage

💻 Programmatic Usage

Basic Usage

Advanced Usage

Batch Processing

🧪 Testing the System

1. Unit Tests

2. Interactive Testing

🔧 Configuration

1. Model Configuration

2. Logging Configuration

3. Feature Engineering

📊 Understanding Results

Confidence Levels

Spam Probability Interpretation

Common Spam Indicators

🚨 Troubleshooting

Common Issues

Performance Tips

🔒 Security Considerations

📈 Performance Monitoring

🆘 Getting Help

FilesExpand file tree

USAGE_GUIDE.md

Latest commit

History

USAGE_GUIDE.md

File metadata and controls

📖 Usage Guide: Advanced Spam Email Detection System

🚀 Quick Start

1. Installation

2. First Run

🎯 Using the Web Interface

Main Features

Example Usage

💻 Programmatic Usage

Basic Usage

Advanced Usage

Batch Processing

🧪 Testing the System

1. Unit Tests

2. Interactive Testing

🔧 Configuration

1. Model Configuration

2. Logging Configuration

3. Feature Engineering

📊 Understanding Results

Confidence Levels

Spam Probability Interpretation

Common Spam Indicators

🚨 Troubleshooting

Common Issues

Performance Tips

🔒 Security Considerations

📈 Performance Monitoring

🆘 Getting Help