65 changes: 45 additions & 20 deletions Detection Models/Email Spam Detection/README.md
@@ -1,35 +1,60 @@

# Email Spam Detection

This project aims to detect unwanted emails (spam) from a user's inbox. Email spam detection involves using various techniques to analyze the content, metadata, and patterns of emails to distinguish between spam and legitimate messages.
This project aims to detect unwanted emails (spam) from a user's inbox. It uses a combination of **Natural Language Processing (NLP)** techniques and machine learning algorithms to analyze the content, metadata, and patterns of emails, distinguishing between spam and legitimate messages.

## Goal

The goal of email spam detection is to identify and filter out unsolicited emails (spam) from a user's inbox, ensuring that only legitimate and important emails are delivered. This helps in reducing clutter, protecting users from potential phishing attacks, malware, and other malicious activities commonly associated with spam emails.
The goal is to identify and filter out unsolicited emails (spam) from a user's inbox, ensuring only legitimate and important emails are delivered. This reduces clutter and protects users from phishing attacks, malware, and other malicious activities.

## Methodology

Using a combination of EDA techniques and machine learning algorithms, we analyzed the data to identify patterns and correlations associated with email spam. Key steps include data cleaning, feature engineering, and visualization.
## Data Preprocessing
Using a mix of **Exploratory Data Analysis (EDA)** and **NLP**, the data is analyzed to identify patterns and correlations. Key steps include:
- Data cleaning
- Text preprocessing through NLP
- Feature engineering and insightful visualizations

## Data Preprocessing

Data preprocessing steps include:
Steps involved:
1. Stop words removal
2. Lemmatization/Stemming
3. Vectorization using TF-IDF
2. Lemmatization/Stemming (NLP)
3. Vectorization using **TF-IDF** (NLP)
4. Tokenization for breaking down email content
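
The steps above can be sketched in a few lines of dependency-free Python. This is only an illustration, not the project's actual code: the `STOP_WORDS` set and the `crude_stem` helper are hypothetical stand-ins for the stop-word list and stemmer/lemmatizer that **nltk** provides, and tokenization runs first here because the other steps operate on tokens.

```python
import re

# Tiny illustrative stop-word list (an assumption; nltk ships a much larger one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

# Suffixes removed by the naive stemmer below (a stand-in for a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


tokens = preprocess("You are winning a FREE prize, claim your rewards now!")
print(tokens)
```

The resulting token list is what a vectorizer such as TF-IDF would consume in the next step.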

## NLP Techniques Used

- **Tokenization**: Breaking down email texts into words
- **Stemming/Lemmatization**: Reducing words to their base form
- **TF-IDF**: Transforming text into numerical values based on importance
- **Bag of Words**: Converting text into features for model building
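
To make the last two techniques concrete, here is a minimal sketch of Bag of Words and TF-IDF using only the standard library. It uses the textbook weighting `tf * log(N / df)`; sklearn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The function names and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each token to the number of times it occurs in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = term frequency * log(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["free", "prize", "free"], ["meeting", "notes"], ["free", "meeting"]]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "prize" ends up with a higher weight than "free" even though "free" occurs more often, because "free" also appears in another document and is therefore less distinctive.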

## Models Utilized

1. Logistic Regression
2. Random Forest Regressor
3. Multinomial Naive Bayes
1. **Logistic Regression**
2. **Random Forest Regressor**
3. **Multinomial Naive Bayes**
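
For intuition about how Multinomial Naive Bayes classifies an email, the sketch below implements it from scratch with Laplace (add-one) smoothing. It is not the project's implementation, which presumably relies on **sklearn**; the class name `TinyMultinomialNB` and the four training messages are hypothetical.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.n_docs = len(docs)
        # Per-class token counts accumulated over all training documents.
        self.token_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        return self

    def _log_score(self, doc, c):
        """Log prior plus the sum of smoothed log likelihoods for class c."""
        total = sum(self.token_counts[c].values())
        score = math.log(self.class_counts[c] / self.n_docs)
        for tok in doc:
            if tok in self.vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (self.token_counts[c][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def predict(self, doc):
        """Return the class with the highest log score for a token list."""
        return max(self.classes, key=lambda c: self._log_score(doc, c))


# Hypothetical toy training set of already-tokenized messages.
train_docs = [
    ["free", "prize", "now"],
    ["win", "free", "cash"],
    ["meeting", "at", "noon"],
    ["project", "notes", "attached"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["free", "cash", "prize"]))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word that is absent from one class from forcing that class's probability to zero.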

## Libraries Used

1. numpy: For efficient numerical operations
2. pandas: For data manipulation and analysis
3. seaborn: For visually appealing statistical graphics
4. matplotlib: For comprehensive data visualization
5. Sklearn: For implementing machine learning algorithms
1. **numpy**: Numerical operations
2. **pandas**: Data manipulation and analysis
3. **nltk**: NLP toolkit for text processing
4. **seaborn**: Statistical visualizations
5. **matplotlib**: Data visualization
6. **sklearn**: Machine learning algorithms

## Results
1. Logistic Regression : 94%
2. Random Forest Regressor : 89%
3. Multinomial Naive Bayes : 96%

1. **Logistic Regression**: 98% accuracy
2. **Multinomial Naive Bayes**: 97% accuracy
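
As a reminder of what these percentages measure, accuracy is the fraction of predictions that match the true labels. The helper below is a minimal sketch with hypothetical names and made-up labels; it does not reproduce the figures above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only.
y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "spam"]
print(accuracy(y_true, y_pred))
```

Because spam datasets are often imbalanced, precision and recall on the spam class are worth checking alongside accuracy.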

## Enhanced Features

Additional NLP-based features include:
- Word frequency analysis using **nltk** and **collections.Counter**
- Character, word, and sentence count analysis
- Heatmap and distribution plots for word frequency and relationships
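
The count-based features and the word frequency analysis can be sketched as follows. This version uses regular expressions in place of **nltk** tokenizers so that it runs without extra downloads; the function names, the returned keys, and the sample messages are assumptions.

```python
import re
from collections import Counter


def text_stats(text):
    """Return character, word, and sentence counts for one message."""
    words = re.findall(r"[A-Za-z']+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }


def top_words(messages, n=3):
    """Return the n most frequent lowercase words across all messages."""
    counter = Counter()
    for msg in messages:
        counter.update(re.findall(r"[a-z']+", msg.lower()))
    return counter.most_common(n)


print(text_stats("Win cash now! Reply today."))
print(top_words(["Free prize", "free cash", "Free free"], n=1))
```

Running `top_words` separately on spam and legitimate messages gives the per-class frequency tables that the plots would be drawn from.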

## Conclusion
Through rigorous analysis and experimentation, it has been determined that the Multinomial Naive Bayes model exhibits the highest predictive accuracy for email spam detection.

Through rigorous analysis and the use of advanced **NLP** techniques, the **Multinomial Naive Bayes** model yielded the highest predictive accuracy for email spam detection.