Commit 83f6ec2

Merge pull request #1459 from Niraj1608/main
[Feature]: Enhance and merge email spam detection notebook with EDA and NLP improvements #1455
2 parents abc7cd1 + 65e7300 commit 83f6ec2

3 files changed: +103605 -20 lines changed

Lines changed: 45 additions & 20 deletions
@@ -1,35 +1,60 @@
-
 # Email Spam Detection
 
-This project aims to detect unwanted emails (spam) from a user's inbox. Email spam detection involves using various techniques to analyze the content, metadata, and patterns of emails to distinguish between spam and legitimate messages.
+This project aims to detect unwanted emails (spam) from a user's inbox. It uses a combination of **Natural Language Processing (NLP)** techniques and machine learning algorithms to analyze the content, metadata, and patterns of emails, distinguishing between spam and legitimate messages.
+
 ## Goal
 
-The goal of email spam detection is to identify and filter out unsolicited emails (spam) from a user's inbox, ensuring that only legitimate and important emails are delivered. This helps in reducing clutter, protecting users from potential phishing attacks, malware, and other malicious activities commonly associated with spam emails.
+The goal is to identify and filter out unsolicited emails (spam) from a user's inbox, ensuring only legitimate and important emails are delivered. This reduces clutter and protects users from phishing attacks, malware, and other malicious activities.
+
 ## Methodology
 
-Utilizing a combination of EDA techniques and machine learning algorithms, we have meticulously analyzed data to discern patterns and correlations associated with email . Key steps include data cleaning, feature engineering, and insightful visualization to extract meaningful insights.
-## Data Preprocessing
+Using a mix of **Exploratory Data Analysis (EDA)** and **NLP**, the data is analyzed to identify patterns and correlations. Key steps include:
+- Data cleaning
+- Text preprocessing through NLP
+- Feature engineering and insightful visualizations
+
+## Data Preprocessing
 
-Data preprocessing steps include:
+Steps involved:
 1. Stop words removal
-2. Lemmatization/Stemming
-3. Vectorization using TF-IDF
+2. Lemmatization/Stemming (NLP)
+3. Vectorization using **TF-IDF** (NLP)
+4. Tokenization for breaking down email content
+
+## NLP Techniques Used
+
+- **Tokenization**: Breaking down email texts into words
+- **Stemming/Lemmatization**: Reducing words to their base form
+- **TF-IDF**: Transforming text into numerical values based on importance
+- **Bag of Words**: Converting text into features for model building
+
 ## Models Utilized
 
-1. Logistic Regression
-2. Random Forest Regressor
-3. Multinomial Naive Bayes
+1. **Logistic Regression**
+2. **Random Forest Regressor**
+3. **Multinomial Naive Bayes**
+
 ## Libraries Used
 
-1. numpy: For efficient numerical operations
-2. pandas: For data manipulation and analysis
-3. seaborn: For visually appealing statistical graphics
-4. matplotlib: For comprehensive data visualization
-5. Sklearn: For implementing machine learning algorithms
+1. **numpy**: Numerical operations
+2. **pandas**: Data manipulation and analysis
+3. **nltk**: NLP toolkit for text processing
+4. **seaborn**: Statistical visualizations
+5. **matplotlib**: Data visualization
+6. **sklearn**: Machine learning algorithms
+
 ## Results
-1. Logistic Regression : 94%
-2. Random Forest Regressor : 89%
-3. Multinomial Naive Bayes : 96%
+
+1. **Logistic Regression**: 98% accuracy
+2. **Multinomial Naive Bayes**: 97% accuracy
+
+## Enhanced Features
+
+Additional NLP-based features include:
+- Word frequency analysis using **nltk** and **collections.Counter**
+- Character, word, and sentence count analysis
+- Heatmap and distribution plots for word frequency and relationships
 
 ## Conclusion
-Through rigorous analysis and experimentation, it has been determined that Multinomial Naive Bayes model exhibit the highest predictive accuracy for Email Spam detection.
+
+Through rigorous analysis and the use of advanced **NLP** techniques, the **Multinomial Naive Bayes** model yielded the highest predictive accuracy for email spam detection.
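
The preprocessing steps the README names (tokenization, stop words removal, TF-IDF vectorization) can be sketched in plain Python. This is a minimal illustration and not the notebook's code: the stop-word list, the helper names `tokenize`, `preprocess`, and `tf_idf`, and the sample emails are all assumptions made for the example, stemming/lemmatization is left out, and the unsmoothed weighting `tf * ln(N / df)` will not match the numbers produced by a library vectorizer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "at", "of", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize, then drop stop words (stemming/lemmatization omitted here)."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses tf = count / len(doc) and idf = ln(N / df), with no smoothing
    or normalization.
    """
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample emails, invented for this sketch.
emails = [
    "Win a free prize now",
    "Meeting at noon for the project",
    "Claim your free prize",
]
vectors = tf_idf(emails)
```

Terms that occur in fewer documents receive larger weights, so in the first sample email "win" outweighs "free", which also appears in the third.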

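
Multinomial Naive Bayes, one of the models listed in the README, can be sketched from scratch over bag-of-words counts. The class below is a hedged illustration only: `TinyMultinomialNB`, its `alpha` smoothing parameter, and the four training texts are hypothetical and are not taken from the notebook, which lists sklearn for its machine learning algorithms.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumed default)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [tok for tok in tokenize(text) if tok in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_label / n_docs)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set, invented for this sketch.
train_texts = [
    "win a free prize now",
    "free cash claim now",
    "meeting at noon tomorrow",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
prediction = model.predict("claim your free prize")
```

Working in log space avoids underflow when many word probabilities are multiplied, and the smoothing term keeps a word seen in only one class from driving the other class's probability to zero.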
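
The "Enhanced Features" listed in the diff (word frequency analysis with `collections.Counter`, plus character, word, and sentence counts) might look roughly like the following. This is a sketch under assumptions: the regex-based word and sentence splitting stands in for nltk's tokenizers, and the function and key names (`text_stats`, `top_words`, `num_characters`, and so on) are hypothetical.

```python
import re
from collections import Counter

WORD_PATTERN = r"[A-Za-z']+"


def text_stats(text):
    """Return character, word, and sentence counts for one email body."""
    words = re.findall(WORD_PATTERN, text)
    # Naive sentence split on ., ! and ? (a stand-in for a real sentence tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }


def top_words(texts, n=3):
    """Return the n most frequent lowercase words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in re.findall(WORD_PATTERN, text))
    return counts.most_common(n)


stats = text_stats("Win a prize! Claim now.")
frequent = top_words(["free prize", "free cash", "free prize now"], n=2)
```

Counts like these can be added as extra columns next to the text and compared between the spam and legitimate classes during EDA.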