email-classifier

Mini-project for SC1015 - Introduction to Data Science and Artificial Intelligence

Our motivation

The number of scam victims is increasing. To combat this, we aim to use AI/ML to help people flag spam emails, which may contain links to phishing scams or ransom scams.

Preparing the Data

To train the model, we used a dataset from Kaggle containing unstructured email text, with each email labeled scam or not scam.
To make sense of the unstructured text, we had to explore methods to "clean" it for each of the steps below.

  • Exploratory Data Analysis : NLP Pre-processing Techniques ("Tokenization", Lemmatization, Removing Stopwords)
  • Training Machine Learning Models : NLP Techniques and Vectorization
We did this by writing our own "Tokenization" function, which included lowercasing and removing punctuation, and by using the NLTK library; a rough sketch of such a cleaning function is shown below.
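As an illustration only (the function name `clean_text` and the exact steps are ours, not necessarily the notebook's), a minimal cleaning function along these lines could look like this, assuming the NLTK corpora (punkt, stopwords, wordnet) have been downloaded:

```python
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase and strip punctuation before tokenizing
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    # Drop stopwords, then lemmatize what remains
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

print(clean_text("Please click the link below to claim your prizes!"))
```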

We also counted the number of times each word appeared in each category -- Spam and Not Spam.
We found anomalously high counts of certain words, so we added them to our stoplist and refined the "cleaned" data.
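For illustration, per-category word counts can be tallied along these lines; the variable names (`cleaned_emails`, `labels`) and toy data are placeholders, not the notebook's actual code:

```python
from collections import Counter

# Placeholder tokenized data; in the project these come from the cleaning step above
cleaned_emails = [["free", "prize", "click"], ["meeting", "tomorrow", "agenda"]]
labels = ["spam", "not spam"]

spam_counts, ham_counts = Counter(), Counter()
for tokens, label in zip(cleaned_emails, labels):
    (spam_counts if label == "spam" else ham_counts).update(tokens)

# Inspect the most frequent words in each category to spot anomalies for the stoplist
print(spam_counts.most_common(20))
print(ham_counts.most_common(20))
```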

Exploratory Data Analysis

Using the cleaned data, we compared the most frequently appearing words in each category.

Similarities

We found that [“com”, “please”, and integers] commonly appeared in both categories, and hypothesized that they do not tell us much about the type of an email.
Using a correlation matrix, we found our hypothesis to be true: the appearance of these words had low correlation with either category.
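A hypothetical version of this check: turn the presence of each common word into a binary column and correlate it with the spam label. The DataFrame contents and column names below are made up for illustration.

```python
import pandas as pd

# Toy data: token lists per email plus a binary spam label
df = pd.DataFrame({
    "tokens": [["com", "please", "prize"], ["com", "meeting"], ["please", "agenda"]],
    "is_spam": [1, 0, 0],
})

# One indicator column per word of interest
for word in ["com", "please"]:
    df[f"has_{word}"] = df["tokens"].apply(lambda toks: int(word in toks))

# A low |correlation| with is_spam suggests the word says little about the class
print(df[["has_com", "has_please", "is_spam"]].corr())
```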

Differences

We found 2 groups of words that are uniquely found in the frequency list of Spam. We highlight them in the notebook as they relate to each other to form meanings.

Using Machine Learning (ML) to solve our problem:

Our problem is a classification problem. We explored 4 classification models beyond those covered in the course and compared their accuracies:

  1. Logistic Regression - 98.3%
  2. Naive Bayes Classification - 91.1%
  3. Support Vector Machine - 99.2%
  4. Random Forest Classification - 98.2%
Comparing the models, we explain in our video presentation why SVM was the most accurate model.
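The comparison could be set up roughly as follows with scikit-learn and TF-IDF features; the vectorizer settings, hyperparameters, and toy corpus here are assumptions, not the notebook's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder corpus; the real project uses the cleaned Kaggle emails
texts = [
    "win a free prize now", "urgent claim your reward", "click this link for cash",
    "meeting agenda attached", "lunch tomorrow at noon", "project report draft",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=2, stratify=labels, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```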

Closing Points

Did we solve our problem?

Support Vector Machine is our model of choice and it provides remarkable accuracy in detecting spam, which effectively solves our problem.

Going forward...

The observations from our Exploratory Data Analysis show that certain keywords appear more often in spam emails. On top of the already effective model, we would advise users to watch out for such keywords even when an email is not flagged as spam. This can help cover the remaining 0.8%.

Contributions:
Gaoyuan: Part 1, Part 2, Part 3.1
Ping Wee: Part 3.2 - 3.4, Part 4
