Mini-project for SC1015 - Introduction to Data Science and Artificial Intelligence
The number of scam victims is increasing. To combat this, we aim to use AI/ML to help people flag spam emails, which may contain links to phishing or ransom scams.
To train the model, we used a dataset from Kaggle containing unstructured email text data, each labeled scam or not scam.
To make sense of the unstructured text, we had to explore methods to "clean" it for each of these stages:
- Exploratory Data Analysis: NLP Pre-processing Techniques (Tokenization, Lemmatization, Removing Stopwords)
- Training Machine Learning Models: NLP Techniques and Vectorization
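As a rough illustration of the pre-processing step, the sketch below tokenizes an email and removes stop words using only the Python standard library. The function name `clean_email`, the tiny `STOPWORDS` set, and the sample text are assumptions made for this example and are not taken from the notebook. Lemmatization is left out because it normally relies on an NLP library, and this summary does not say which one was used.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "to", "and", "is", "of", "in", "for", "you", "your"}

def clean_email(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into letter or digit runs, drop stop words."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]

# Made-up sample email, for illustration only.
sample = "Claim your FREE prize now! Visit www.example.com to win."
cleaned = clean_email(sample)
print(cleaned)
```

Note that a simple tokenizer like this splits a URL into pieces such as "www" and "com", which is one way a token like "com" can end up in the word counts.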
We also counted the number of words that appeared in each category -- Spam and Not Spam.
We found that certain words appeared anomalously often, so we added them to our stop-word list and refined the "cleaned" data.
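The counting and stop-list refinement could look something like the sketch below. The toy data, the helper `word_counts_by_label`, and the rule used to pick extra stop words (here, any word that shows up in every category) are assumptions for illustration. The notebook's actual criterion for an "anomalous" word may differ.

```python
from collections import Counter

def word_counts_by_label(emails):
    """Map each label to a Counter of the tokens seen under that label."""
    counts = {}
    for tokens, label in emails:
        counts.setdefault(label, Counter()).update(tokens)
    return counts

# Toy, made-up data: (tokens, label) pairs standing in for cleaned emails.
toy_emails = [
    (["win", "free", "prize", "com"], "spam"),
    (["free", "offer", "click", "com"], "spam"),
    (["meeting", "please", "review", "com"], "not_spam"),
    (["please", "send", "report"], "not_spam"),
]
counts = word_counts_by_label(toy_emails)

# Hypothetical rule: a word that appears in every category is a stop-word candidate.
shared_words = set.intersection(*(set(c) for c in counts.values()))
stoplist = {"the", "to"} | shared_words

# Re-clean the data with the extended stop list.
refined = [
    ([tok for tok in tokens if tok not in stoplist], label)
    for tokens, label in toy_emails
]
print(counts["spam"].most_common(3))
print(shared_words)
```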
In the cleaned data, we compared the most frequently appearing words in each category.
We found that [“com”, “please”, and integers] commonly appeared in both categories, and hypothesized that they don’t tell us much about the type of an email.
A correlation matrix supported our hypothesis: the appearance of these words had low correlation with either category.
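To make the correlation check concrete, here is a minimal sketch that computes the Pearson correlation between a binary "word is present" indicator and the spam label. It covers only the word-versus-label entries of a correlation matrix, and the toy documents and the helper names `pearson` and `word_label_correlation` are assumptions for this example rather than the notebook's code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

def word_label_correlation(docs, labels, words):
    """Correlate the presence (1/0) of each word with the label (1 = spam)."""
    return {
        word: pearson([1 if word in doc else 0 for doc in docs], labels)
        for word in words
    }

# Toy, made-up data: each document is the set of tokens in one email.
docs = [
    {"free", "prize", "com"},
    {"free", "click"},
    {"please", "report", "com"},
    {"meeting", "notes"},
]
labels = [1, 1, 0, 0]
corr = word_label_correlation(docs, labels, ["com", "free"])
print(corr)
```

In this toy data "com" appears equally often in both categories, so its correlation with the label is zero, while "free" appears only in spam and correlates perfectly.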
We also found 2 groups of words that appear only in the frequency list of Spam. We highlight them in the notebook because the words within each group relate to one another to form meanings.
Our problem is a classification problem. We explored 4 other classification models outside the course and compared their accuracies :
- Logistic Regression - 98.3%
- Naive Bayes Classification - 91.1%
- Support Vector Machine - 99.2%
- Random Forest Classification - 98.2%
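A comparison of this kind could be set up roughly as follows. This is a sketch under assumptions: it uses scikit-learn, TF-IDF vectorization, default hyperparameters, 2-fold cross-validation, and a made-up toy dataset, none of which are confirmed by this summary. It will not reproduce the accuracies listed above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, made-up emails (1 = spam, 0 = not spam); the real project used a Kaggle dataset.
texts = [
    "win a free prize now", "free offer click here", "claim your free money",
    "urgent prize waiting click", "please review the meeting notes",
    "send the report by friday", "lunch meeting moved to noon",
    "please find the attached report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The four model families named above, with assumed default settings.
models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

def compare_models(texts, labels, cv=2):
    """Return the mean cross-validated accuracy for each model."""
    scores = {}
    for name, model in models.items():
        # Vectorize inside the pipeline so each fold fits TF-IDF on its own training split.
        pipeline = make_pipeline(TfidfVectorizer(), model)
        fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores

scores = compare_models(texts, labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.1%}")
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the held-out fold into training.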
Support Vector Machine is our model of choice: at 99.2%, it had the highest accuracy of the four models in detecting spam, which effectively solves our problem.
The observations made from our Exploratory Data Analysis show that certain keywords appear more often in spam emails. On top of the already effective model, we would advise users to be aware of such keywords, even if emails are not flagged as spam. This can help cover the remaining 0.8% of emails that the model misclassifies.
Contributions:
Gaoyuan: Part 1, Part 2, Part 3.1
Ping Wee: Part 3.2 - 3.4, Part 4