This repository contains two practical Natural Language Processing (NLP) projects that demonstrate the full lifecycle of text-based Machine Learning systems: from raw text preprocessing to vectorisation, model training, evaluation, and prediction.
The first project focuses on Sentiment Analysis, classifying user opinions as positive or negative using classical NLP techniques. The second project applies the same concepts to a real cybersecurity use case: detecting phishing emails, distinguishing malicious messages from legitimate ones with high accuracy.
Both projects share a consistent structure and methodology:
- Robust text cleaning and linguistic preprocessing (tokenisation, stopword removal, lemmatisation)
- TF-IDF vectorisation to transform text into numerical features
- Machine Learning models such as Naive Bayes and Support Vector Machines
- Evaluation through precision, recall, F1-score, and confusion matrices
- Prediction on new, unseen text
- Feature interpretation to understand what drives the model's decisions
Together, these two projects illustrate how NLP techniques can be applied to both user sentiment understanding and cyber-threat detection, making this repository a strong demonstration of applied Data Science + Cybersecurity skills.
This project implements a Sentiment Analysis system using classical Natural Language Processing (NLP) techniques and Machine Learning models. The objective is to classify sentences as positive or negative by applying a complete pipeline of text processing, vectorisation, and modelling.
Before training any model, the text goes through several cleaning and normalisation phases to convert it into a format suitable for analysis.
This includes the removal of special characters, conversion to lowercase, and general text normalisation.
Each sentence is divided into basic units of meaning called tokens (usually words).
Common words in the language (such as the, and, but, etc.) that do not contribute relevant information to the model are removed.
Lemmatisation helps to group different grammatical variants of the same word. Each word is reduced to its base or dictionary form, for example:
- loved → love
- watching → watch
Stemming is a process that trims words down to their approximate root (e.g. running → run). It is included as an alternative for comparing results.
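The cleaning steps above can be sketched in plain Python. This is a minimal sketch: the stopword list and lemma table are tiny hypothetical stand-ins for real resources such as NLTK's stopword corpus and WordNetLemmatizer.

```python
import re

# Hypothetical minimal stopword list; the real project would use a full one (e.g. NLTK's).
STOPWORDS = {"the", "and", "but", "a", "an", "is", "it", "this", "i"}

# Tiny lemma table standing in for a real lemmatiser.
LEMMAS = {"loved": "love", "watching": "watch", "movies": "movie"}

def preprocess(text: str) -> list[str]:
    """Clean, tokenise, remove stopwords, and lemmatise a sentence."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())       # lowercase, strip special characters
    tokens = text.split()                               # tokenisation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]           # lemmatisation

print(preprocess("I loved watching this movie!"))  # → ['love', 'watch', 'movie']
```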
Once preprocessed, the texts are converted into numerical vectors that the models can interpret.
Bag of Words represents each document according to the frequency of occurrence of its words.
TF-IDF is the method used in this project. It transforms the text into a matrix where:
- Common words have less weight.
- Distinctive words have greater importance.
- It is a very effective classic technique for text classification.
The following are mentioned as advanced alternatives: Word2Vec, GloVe, FastText, etc. They are being considered for future projects.
Classic machine learning models are trained on the TF-IDF matrix. Models considered:
- Multinomial Naive Bayes. A simple and very effective model for text; it was the final choice for this project because of its simplicity.
- SVM (Support Vector Machines). Usually achieves high accuracy in text classification.
- Logistic Regression. Robust, efficient, and widely used in traditional NLP.
- Transformers (BERT, RoBERTa, etc.). Mentioned as a future option for extending the project to modern methods.
The system allows the sentiment of new sentences to be predicted. After applying the same pre-processing and vectorisation pipeline, the model returns:
- positive
- negative
Example of use:
model.predict(['I really enjoyed this movie, it was fantastic!'])  # → 'positive'
This project extends the repository with a second Natural Language Processing application: automatic detection of phishing emails. The objective is to classify emails into two categories:
- Phishing Email
- Safe Email
To achieve this, a classical NLP pipeline is applied to clean and normalise email text, followed by vectorisation with TF-IDF and classification using machine learning models.
Emails often contain noisy text (URLs, numbers, formatting artifacts). The following steps are applied:
This phase involves removal of special characters and punctuation, normalisation of Unicode anomalies, conversion to lowercase, and optional handling of URLs and numbers.
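A minimal sketch of such a cleaning step is shown below; the placeholder tokens (`urltoken`, `numtoken`) and the example address are hypothetical choices, not the project's exact implementation.

```python
import re

def clean_email(text: str) -> str:
    """Normalise raw email text: lowercase, replace URLs/numbers, strip specials."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " urltoken ", text)  # optional URL handling
    text = re.sub(r"\d+", " numtoken ", text)           # optional number handling
    text = re.sub(r"[^a-z\s]", " ", text)               # drop punctuation/special chars
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(clean_email("Verify your account NOW at http://fake-bank.example/login!!!"))
# → verify your account now at urltoken
```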
The raw email text is split into tokens (words), enabling structured processing.
Common English words that do not contribute to classification are removed to emphasise meaningful terms.
The text is normalised by reducing words to their base form:
- verifying → verify
- clicking → click
Lemmas help reduce vocabulary size and improve generalisation. In this case, however, no significant improvement was observed from this conversion, so it was ultimately discarded.
As in the sentiment analysis project, texts are transformed using TF-IDF, which:
- Downweights extremely common words (e.g., "email", "today").
- Highlights words indicative of phishing (e.g., "verify", "account", "click").
- Produces a sparse matrix suitable for classical ML models.
This representation worked particularly well for distinguishing malicious from legitimate emails.
A Linear Support Vector Machine (SVM) was used for classification due to its strong performance with high-dimensional text features. The model achieves an accuracy of ≈ 97–98%, high precision and recall for both classes, and a clear separation between phishing indicators and safe-email vocabulary.
Feature interpretation was performed by analysing model coefficients, revealing highly influential words for each class (e.g., “click”, “http”, “remove” for phishing and “thanks”, “attached”, “university” for safe emails).
After applying the same preprocessing and TF-IDF transformation, the trained model can classify new, unseen emails:
- Phishing Email
- Safe Email
This makes the system suitable for practical security applications.