datosyciber/NatualLanguageProcessing

Playing with the possibilities offered by Natural Language and its application in real contexts
NLP Projects: Sentiment Analysis & Phishing Email Detection

This repository contains two practical Natural Language Processing (NLP) projects that demonstrate the full lifecycle of text-based Machine Learning systems: from raw text preprocessing to vectorisation, model training, evaluation, and prediction.

The first project focuses on Sentiment Analysis, classifying user opinions as positive or negative using classical NLP techniques. The second project applies the same concepts to a real cybersecurity use case: detecting phishing emails, distinguishing malicious messages from legitimate ones with high accuracy.

Both projects share a consistent structure and methodology:

  • Robust text cleaning and linguistic preprocessing (tokenisation, stopword removal, lemmatisation)

  • TF-IDF vectorisation to transform text into numerical features

  • Machine Learning models such as Naive Bayes and Support Vector Machines

  • Evaluation through precision, recall, F1-score, and confusion matrices

  • Prediction on new, unseen text

  • Feature interpretation to understand what drives the model's decisions

Together, these two projects illustrate how NLP techniques can be applied to both user sentiment understanding and cyber-threat detection, making this repository a strong demonstration of applied Data Science + Cybersecurity skills.

Sentiment Analysis with Classical NLP Techniques

This project implements a Sentiment Analysis system using classical Natural Language Processing (NLP) techniques and Machine Learning models. The objective is to classify sentences as positive or negative by applying a complete pipeline of text processing, vectorisation, and modelling.

1. Text Pre-processing

Before training any model, the text goes through several cleaning and normalisation phases to convert it into a format suitable for analysis.

1.1 Text Cleaning

This includes the removal of special characters, conversion to lowercase, and general text normalisation.

1.2 Tokenisation

Each sentence is divided into basic units of meaning called tokens (usually words).

1.3 Stopword Removal

Common words in the language (such as the, and, but, etc.) that do not contribute relevant information to the model are removed.

1.4 Lemmatisation

Lemmatisation helps to group different grammatical variants of the same word. Each word is reduced to its base or dictionary form, for example:

  • loved → love

  • watching → watch

1.5 Stemming

A process that trims words down to their approximate root (e.g. running → run). This is included as an alternative for comparing results.
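The preprocessing steps above can be sketched as a single function. This is a minimal illustration with a toy stopword set and lemma map; the actual project would rely on a library such as NLTK or spaCy for real stopword lists, tokenisation, and lemmatisation:

```python
import re

# Toy resources for illustration only; in practice use NLTK's
# stopword list and WordNetLemmatizer (or a stemmer such as
# PorterStemmer for the stemming variant).
STOPWORDS = {"the", "and", "but", "a", "it", "was", "i", "this"}
LEMMAS = {"loved": "love", "watching": "watch", "movies": "movie"}

def preprocess(sentence: str) -> list[str]:
    # 1. Cleaning: lowercase and strip special characters
    text = re.sub(r"[^a-z\s]", "", sentence.lower())
    # 2. Tokenisation: split into word tokens
    tokens = text.split()
    # 3. Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Lemmatisation: map each token to its base form
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("I loved watching the movies!"))
# → ['love', 'watch', 'movie']
```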

2. Numerical Transformation of Text

Once preprocessed, the texts are converted into numerical vectors that the models can interpret.

2.1 Bag of Words (BoW)

Represents each document according to the frequency of occurrence of its words.

2.2 TF-IDF

Method used in this project. It transforms the text into a matrix where:

  • Common words have less weight.

  • Distinctive words have greater importance.

  • It is a very effective classic technique for text classification.
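As a sketch, this is how the TF-IDF transformation looks with scikit-learn's `TfidfVectorizer` (the corpus below is a made-up example, not the project's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; the real project uses the full
# preprocessed review dataset.
corpus = [
    "i loved this movie",
    "i hated this movie",
    "a fantastic and moving film",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Terms shared across documents (e.g. "movie") receive lower weight;
# distinctive terms (e.g. "fantastic", "hated") receive higher weight.
print(X.shape)                         # (3, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # learned vocabulary
```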

2.3 Embeddings

Word2Vec, GloVe, and FastText are noted as advanced alternatives and are being considered for future projects.

3. Model Training

Classic machine learning models are trained on the TF-IDF matrix. Models considered:

  • Multinomial Naive Bayes. A simple and very effective model for text. It was the final choice for this project because of its simplicity.

  • SVM (Support Vector Machines). It usually achieves high accuracy in text classification.

  • Logistic Regression. Robust, efficient, and widely used for traditional NLP.

  • Transformers (BERT, RoBERTa, etc.). These are mentioned as a future option for extending the project to modern methods.
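A minimal sketch of the chosen pipeline, TF-IDF vectorisation followed by Multinomial Naive Bayes, using scikit-learn; the labelled sentences below are hypothetical stand-ins for the project's training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real labelled dataset.
texts = [
    "i loved this movie it was fantastic",
    "great acting and a wonderful story",
    "i hated this film it was terrible",
    "boring plot and awful dialogue",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF vectorisation chained with Multinomial Naive Bayes,
# mirroring the pipeline described above.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["what a fantastic and wonderful movie"]))
```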

4. Sentiment Prediction

The system allows the sentiment of new sentences to be predicted. After applying the same pre-processing and vectorisation pipeline, the model returns:

  • positive

  • negative

Example of use

model.predict(['I really enjoyed this movie, it was fantastic!'])
# → 'positive'

Phishing Email Detection with NLP and Machine Learning

This project extends the repository with a second Natural Language Processing application: automatic detection of phishing emails. The objective is to classify emails into two categories:

  • Phishing Email

  • Safe Email

To achieve this, a classical NLP pipeline is applied to clean and normalise email text, followed by vectorisation with TF-IDF and classification using machine learning models.

1. Text Pre-processing for Emails

Emails often contain noisy text (URLs, numbers, formatting artifacts). The following steps are applied:

1.1 Cleaning & Normalisation

This phase involves removal of special characters and punctuation, normalisation of unicode anomalies, conversion to lowercase and optional handling of URLs and numbers.
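A possible shape for this cleaning phase is sketched below; the placeholder tokens `url` and `num` are one illustrative way of handling URLs and numbers, not necessarily the project's exact choice:

```python
import re
import unicodedata

def clean_email(text: str) -> str:
    # Normalise unicode anomalies (accented / compatibility characters)
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase
    text = text.lower()
    # Optional handling: replace URLs and numbers with placeholder tokens
    text = re.sub(r"https?://\S+", " url ", text)
    text = re.sub(r"\d+", " num ", text)
    # Remove remaining punctuation and special characters
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_email("Verify your account NOW: http://evil.example/login?id=42"))
# → 'verify your account now url'
```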

1.2 Tokenisation

The raw email text is split into tokens (words), enabling structured processing.

1.3 Stopword Removal

Common English words that do not contribute to classification are removed to emphasise meaningful terms.

1.4 Lemmatisation

The text is normalised by reducing words to their base form:

  • verifying → verify

  • clicking → click

Lemmas help reduce vocabulary size and improve generalisation. In this case, however, the conversion produced no significant improvement, so it was ultimately discarded.

2. Vectorisation with TF-IDF

As in the sentiment analysis project, texts are transformed using TF-IDF, which:

  • Downweights extremely common words (e.g., "email", "today").

  • Highlights words indicative of phishing (e.g., "verify", "account", "click").

  • Produces a sparse matrix suitable for classical ML models.

This representation worked particularly well for distinguishing malicious from legitimate emails.

3. Model Training & Performance

A Linear Support Vector Machine (SVM) was used for classification due to its strong performance with high-dimensional text features. The model achieves accuracy of approximately 97–98%, high precision and recall for both classes, and a clear separation between phishing indicators and safe-email vocabulary.

Feature interpretation was performed by analysing model coefficients, revealing highly influential words for each class (e.g., “click”, “http”, “remove” for phishing and “thanks”, “attached”, “university” for safe emails).

4. Predicting Whether an Email Is Phishing

After applying the same preprocessing and TF-IDF transformation, the trained model can classify new, unseen emails:

  • Phishing Email

  • Safe Email

This makes the system suitable for practical security applications.
