This project focuses on applying Natural Language Processing (NLP) techniques to classify news headlines as real (1) or fake (0).
Text processing and machine-learning skills are essential in Data Science, and this challenge provides a practical environment to apply them.
The objective is to build a high-performing classifier using the provided training dataset and use it to generate predictions for unseen testing data.
```
.
├── dataset/
│   ├── training_data.csv
│   └── testing_data.csv
├── main.ipynb                        # Individual notebook (full workflow & experiments)
├── NLP2.ipynb                        # Team notebook used for the presentation
├── predictions_best_model_fast.csv   # Final predictions generated by best model
└── pressentation.pptx                # Project presentation slides
```
Build and evaluate multiple NLP-based machine-learning models to classify news headlines as fake or real, and generate predictions for the test dataset, respecting the original file format.
The file training_data.csv was split into:
- A training set
- A validation set, used to compare models and select the best one (see the sketch below)
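A minimal sketch of this split, assuming hypothetical column names `text` and `label` and an illustrative 80/20 ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset/training_data.csv")

# Hold out a validation set for model comparison; stratify so the
# fake/real ratio stays the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)
```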
Applied standard NLP cleaning steps (sketched in code after this list):
- Lowercasing
- Removing punctuation and symbols
- Tokenization
- Stopword removal (NLTK English stopwords)
- Lemmatization
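A sketch of these cleaning steps using NLTK (the exact regexes and step order in the notebooks may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_headline(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation/symbols
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization
    return " ".join(tokens)
```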
Feature extraction was performed with a TF-IDF vectorizer (scikit-learn's TfidfVectorizer).
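An illustrative setup (the `max_features` and `ngram_range` values below are assumptions, not the tuned settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))

# Fit the vocabulary on the training split only, then reuse it for validation.
X_train_tfidf = vectorizer.fit_transform(X_train.apply(clean_headline))
X_val_tfidf = vectorizer.transform(X_val.apply(clean_headline))
```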
The following machine-learning algorithms were evaluated:
- Logistic Regression (baseline and tuned)
- Random Forest (baseline and tuned)
- XGBoost (baseline and tuned)
Hyperparameter tuning was performed using Random Search to avoid long GridSearchCV runtimes.
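An illustrative randomized search for the Logistic Regression model, assuming scikit-learn's RandomizedSearchCV (the actual search spaces live in the notebooks):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # hypothetical range
    n_iter=20,            # far fewer fits than an exhaustive grid
    scoring="accuracy",
    cv=3,
    random_state=42,
)
search.fit(X_train_tfidf, y_train)
lr_tuned = search.best_estimator_
```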
| Model | Validation accuracy |
|---|---|
| LR_baseline | 0.9375 |
| LR_tuned | 0.9398 |
| RF_baseline | 0.9170 |
| RF_tuned | 0.8975 |
| XGB_baseline | 0.8949 |
| XGB_tuned | 0.8745 |
✅ Best Model Selected: LR_tuned (Tuned Logistic Regression)
Confusion matrix of LR_tuned on the validation set (rows = actual class, columns = predicted class):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 (fake) | 3288 | 227 |
| Actual 1 (real) | 184 | 3132 |

Validation accuracy: 0.9398
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.95 | 0.94 | 0.94 | 3515 |
| 1 | 0.93 | 0.94 | 0.94 | 3316 |
| Macro avg | 0.94 | 0.94 | 0.94 | 6831 |
| Weighted avg | 0.94 | 0.94 | 0.94 | 6831 |

Overall accuracy: 0.94
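These figures come from scikit-learn's standard metrics and can be reproduced along these lines:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = lr_tuned.predict(X_val_tfidf)

print(confusion_matrix(y_val, y_pred))       # 2x2 matrix shown above
print(accuracy_score(y_val, y_pred))         # 0.9398
print(classification_report(y_val, y_pred))  # per-class precision/recall/F1
```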
Using the best model (LR_tuned), predictions were generated for the file testing_data.csv.
The file predictions_best_model_fast.csv replaces the placeholder label 2 from testing_data.csv with the predicted class:
- 0 → fake
- 1 → real
The output format strictly follows the original dataset structure (no added columns).
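A sketch of this prediction step, reusing the fitted vectorizer and model from the sketches above (column names are hypothetical; the notebooks define the actual layout):

```python
test_df = pd.read_csv("dataset/testing_data.csv")

X_test_tfidf = vectorizer.transform(test_df["text"].apply(clean_headline))
test_df["label"] = lr_tuned.predict(X_test_tfidf)  # overwrite placeholder 2

# Keep the exact original column layout: no extra columns, no index.
test_df.to_csv("predictions_best_model_fast.csv", index=False)
```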
✔ main.ipynb — Individual full workflow
✔ NLP2.ipynb — Team notebook used for group presentation
✔ predictions_best_model_fast.csv — Final required predictions file
✔ pressentation.pptx — Slides summarizing approach & results
Although the current Logistic Regression model performs well, several enhancements could further improve the performance and robustness of the pipeline:
- Run full GridSearchCV for deeper hyperparameter exploration once computational time is less restricted.
- Expand Random Search parameter ranges for broader search coverage.
- Try alternative classifiers, such as:
  - LightGBM
  - CatBoost
  - Linear SVM with better tuning
- Fine-tune TF-IDF parameters (e.g., n-gram ranges, min_df, max_df, max_features).
- Experiment with CountVectorizer + feature selection (chi-square, mutual information).
- Use word embeddings, such as:
  - Word2Vec
  - GloVe
  - FastText
- Test transformer-based models, which often outperform classical ML pipelines on text classification tasks; for example:
  - BERT
  - DistilBERT
  - RoBERTa
- Implement cross-validation to reduce variance in evaluation.
- Add model ensembling (stacking or averaging multiple models).
- Apply error analysis to understand misclassified headlines and improve preprocessing.
- Build a reproducible ML pipeline using tools like scikit-learn's `Pipeline` and `ColumnTransformer` (see the sketch below).
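For instance, a minimal sketch of such a pipeline, bundling vectorization and classification into one object that can be fit, tuned, and serialized as a unit (`ColumnTransformer` would only come into play if non-text features were added):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Raw headline text in, class predictions out.
pipe.fit(X_train, y_train)
preds = pipe.predict(X_val)
```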