
📰 Natural Language Processing Challenge — Fake vs. Real News Classification

📌 Introduction

This project focuses on applying Natural Language Processing (NLP) techniques to classify news headlines as real (1) or fake (0).
Text processing and machine-learning skills are essential in Data Science, and this challenge provides a practical environment to apply them.

The objective is to build a high-performing classifier using the provided training dataset and use it to generate predictions for unseen testing data.


📂 Project Structure

.
├── dataset/
│   ├── training_data.csv
│   └── testing_data.csv
├── main.ipynb                      # Individual notebook (full workflow & experiments)
├── NLP2.ipynb                      # Team notebook used for the presentation
├── predictions_best_model_fast.csv # Final predictions generated by best model
└── pressentation.pptx              # Project presentation slides

🎯 Project Objective

Build and evaluate multiple NLP-based machine-learning models to classify news headlines as fake or real, and generate predictions for the test dataset, respecting the original file format.


🛠️ Methodology

1. Data Splitting

The file training_data.csv was split into:

  • Training set
  • Validation set (used to compare models and select the best one)
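
A minimal sketch of this split with scikit-learn, assuming the CSV exposes `text` and `label` columns (the column names and the 80/20 ratio are assumptions, not confirmed by the repository):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset/training_data.csv")

# Hold out part of the data as a validation set; stratify so the
# fake/real ratio is the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"],
    test_size=0.2, stratify=df["label"], random_state=42,
)
```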

2. Text Preprocessing

Applied standard NLP cleaning steps:

  • Lowercasing
  • Removing punctuation and symbols
  • Tokenization
  • Stopword removal (NLTK English stopwords)
  • Lemmatization
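
With NLTK, these steps might look like the following (a sketch; the exact order and regular expressions used in the notebooks may differ):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")  # required by some NLTK versions for WordNet

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_headline(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                 # strip punctuation/symbols
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)  # lemmatization
```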

3. Vectorization

Experiments were performed using a TF-IDF vectorizer (scikit-learn's TfidfVectorizer).
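
A minimal sketch, fitting on the training split only to avoid leakage (the vocabulary cap is an assumption; the actual vectorizer settings live in the notebooks):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=50_000)  # cap is an assumption

# Fit on training text only, then apply the same mapping to validation text.
X_train_tfidf = vectorizer.fit_transform(X_train.map(clean_headline))
X_val_tfidf = vectorizer.transform(X_val.map(clean_headline))
```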


4. Models Trained

The following machine-learning algorithms were evaluated:

  • Logistic Regression (baseline and tuned)
  • Random Forest (baseline and tuned)
  • XGBoost (baseline and tuned)

Hyperparameter tuning was performed with random search to avoid the long runtimes of an exhaustive GridSearchCV.
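
For the tuned Logistic Regression, such a search might look like this (the parameter distribution and iteration count are illustrative assumptions; the actual grids are in the notebooks):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000, solver="liblinear"),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # regularization strength
    n_iter=20,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
search.fit(X_train_tfidf, y_train)
best_lr = search.best_estimator_  # the LR_tuned model
```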


🧪 Model Performance

🔎 Validation Accuracy Comparison

| Model        | Accuracy |
|--------------|----------|
| LR_baseline  | 0.9375   |
| LR_tuned     | 0.9398   |
| RF_baseline  | 0.9170   |
| RF_tuned     | 0.8975   |
| XGB_baseline | 0.8949   |
| XGB_tuned    | 0.8745   |

Best Model Selected: LR_tuned (Tuned Logistic Regression)


🥇 Best Model — Detailed Evaluation

Confusion Matrix (rows: true class, columns: predicted class)

```
[[3288  227]
 [ 184 3132]]
```

Accuracy

0.9398

Classification Report

| Class    | Precision | Recall | F1-score | Support |
|----------|-----------|--------|----------|---------|
| 0 (fake) | 0.95      | 0.94   | 0.94     | 3515    |
| 1 (real) | 0.93      | 0.94   | 0.94     | 3316    |

Accuracy: 0.94

Macro avg – precision: 0.94, recall: 0.94, f1-score: 0.94
Weighted avg – precision: 0.94, recall: 0.94, f1-score: 0.94
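
These figures can be reproduced on the validation split with scikit-learn (a sketch reusing the names from the earlier snippets):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_lr.predict(X_val_tfidf)

print(confusion_matrix(y_val, y_pred))   # rows: true class, cols: predicted
print(f"Accuracy: {accuracy_score(y_val, y_pred):.4f}")
print(classification_report(y_val, y_pred))
```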


📤 Test Predictions

Using the best model (LR_tuned), predictions were generated for the file testing_data.csv.

The file predictions_best_model_fast.csv replaces the placeholder label 2 with:

  • 0 → fake
  • 1 → real

The output format strictly follows the original dataset structure (no added columns).
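
A sketch of that final step, assuming the test file mirrors the training schema with `text` and `label` columns and 2 as the placeholder (the column names are assumptions):

```python
import pandas as pd

test_df = pd.read_csv("dataset/testing_data.csv")

# Reuse the vectorizer and tuned model fitted during training.
X_test_tfidf = vectorizer.transform(test_df["text"].map(clean_headline))
test_df["label"] = best_lr.predict(X_test_tfidf)  # overwrite placeholder 2

# Write back only the original columns so the file format is unchanged.
test_df.to_csv("predictions_best_model_fast.csv", index=False)
```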


📦 Deliverables

  • main.ipynb — Individual full workflow
  • NLP2.ipynb — Team notebook used for the group presentation
  • predictions_best_model_fast.csv — Final required predictions file
  • pressentation.pptx — Slides summarizing approach & results



🔧 Improvements & Next Steps

Although the current Logistic Regression model performs well, several enhancements could further improve the performance and robustness of the pipeline:

Model & Training Improvements

  • Run full GridSearchCV for deeper hyperparameter exploration once computational time is less restricted.
  • Expand Random Search parameter ranges for broader search coverage.
  • Try alternative classifiers, such as:
    • LightGBM
    • CatBoost
    • Linear SVM with better tuning

Vectorization & NLP Enhancements

  • Fine-tune TF-IDF parameters (e.g., n-gram ranges, min_df, max_df, max_features).
  • Experiment with CountVectorizer + feature selection (chi-square, mutual information).
  • Use word embeddings, such as:
    • Word2Vec
    • GloVe
    • FastText

Deep Learning Approaches

  • Test transformer-based models, for example:
    • BERT
    • DistilBERT
    • RoBERTa
    These models often outperform classical ML pipelines on text classification tasks (a fine-tuning sketch follows below).
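
A minimal fine-tuning sketch with the Hugging Face transformers library; the checkpoint, epoch count, and batch size are illustrative assumptions, and the raw (uncleaned) headlines are used because transformer tokenizers handle casing and punctuation themselves:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # 0 = fake, 1 = real

class HeadlineDataset(torch.utils.data.Dataset):
    """Wraps tokenized headlines and labels for the Trainer API."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(list(texts), truncation=True, padding=True)
        self.labels = list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilbert_out",
                           num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=HeadlineDataset(X_train, y_train),
    eval_dataset=HeadlineDataset(X_val, y_val),
)
trainer.train()
```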

Pipeline & Evaluation Improvements

  • Implement cross-validation to reduce variance in evaluation.
  • Add model ensembling (stacking or averaging multiple models).
  • Apply error analysis to understand misclassified headlines and improve preprocessing.
  • Build a reproducible ML pipeline using tools like Pipeline and ColumnTransformer.
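
Combining the cross-validation and Pipeline suggestions, a minimal sketch (the TF-IDF and regularization parameters are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# The vectorizer is refitted inside every CV fold, so no statistics
# leak from held-out folds, and the whole workflow is one object.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, df["text"], df["label"], cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```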
