This project focuses on applying Natural Language Processing (NLP) techniques to classify news headlines as real (1) or fake (0).
Text processing and machine-learning skills are essential in Data Science, and this challenge provides a practical environment to apply them.
The objective is to build a high-performing classifier using the provided training dataset and use it to generate predictions for unseen testing data.
```
.
├── dataset/
│   ├── training_data.csv
│   └── testing_data.csv
├── main.ipynb                        # Individual notebook (full workflow & experiments)
├── NLP2.ipynb                        # Team notebook used for the presentation
├── predictions_best_model_fast.csv   # Final predictions generated by best model
└── pressentation.pptx                # Project presentation slides
```
Build and evaluate multiple NLP-based machine-learning models to classify news headlines as fake or real, and generate predictions for the test dataset, respecting the original file format.
The file training_data.csv was split into:
- A training set
- A validation set, used to compare models and select the best one (see the sketch below)
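A minimal sketch of this split, assuming hypothetical column names `text` and `label` and an illustrative 80/20 ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset/training_data.csv")

# Hold out a validation set for model comparison; stratify so the
# fake/real ratio stays the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)
```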
Applied standard NLP cleaning steps (sketched in code after this list):
- Lowercasing
- Removing punctuation and symbols
- Tokenization
- Stopword removal (NLTK English stopwords)
- Lemmatization
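A sketch of these cleaning steps using NLTK (the exact regexes and step order in the notebooks may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_headline(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation/symbols
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization
    return " ".join(tokens)
```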
Feature extraction was performed with a TF-IDF vectorizer (scikit-learn's TfidfVectorizer).
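An illustrative setup (the `max_features` and `ngram_range` values below are assumptions, not the tuned settings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))

# Fit the vocabulary on the training split only, then reuse it for validation.
X_train_tfidf = vectorizer.fit_transform(X_train.apply(clean_headline))
X_val_tfidf = vectorizer.transform(X_val.apply(clean_headline))
```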
The following machine-learning algorithms were evaluated:
- Logistic Regression (baseline and tuned)
- Random Forest (baseline and tuned)
- XGBoost (baseline and tuned)
Hyperparameter tuning was performed using Random Search to avoid long GridSearchCV runtimes.
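An illustrative randomized search for the Logistic Regression model, assuming scikit-learn's RandomizedSearchCV (the actual search spaces live in the notebooks):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # hypothetical range
    n_iter=20,            # far fewer fits than an exhaustive grid
    scoring="accuracy",
    cv=3,
    random_state=42,
)
search.fit(X_train_tfidf, y_train)
lr_tuned = search.best_estimator_
```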
| Model | Validation accuracy |
|---|---|
| LR_baseline | 0.9375 |
| LR_tuned | 0.9398 |
| RF_baseline | 0.9170 |
| RF_tuned | 0.8975 |
| XGB_baseline | 0.8949 |
| XGB_tuned | 0.8745 |
✅ Best Model Selected: LR_tuned (Tuned Logistic Regression)
Confusion matrix of LR_tuned on the validation set (rows = actual class, columns = predicted class):

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 (fake) | 3288 | 227 |
| Actual 1 (real) | 184 | 3132 |

Validation accuracy: 0.9398
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.95 | 0.94 | 0.94 | 3515 |
| 1 | 0.93 | 0.94 | 0.94 | 3316 |
| Macro avg | 0.94 | 0.94 | 0.94 | 6831 |
| Weighted avg | 0.94 | 0.94 | 0.94 | 6831 |

Overall accuracy: 0.94
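These figures come from scikit-learn's standard metrics and can be reproduced along these lines:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = lr_tuned.predict(X_val_tfidf)

print(confusion_matrix(y_val, y_pred))       # 2x2 matrix shown above
print(accuracy_score(y_val, y_pred))         # 0.9398
print(classification_report(y_val, y_pred))  # per-class precision/recall/F1
```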
Using the best model (LR_tuned), predictions were generated for the file testing_data.csv.
The file predictions_best_model_fast.csv replaces the placeholder label 2 from testing_data.csv with the predicted class:
- 0 → fake
- 1 → real
The output format strictly follows the original dataset structure (no added columns).
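A sketch of this prediction step, reusing the fitted vectorizer and model from the sketches above (column names are hypothetical; the notebooks define the actual layout):

```python
test_df = pd.read_csv("dataset/testing_data.csv")

X_test_tfidf = vectorizer.transform(test_df["text"].apply(clean_headline))
test_df["label"] = lr_tuned.predict(X_test_tfidf)  # overwrite placeholder 2

# Keep the exact original column layout: no extra columns, no index.
test_df.to_csv("predictions_best_model_fast.csv", index=False)
```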
✔ main.ipynb — Individual full workflow
✔ NLP2.ipynb — Team notebook used for group presentation
✔ predictions_best_model_fast.csv — Final required predictions file
✔ pressentation.pptx — Slides summarizing approach & results
Although the current Logistic Regression model performs well, several enhancements could further improve the performance and robustness of the pipeline:
- Run full GridSearchCV for deeper hyperparameter exploration once computational time is less restricted.
- Expand Random Search parameter ranges for broader search coverage.
- Try alternative classifiers, such as:
  - LightGBM
  - CatBoost
  - Linear SVM with better tuning
- Fine-tune TF-IDF parameters (e.g., n-gram ranges, min_df, max_df, max_features).
- Experiment with CountVectorizer + feature selection (chi-square, mutual information).
- Use word embeddings, such as:
  - Word2Vec
  - GloVe
  - FastText
- Test transformer-based models, which often outperform classical ML pipelines on text classification tasks; for example:
  - BERT
  - DistilBERT
  - RoBERTa
- Implement cross-validation to reduce variance in evaluation.
- Add model ensembling (stacking or averaging multiple models).
- Apply error analysis to understand misclassified headlines and improve preprocessing.
- Build a reproducible ML pipeline using tools like scikit-learn's `Pipeline` and `ColumnTransformer` (see the sketch below).
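For instance, a minimal sketch of such a pipeline, bundling vectorization and classification into one object that can be fit, tuned, and serialized as a unit (`ColumnTransformer` would only come into play if non-text features were added):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Raw headline text in, class predictions out.
pipe.fit(X_train, y_train)
preds = pipe.predict(X_val)
```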