Online News Popularity Prediction

Project Overview

This project explores and predicts the popularity of online news articles using machine learning techniques. By analyzing a dataset of articles from Mashable.com, we aim to identify the key factors that influence article engagement and to develop classification models that predict the share category an article will fall into.

Dataset

The dataset contains metadata from 39,797 articles published on Mashable.com, with 61 attributes including:

Content features (word count, images, links)
Keyword metrics
Natural language processing metrics (sentiment, subjectivity)
Temporal features (day of week, weekend)
Social metrics (shares, author followers): additional data we collected ourselves after recognizing that the original dataset focused too heavily on article structure while neglecting content factors

Methodology

Exploratory Data Analysis

Analyzed class distribution of article shares
Examined content, channel, and author-related features
Identified temporal patterns in publishing and engagement
Performed correlation analysis to find predictive features
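
As a minimal sketch of the class-distribution and temporal-pattern steps above, the following dependency-free Python bins share counts into classes and compares weekday and weekend articles. The column names `shares` and `is_weekend`, the two-class `popular`/`unpopular` scheme, the threshold value, and the toy rows are all assumptions made for illustration; none of them are taken from the notebook.

```python
from collections import Counter, defaultdict


def label_article(shares, threshold):
    """Bin a raw share count into a class label (hypothetical two-class scheme)."""
    return "popular" if shares >= threshold else "unpopular"


def class_distribution(rows, threshold):
    """Return the fraction of articles that fall into each class."""
    counts = Counter(label_article(r["shares"], threshold) for r in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def popular_rate_by_group(rows, key, threshold):
    """Return (article count, popular fraction) for each value of `key`."""
    totals = defaultdict(int)
    popular = defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        if label_article(r["shares"], threshold) == "popular":
            popular[r[key]] += 1
    return {group: (totals[group], popular[group] / totals[group]) for group in totals}


# Made-up rows for illustration only; they are not drawn from the dataset.
toy_rows = [
    {"shares": 500, "is_weekend": 0},
    {"shares": 1200, "is_weekend": 0},
    {"shares": 3000, "is_weekend": 0},
    {"shares": 800, "is_weekend": 0},
    {"shares": 2500, "is_weekend": 1},
    {"shares": 900, "is_weekend": 1},
]

distribution = class_distribution(toy_rows, threshold=1000)
by_weekend = popular_rate_by_group(toy_rows, "is_weekend", threshold=1000)
```

Reporting the article count next to the popular fraction keeps both halves of a temporal comparison visible: how many articles were published in each group and how often they crossed the threshold.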

Feature Engineering

Transformed skewed features to make their distributions more symmetric
Applied standardization to numeric features
Created interaction and polynomial features
Reduced dimensionality by removing highly correlated features
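
The transformation, standardization, and correlation-based pruning steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the choice of `log1p` as the skew transform, the 0.9 correlation threshold, the greedy keep-first pruning rule, and the feature names `n_words`, `n_chars`, and `n_images` are all hypothetical.

```python
import math


def log1p_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(v - mean) / std for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep a column only if |r| with every kept column is <= threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up columns for illustration only; n_chars is exactly 6 * n_words.
toy = {
    "n_words": [100.0, 250.0, 400.0, 5000.0],
    "n_chars": [600.0, 1500.0, 2400.0, 30000.0],
    "n_images": [3.0, 1.0, 4.0, 1.0],
}
scaled = {name: standardize(log1p_transform(col)) for name, col in toy.items()}
reduced = drop_correlated(scaled, threshold=0.9)
```

In this toy example `n_chars` is dropped because it is almost perfectly correlated with `n_words`, while `n_images` survives. When a held-out test set is used, the mean and standard deviation are normally estimated on the training split only and then reused for the test split.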

Modeling

Applied SMOTE to handle class imbalance
Developed and compared multiple classification models:
- Logistic Regression
- Random Forest
- Gradient Boosting
- XGBoost
- LightGBM
- Neural Networks
Created ensemble models through stacking
Performed hyperparameter tuning for optimal performance
Evaluated with accuracy, F1-score and confusion matrices
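
To make the SMOTE step above concrete, here is a simplified, dependency-free sketch of the core idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the toy points are hypothetical; a full implementation is available as the `SMOTE` class in the imbalanced-learn library, and this sketch does not claim to match the settings used in the notebook.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Made-up 2-D minority samples for illustration only.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
majority_count = 10  # hypothetical size of the majority class

new_points = smote_sketch(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region those samples span. Oversampling of this kind is usually applied to the training split only, so that the test set keeps the original class balance.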

Key Findings

Weekends show higher sharing rates despite fewer articles being published
Certain content channels (Tech, Social Media) correlate with higher shares
Keyword metrics strongly influence article shareability
Best model (Optimized XGBoost) achieved ~68% test accuracy (67.84%) in predicting share categories
Important features include: keyword metrics, whether published on weekend, channel type, and sentiment measures

Models Evaluated

Model	Test Accuracy	F1 Score
Optimized XGBoost	67.84%	67.85%
Top 3 Stacking	67.51%	67.55%
LightGBM	67.39%	67.35%
Gradient Boosting	67.00%	67.01%
Random Forest	66.97%	66.99%
Logistic Regression	66.41%	66.34%
XGBoost	66.12%	66.11%
Neural Network	61.45%	61.46%
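
For readers who want to see how the two columns in the table are computed, the sketch below derives accuracy, a confusion matrix, and an F1 score from true and predicted labels in plain Python. The README does not state which F1 averaging was used or how many share categories there are, so the macro average, the two integer labels, and the toy predictions here are assumptions for illustration only.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    scores = []
    for i in range(len(labels)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration only; these are not model outputs.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
labels = [0, 1]

matrix = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels)
```

When classes are balanced and errors are spread evenly across them, as in this toy example, accuracy and F1 come out nearly identical.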

Challenges and Future Work

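Limited predictive accuracy (about 68% at best) suggests that factors not captured in the dataset influence sharing behavior
Feature engineering and hyperparameter optimization provided only modest gains; for example, the optimized XGBoost model reached 67.84% test accuracy versus 66.12% for the untuned XGBoost model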
Future approaches could incorporate real-time social media data
Integrating audience demographics and browsing behavior could improve prediction accuracy

Team Members

Anqi Gu, Evelyn Zhou, Han Zhang, Hongyu Liao, Yiwei Li

Course

MSDS 422 Practical Machine Learning (March 2025)

Detailed Report

For a comprehensive analysis and in-depth discussion of this project, please refer to our full report.

