Two-Stage News Recommendation via Candidate Generation and Ranking Models

This project builds and evaluates an industry-style two-stage recommender:

Candidate Generation narrows down the full news catalogue to a high-recall candidate pool using content similarity and behavioural priors.
Ranking applies increasingly expressive models to rank candidates and optimize top-K recommendation quality.

The system is designed to mirror real-world recommender pipelines, where recall quality in the first stage directly determines the ceiling for ranking performance.

This is a course project for CSE 847 Machine Learning at Michigan State University.

Dataset and Preprocessing

MIND-small (Microsoft News Dataset) contains ~65k news articles, ~230k user impression logs, user click histories, timestamps, categories, sources, and article metadata. Both training and validation splits are used, with strict chronological handling to prevent data leakage.

News Metadata Preprocessing

HTML and noise removal (BeautifulSoup + regex)
Standardization of punctuation and whitespace
Title + abstract concatenation for robust content representation
Identification of very short articles to exclude low-quality candidates
Publisher/source reconstruction from URLs
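
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it uses only the standard library (html, re, urllib) as a stand-in for BeautifulSoup so that it has no dependencies, and the function names, the MIN_TOKENS threshold, and the use of the URL host as the source are all assumptions.

```python
import html
import re
from urllib.parse import urlparse

_TAG_RE = re.compile(r"<[^>]+>")   # crude tag stripper (stand-in for BeautifulSoup)
_WS_RE = re.compile(r"\s+")
# Map curly quotes and dashes to plain ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})

MIN_TOKENS = 5  # hypothetical cut-off for "very short" articles


def clean_text(raw):
    """Decode entities, drop tags, standardize punctuation and whitespace."""
    text = html.unescape(raw or "")
    text = _TAG_RE.sub(" ", text)
    text = text.translate(_PUNCT_MAP)
    return _WS_RE.sub(" ", text).strip()


def build_content(title, abstract):
    """Concatenate cleaned title and abstract, skipping empty parts."""
    parts = [clean_text(title), clean_text(abstract)]
    return " ".join(p for p in parts if p)


def is_too_short(content, min_tokens=MIN_TOKENS):
    """Flag articles with fewer whitespace-separated tokens than the cut-off."""
    return len(content.split()) < min_tokens


def source_from_url(url):
    """Assumed heuristic: use the URL host (minus 'www.') as the source."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

For example, build_content(title, abstract) would produce the text that is later vectorized, and is_too_short can be used to filter candidates before retrieval.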

User Behavior Preprocessing

Expansion of impression logs into (user, article, label) format
Construction of ordered user click histories
Computation of smoothed CTR priors (news, category, source) using Laplace smoothing
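
A minimal sketch of the impression expansion and the smoothed CTR priors is shown below. It assumes the MIND behaviours format, where each impression is a space-separated string of newsID-label tokens; the function names and the pseudo-counts alpha and beta are illustrative assumptions, not values taken from the notebook.

```python
from collections import defaultdict


def expand_impressions(user_id, impressions):
    """Turn 'N1-1 N2-0' into [(user, 'N1', 1), (user, 'N2', 0)]."""
    rows = []
    for token in impressions.split():
        news_id, _, label = token.rpartition("-")
        rows.append((user_id, news_id, int(label)))
    return rows


def smoothed_ctr(clicks, views, alpha=1.0, beta=1.0):
    """Additive (Laplace) smoothing: alpha pseudo-clicks, beta pseudo-skips."""
    return (clicks + alpha) / (views + alpha + beta)


def ctr_table(rows, key_of, alpha=1.0, beta=1.0):
    """Aggregate clicks and views per key (news, category, or source)."""
    clicks = defaultdict(int)
    views = defaultdict(int)
    for _, news_id, label in rows:
        key = key_of(news_id)
        views[key] += 1
        clicks[key] += label
    return {k: smoothed_ctr(clicks[k], views[k], alpha, beta) for k in views}
```

Passing a different key_of (for instance a lookup from news_id to category) yields the category-level or source-level prior from the same rows. With alpha = beta = 1, an article with no impressions falls back to 0.5; in practice the pseudo-counts would be tuned so that unseen items fall back closer to the global CTR.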

Stage 1: Candidate Generation (Recall-Oriented)

Candidate articles are scored using a weighted combination of:

TF–IDF content similarity (unigrams and bigrams; cosine similarity against the user's recently clicked articles)
Popularity priors (news-level CTR with fallbacks to category/source/global CTR)
User category preferences (temperature-scaled softmax over historical clicks)
Source familiarity bonus (previously clicked publishers)

The final candidate score is a weighted sum of these components. Candidate pool size K is selected via Recall@K, with K = 500 chosen as the best trade-off between recall and computation.
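
The scoring and pool-size selection described above might look like the sketch below. The weights, the use of click counts divided by a temperature as softmax logits, and the per-impression definition of Recall@K are all assumptions made for illustration; the notebook's actual values and definitions may differ.

```python
import math
from collections import Counter

# Hypothetical component weights; the real values are not given in this README.
WEIGHTS = {"content": 0.5, "popularity": 0.2, "category": 0.2, "source": 0.1}


def category_preferences(clicked_categories, temperature=1.0):
    """Temperature-scaled softmax over per-category click counts."""
    counts = Counter(clicked_categories)
    if not counts:
        return {}
    logits = {c: n / temperature for c, n in counts.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def candidate_score(features, weights=WEIGHTS):
    """Weighted sum of the component scores; missing components count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def top_k(scored, k):
    """Return the k highest-scoring news ids."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [news_id for news_id, _ in ranked[:k]]


def recall_at_k(candidates, clicked):
    """Fraction of clicked articles that appear in the candidate pool."""
    clicked = set(clicked)
    if not clicked:
        return 0.0
    return len(set(candidates) & clicked) / len(clicked)
```

Averaging recall_at_k over validation impressions for several values of K gives the kind of recall-versus-cost curve from which a pool size such as K = 500 can be chosen. A higher temperature flattens the category preferences, and a lower one concentrates them on the most-clicked category.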

Stage 2: Ranking Models (Precision-Oriented)

Each impression’s candidate pool is ranked using multiple models:

Baselines: Logistic Regression (TF–IDF), XGBoost (TF–IDF)
Neural Rankers: MLP v1 (DistilBERT embeddings only), MLP v2 (Best Model)

Specifications for MLP v2 (Best Model):

TF–IDF reduced via Truncated SVD
Concatenated with DistilBERT embeddings
Trained using hard negative sampling from the candidate pool

This hybrid model captures both lexical precision and semantic relevance, leading to large gains in personalised ranking.
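
As a rough illustration of the training setup for MLP v2, the sketch below draws negatives from the stage-1 candidate pool rather than from the whole catalogue, and concatenates the two feature views. Pool negatives are "hard" in the sense that they already scored well for the user in retrieval. The function names, the number of negatives per positive, and uniform sampling within the pool are assumptions; the notebook may select negatives differently.

```python
import random


def build_training_rows(candidate_pool, clicked, n_neg_per_pos=4, seed=0):
    """Pair each clicked article with negatives sampled from the pool."""
    rng = random.Random(seed)
    clicked_set = set(clicked)
    # Non-clicked pool members are the hard negatives.
    negatives = [nid for nid in candidate_pool if nid not in clicked_set]
    rows = []
    for pos in clicked:
        rows.append((pos, 1))
        k = min(n_neg_per_pos, len(negatives))
        rows.extend((neg, 0) for neg in rng.sample(negatives, k))
    return rows


def hybrid_features(tfidf_svd_vec, bert_vec):
    """Concatenate SVD-reduced TF-IDF features with DistilBERT embeddings."""
    return list(tfidf_svd_vec) + list(bert_vec)
```

Because the ranker is evaluated on the same candidate pools, sampling negatives from them keeps the training distribution close to the evaluation distribution.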

Key Results

All evaluation is performed at the impression level using Hit@K, MRR@K, nDCG@K. Both two-stage and one-stage (standard MIND setup) evaluations are reported for comparison.
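
For reference, the three metrics can be computed per impression as in the sketch below, which assumes binary relevance (clicked or not) and the usual log2 discount for nDCG. The function names are illustrative, and the notebook's implementation may handle edge cases differently.

```python
import math


def hit_at_k(ranked, clicked, k):
    """1.0 if any clicked article appears in the top k, else 0.0."""
    return 1.0 if any(nid in clicked for nid in ranked[:k]) else 0.0


def mrr_at_k(ranked, clicked, k):
    """Reciprocal rank of the first clicked article within the top k."""
    for rank, nid in enumerate(ranked[:k], start=1):
        if nid in clicked:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, clicked, k):
    """DCG with binary gains, normalized by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, nid in enumerate(ranked[:k], start=1)
        if nid in clicked
    )
    ideal_hits = min(len(clicked), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_over_impressions(metric, impressions, k):
    """Average a metric over (ranked_list, clicked_set) pairs."""
    return sum(metric(r, c, k) for r, c in impressions) / len(impressions)
```

In the two-stage setting, ranked would contain only articles from the candidate pool, so an impression whose clicked article was missed by retrieval scores zero on all three metrics regardless of the ranker.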

Candidate generation achieves Recall@500 ≈ 0.52, setting a strong recall ceiling.

Traditional models perform reasonably in one-stage ranking but struggle in two-stage settings.

The hybrid MLP (TF–IDF + DistilBERT) delivers a ~10× improvement in Hit@5, along with large gains in MRR@5 and nDCG@5.

Results highlight the importance of high-quality candidate recall, hybrid feature representations, and hard-negative training aligned with the evaluation distribution.

Notebook Structure

bert_embeddings.zip contains precomputed DistilBERT sentence embeddings for all training and validation news articles, for use by the ranking models.

nid2idx_train.pkl and nid2idx_val.pkl map news_id to embedding indices for the training and validation sets, allowing fast lookup of article embeddings during candidate expansion and ranking without repeated indexing.

val_candidate_pools_k500.parquet stores candidate pools (K = 500) generated for validation impressions during the retrieval stage, so all ranking models are evaluated on the same fixed candidate set.

Repository files:

.gitattributes
CSE847_FinalProject_MahnoorSheikh.ipynb
CSE847_FinalReport_MahnoorSheikh.pdf
README.md
bert_embeddings.zip
nid2idx_train.pkl
nid2idx_val.pkl
val_candidate_pools_k500.parquet
