A production-style product search engine implementing a two-stage ranking architecture used in real-world e-commerce systems.
Built end-to-end: indexing → retrieval → feature engineering → learning-to-rank → evaluation → API serving, with measurable ranking improvements and real performance constraints.
Two-stage ranking pipeline:
1. **Candidate Retrieval (BM25):** fast lexical search that recalls the top-K relevant products.
2. **Learning-to-Rank Reranking (XGBoost):** an ML model reranks the candidates using engineered relevance, popularity, and quality signals.
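The two stages above can be sketched as a minimal pipeline (the scorers here are toy stand-ins, not the project's actual BM25 index or XGBoost model):

```python
# Two-stage ranking skeleton. Stage 1 does cheap recall over the whole
# corpus; stage 2 rescores only the small candidate set.

def retrieve_candidates(query_terms, corpus, k=100):
    """Stage 1: cheap lexical recall (term overlap stands in for BM25)."""
    scored = [
        (doc_id, len(set(query_terms) & set(tokens)))
        for doc_id, tokens in corpus.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def rerank(candidates, feature_fn, model_fn, top_n=10):
    """Stage 2: rescore the candidate set with a learned model."""
    rescored = [(doc_id, model_fn(feature_fn(doc_id, lex)))
                for doc_id, lex in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in rescored[:top_n]]

corpus = {
    "d1": ["wireless", "headphones", "bluetooth"],
    "d2": ["wired", "earbuds"],
    "d3": ["wireless", "mouse"],
}
cands = retrieve_candidates(["wireless", "headphones"], corpus, k=2)
top = rerank(cands, lambda d, lex: [lex], lambda f: f[0], top_n=1)
print(top)  # ['d1'] — the only doc matching both query terms
```

Because stage 2 only ever sees K candidates, its per-query cost is bounded regardless of corpus size, which is what keeps the latency budget intact.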
📈 Result:
- BM25 NDCG@10 → 0.614
- LTR NDCG@10 → 0.649
- +5.7% improvement
- Enables fast document lookup and scalable preprocessing
- Supports efficient query generation, labeling, and feature extraction
- Demonstrates real information retrieval fundamentals, not just library usage
BM25 (fast recall, low latency)
↓
LTR (precision reranking)
- Separates latency-critical and quality-critical logic
- Matches architectures used at Amazon / Flipkart scale
- Either stage can be upgraded independently
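A from-scratch Okapi BM25 scorer makes the recall stage concrete (a minimal sketch; the defaults k1=1.5, b=0.75 are common choices, not necessarily this project's):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenized doc against a tokenized query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    ["wireless", "headphones", "bluetooth"],
    ["wired", "earbuds", "case"],
    ["wireless", "mouse"],
]
scores = bm25_scores(["wireless", "headphones"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0 — the doc matching both query terms wins
```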
- Lexical relevance: `bm25_score`, `token_overlap`
- Popularity & quality: `log_review_count`, `avg_rating`
- Normalization & constraints: `query_len`, `title_len`, `price`
Fast, interpretable, production-safe — no black-box embeddings.
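A sketch of how such a feature vector might be assembled (the dict layout and document fields are illustrative; only the feature names come from the list above):

```python
import math

def extract_features(query, doc, bm25_score):
    """Build one query-document feature vector for the reranker."""
    q_tokens = set(query.lower().split())
    t_tokens = set(doc["title"].lower().split())
    return {
        "bm25_score": bm25_score,
        "token_overlap": len(q_tokens & t_tokens) / max(len(q_tokens), 1),
        "log_review_count": math.log1p(doc["review_count"]),  # tame heavy tail
        "avg_rating": doc["avg_rating"],
        "query_len": len(q_tokens),
        "title_len": len(t_tokens),
        "price": doc["price"],
    }

doc = {"title": "Wireless Headphones Pro", "review_count": 1200,
       "avg_rating": 4.4, "price": 79.99}
feats = extract_features("wireless headphones", doc, bm25_score=7.3)
print(feats["token_overlap"])  # 1.0 — every query token appears in the title
```

The log transform on review counts is the usual trick for popularity signals: it keeps a 100k-review bestseller from dominating the tree splits.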
- Implemented NDCG@k from scratch
- Compared baseline vs reranker
- Ranking gains are quantified, not hand-waved
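A from-scratch NDCG@k can be written in a few lines; this sketch uses linear gain (the exponential 2^rel − 1 variant is equally common, and the project may use either):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of this ranking divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1], k=3))  # 1.0 — already in ideal order
```

`relevances` is the list of graded labels in the order the ranker returned the documents, so a perfect ranking scores exactly 1.0 and any misordering scores below it.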
- Modular pipeline (restartable at any stage)
- Centralized configuration
- Structured logging with rotation
- FastAPI inference service with health checks
- Unit tests for core components
Raw JSONL (Amazon Electronics)
↓
Preprocessing & Indexing
↓
BM25 Retrieval (Top-100)
↓
Feature Extraction
↓
XGBoost LTR Reranking (Top-10)
↓
FastAPI Search API
- Retrieval: BM25 (rank-bm25)
- Ranking: XGBoost (rank:ndcg)
- Backend: FastAPI, Uvicorn
- Data: Pandas, NumPy
- Evaluation: Custom NDCG@k
- Infra mindset: Logging, config, tests
src/
├── preprocessing/ # Data cleaning, indexing, labeling
├── retrieval/ # BM25 index & search
├── features/ # Feature engineering
├── ltr/ # Model training & inference
├── evaluation/ # NDCG and benchmarks
├── utils/ # Logging, text utils
└── config.py # Centralized config
api/ # FastAPI serving layer
tests/ # Unit tests
# Preprocess & index data
python src/preprocessing/extract_metadata.py
python src/preprocessing/search_corpus_generation.py
# Build BM25 index
python src/retrieval/bm25_index.py
# Train LTR model
python src/ltr/train_ranker.py
# Evaluate ranking quality
python src/evaluation/ndcg.py
# Start API
uvicorn api.app:app --port 8000

GET /search?q=wireless+headphones

Returns top-ranked products with LTR scores and metadata.
- Lexical retrieval only (no semantic embeddings yet)
- In-memory index (not distributed)
- No personalization or diversity re-ranking
These limitations mirror early-stage production systems, and each is a clear extension point.
- Neural retrieval (dense embeddings)
- Real user feedback & online learning
- Elasticsearch backend
- Personalization & diversity
- Incremental indexing