DemonicAK/production-search-ranker

🔍 Search Engine with Learning-to-Rank (Production-Grade)

A production-style product search engine implementing a two-stage ranking architecture used in real-world e-commerce systems.

Built end-to-end: indexing → retrieval → feature engineering → learning-to-rank → evaluation → API serving, with measurable ranking improvements and real performance constraints.


🚀 What This System Does

Two-stage ranking pipeline:

  1. Candidate Retrieval (BM25): fast lexical search recalls the top-K relevant products.

  2. Learning-to-Rank Reranking (XGBoost): an ML model reranks the candidates using engineered relevance, popularity, and quality signals.
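The two stages compose into a single search call. A minimal sketch of the flow (the `bm25_index.top_k` and `rerank_model.score` interfaces here are illustrative assumptions, not the repository's actual API):

```python
def search(query, bm25_index, rerank_model, k_retrieve=100, k_final=10):
    """Two-stage ranking: cheap BM25 recall, then ML precision reranking.

    `bm25_index.top_k` and `rerank_model.score` are illustrative
    interfaces, not the repository's actual API.
    """
    candidates = bm25_index.top_k(query, k_retrieve)      # stage 1: fast recall
    scored = [(doc, rerank_model.score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)   # stage 2: rerank
    return [doc for doc, _ in scored[:k_final]]
```

The key property is that the expensive model only ever sees `k_retrieve` candidates, keeping latency bounded regardless of corpus size.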

📈 Result:

  • BM25 NDCG@10 → 0.614
  • LTR NDCG@10 → 0.649
  • +5.7% improvement

🧠 Why This Project Is Strong

✅ Built a Custom Inverted Index

  • Enables fast document lookup and scalable preprocessing
  • Supports efficient query generation, labeling, and feature extraction
  • Demonstrates real information retrieval fundamentals, not just library usage
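The core idea of an inverted index is a token → posting-list mapping. A minimal sketch under simple whitespace tokenization (the repository's real implementation likely handles more preprocessing):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def lookup(index, query):
    """Return doc ids containing every query token (AND semantics)."""
    postings = [set(index.get(tok, ())) for tok in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []
```

Lookup cost is proportional to posting-list length rather than corpus size, which is what makes candidate retrieval fast.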

✅ True Industry Two-Stage Architecture

BM25 (fast recall, low latency)
        ↓
LTR (precision reranking)
  • Separates latency-critical and quality-critical logic
  • Follows the retrieve-then-rerank pattern used in large-scale e-commerce search (Amazon, Flipkart, etc.)
  • Either stage can be upgraded independently

✅ Meaningful Feature Engineering (7 features)

  • Lexical relevance: bm25_score, token_overlap
  • Popularity & quality: log_review_count, avg_rating
  • Query & document statistics: query_len, title_len, price

Fast, interpretable, production-safe — no black-box embeddings.
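Assembling the 7-dimensional feature vector is straightforward. A sketch, assuming a product dict with `title`, `review_count`, `avg_rating`, and `price` fields (the field names and exact normalization are assumptions, not the repo's exact schema):

```python
import math

def extract_features(query, product, bm25_score):
    """Build a 7-dim feature vector for one (query, product) pair.

    Product field names here are illustrative assumptions.
    """
    q_tokens = set(query.lower().split())
    t_tokens = set(product["title"].lower().split())
    overlap = len(q_tokens & t_tokens) / max(len(q_tokens), 1)
    return [
        bm25_score,                           # lexical relevance
        overlap,                              # token_overlap
        math.log1p(product["review_count"]),  # log_review_count
        product["avg_rating"],                # avg_rating
        len(q_tokens),                        # query_len
        len(t_tokens),                        # title_len
        product["price"],                     # price
    ]
```

The log transform on review counts compresses the heavy-tailed popularity distribution so a few blockbuster products don't dominate the model.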

✅ Measurable Offline Evaluation

  • Implemented NDCG@k from scratch
  • Compared baseline vs reranker
  • Ranking gains are quantified, not hand-waved
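A from-scratch NDCG@k follows directly from its definition: discount each graded relevance by log2 of its rank, then normalize by the ideal (descending-relevance) ordering. A sketch of the metric:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with the standard log2 rank discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """Normalize DCG by the DCG of the ideal ordering; 0.0 if no relevant docs."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Because of the normalization, NDCG@k is always in [0, 1], which makes the BM25-vs-LTR comparison above directly interpretable.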

✅ Production Mindset

  • Modular pipeline (restartable at any stage)
  • Centralized configuration
  • Structured logging with rotation
  • FastAPI inference service with health checks
  • Unit tests for core components

🏗️ High-Level Architecture

Raw JSONL (Amazon Electronics)
        ↓
Preprocessing & Indexing
        ↓
BM25 Retrieval (Top-100)
        ↓
Feature Extraction
        ↓
XGBoost LTR Reranking (Top-10)
        ↓
FastAPI Search API

⚙️ Tech Stack

  • Retrieval: BM25 (rank-bm25)
  • Ranking: XGBoost (rank:ndcg)
  • Backend: FastAPI, Uvicorn
  • Data: Pandas, NumPy
  • Evaluation: Custom NDCG@k
  • Infra mindset: Logging, config, tests
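For the ranking stage, XGBoost's listwise NDCG objective is configured through its parameter dict. A typical setup (the values here are illustrative, not the repo's exact config):

```python
# Illustrative XGBoost learning-to-rank parameters; exact values are assumptions.
params = {
    "objective": "rank:ndcg",  # listwise learning-to-rank objective
    "eval_metric": "ndcg@10",  # matches the offline evaluation metric
    "eta": 0.1,                # learning rate
    "max_depth": 6,            # tree depth
}
```

Note that ranking objectives also require per-query group information at training time (e.g., group sizes or query ids) so XGBoost knows which candidate rows belong to the same query.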

📂 Project Structure (Simplified)

src/
├── preprocessing/   # Data cleaning, indexing, labeling
├── retrieval/       # BM25 index & search
├── features/        # Feature engineering
├── ltr/             # Model training & inference
├── evaluation/      # NDCG and benchmarks
├── utils/           # Logging, text utils
└── config.py        # Centralized config
api/                 # FastAPI serving layer
tests/               # Unit tests

🧪 Running the Pipeline (High Level)

# Preprocess & index data
python src/preprocessing/extract_metadata.py
python src/preprocessing/search_corpus_generation.py

# Build BM25 index
python src/retrieval/bm25_index.py

# Train LTR model
python src/ltr/train_ranker.py

# Evaluate ranking quality
python src/evaluation/ndcg.py

# Start API
uvicorn api.app:app --port 8000

🔌 API Example

GET /search?q=wireless+headphones

Returns top-ranked products with LTR scores and metadata.
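An illustrative response shape (field names and values here are assumptions, not the exact schema):

```json
{
  "query": "wireless headphones",
  "results": [
    {
      "title": "Example Wireless Headphones",
      "ltr_score": 2.41,
      "bm25_score": 14.3,
      "price": 29.99,
      "avg_rating": 4.5
    }
  ]
}
```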


⚠️ Known Limitations (Intentional & Acknowledged)

  • Lexical retrieval only (no semantic embeddings yet)
  • In-memory index (not distributed)
  • No personalization or diversity re-ranking

These mirror early-stage production systems and are straightforward to extend.


🔮 Future Improvements

  • Neural retrieval (dense embeddings)
  • Real user feedback & online learning
  • Elasticsearch backend
  • Personalization & diversity
  • Incremental indexing
