The FaaS MapReduce-based Search Engine combines Function-as-a-Service (FaaS) with the MapReduce programming model to build an efficient, scalable search engine. The goal is a platform that can index large volumes of documents and return search results quickly.


MapReduce-based Search Engine (FaaS + MapReduce)

A lightweight search engine that builds an inverted index over a corpus of documents using a MapReduce pipeline. It's designed to be deployed on Function-as-a-Service (FaaS) platforms for elastic, pay-per-use scaling, but can also be run locally for demo and testing.

Live Demo: https://avinashpawar.dev/A5-ECC/index.html


✨ What it does

  • Crawls/ingests a list of documents (e.g., public domain books).
  • Maps each document to (word, docId, count) tuples, then reduces them to aggregate term frequencies.
  • Merges partial results to produce a compact inverted index.
  • Serves search queries via a minimal web UI (demo) that returns matching document links.

📦 Repository contents

.
├─ A5-ECC.zip            # Front-end UI (open index.html after unzipping)
├─ cloudMapReduce.zip    # Cloud/FaaS pipeline code (packaged)
├─ Mapper.zip            # Mapper stage (tokenize/emit term counts)
├─ Reducer.zip           # Reducer stage (aggregate counts per term)
├─ MergeMapper.zip       # Merge stage - mapper helper
├─ MergeReducer.zip      # Merge stage - reducer helper
├─ books.json            # Sample corpus list (document URLs/metadata)
├─ downloadbooks.sh      # Helper script to download corpus locally
├─ Report.pdf            # Project write-up / design notes
└─ README.md             # (This file)

The code for each stage is provided as a zip archive; deploy it to your FaaS provider or unzip it locally, depending on your environment.


🚀 Quick start

1) Try it online (no setup)

  • Open the Live Demo and follow the on-screen instructions:
    • Use Search to query the pre-indexed dataset.
    • Use Advanced Search to supply your own document links and build a small ad-hoc index directly in the browser.

2) Run the UI locally

  1. Unzip A5-ECC.zip.
  2. Open index.html in a browser.
  3. Use the same flow as the hosted demo.

3) Prepare a local corpus (optional)

  1. Review/modify books.json to point to your desired text files (URLs or local references).

  2. Run:

    chmod +x downloadbooks.sh
    ./downloadbooks.sh

    This will fetch the texts into a local folder (e.g., ./books/).
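The exact schema of books.json isn't shown here; assuming it is a JSON array of objects with a `url` field, the download step could equivalently be done in Python. `download_corpus` and `filename_for` are illustrative names, not part of this repository:

```python
import json
import os
import urllib.request
from urllib.parse import urlparse

def filename_for(url: str) -> str:
    """Derive a local filename from a document URL (fallback if the path is bare)."""
    return os.path.basename(urlparse(url).path) or "document.txt"

def download_corpus(books_path: str = "books.json", out_dir: str = "books") -> None:
    """Fetch every document listed in books.json into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(books_path) as f:
        books = json.load(f)
    for book in books:
        url = book["url"]  # assumes each entry carries a "url" field
        urllib.request.urlretrieve(url, os.path.join(out_dir, filename_for(url)))
```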

4) Run the MapReduce pipeline

You have two options:

A) Local (simple run for demo/testing)

  1. Unzip Mapper.zip, Reducer.zip, MergeMapper.zip, and MergeReducer.zip.
  2. Run the stages on your machine in sequence over the downloaded corpus:
     • Map: tokenize each document → emit (word, docId, count)
     • Reduce: aggregate by (word, docId) → term frequencies
     • Merge: combine shard outputs into a single inverted index file (e.g., JSON/CSV)
  3. Place the final index where the UI expects it (e.g., a data/ or index/ folder referenced by the demo UI).
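The three local stages above can be sketched in plain Python. This is an illustrative sketch, not the packaged code from the zips; the function names (`map_stage`, `reduce_stage`, `merge_stage`) and the JSON-friendly index shape are assumptions for demonstration:

```python
import re
from collections import Counter, defaultdict

def map_stage(doc_id, text):
    """Map: normalize and tokenize a document, emit (term, docId, count) tuples."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(term, doc_id, n) for term, n in Counter(tokens).items()]

def reduce_stage(emitted):
    """Reduce: aggregate counts per (term, docId)."""
    counts = defaultdict(int)
    for term, doc_id, n in emitted:
        counts[(term, doc_id)] += n
    return dict(counts)

def merge_stage(*shards):
    """Merge: combine shard outputs into one inverted index:
    term -> list of (docId, tf)."""
    index = defaultdict(list)
    for shard in shards:
        for (term, doc_id), tf in shard.items():
            index[term].append((doc_id, tf))
    return dict(index)
```

For example, running `merge_stage` over the reduced output of two documents yields one index whose posting list for a shared term contains both document ids.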

B) Cloud/FaaS (scalable run)

  1. Unzip cloudMapReduce.zip and follow your provider's steps (AWS Lambda / GCP Cloud Functions / Azure Functions).
  2. Upload the mapper/reducer/merge functions.
  3. Wire them to a storage bucket (for inputs/outputs) and a coordinator (a simple script or Step Functions/Workflows will do).
  4. Trigger the pipeline and verify that index artifacts are written to storage and exposed to your UI.

Tip: Keep inputs/outputs in object storage (s3://... / gs://...) so each stage reads/writes cleanly.


🔎 Query syntax (UI)

  • Single term: alice
  • Multiple terms: alice wonderland
  • Exact phrase (if supported by your indexer): "alice in wonderland"
  • Basic ranking: simple TF or TF-IDF style scoring (depending on how you configure your reducer/merge logic).
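Since phrase support depends on your indexer, a query front end may need to separate quoted phrases from bare terms before lookup. A minimal parser sketch (`parse_query` is an illustrative name):

```python
import re

def parse_query(q: str):
    """Split a query into bare terms and quoted phrases.
    Returns (terms, phrases)."""
    phrases = re.findall(r'"([^"]+)"', q)      # text inside double quotes
    rest = re.sub(r'"[^"]+"', " ", q)          # drop the quoted spans
    return rest.split(), phrases
```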

🧠 Design overview

  • Mapper: reads raw text, normalizes (lowercase, strip punctuation), tokenizes, emits (term, docId) → 1.
  • Combiner (optional): local aggregation to cut shuffle volume.
  • Reducer: sums counts per (term, docId) and may compute additional stats (doc length, df, idf).
  • Merge: consolidates shard outputs into a single inverted index file (term → list of (docId, tf[, score])).
  • UI: takes a query, looks up terms in the index, merges posting lists, and returns ranked document links.
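The UI's lookup step boils down to merging posting lists and summing per-document scores. A minimal TF-based sketch, assuming the index shape described for the Merge stage (term → list of (docId, tf)); `search` is an illustrative name:

```python
def search(index, query):
    """Look up each query term, merge posting lists,
    and rank documents by summed term frequency."""
    scores = {}
    for term in query.lower().split():
        for doc_id, tf in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=lambda kv: -kv[1])
```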

⚙️ Configuration

  • Corpus list: edit books.json (URLs, titles, ids).
  • Download script: tweak downloadbooks.sh for paths/timeouts.
  • Normalization: update your mapper's tokenization/stopword rules (inside Mapper.zip).
  • Scoring: change reducer/merge logic for TF-IDF/BM25 (inside reducer/merge zips).
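As a concrete example of the scoring knob, a TF-IDF weight combines a term's frequency in a document with its document frequency (df) across the corpus, so terms that appear in fewer documents score higher. `tf_idf` is an illustrative helper, not code from the zips:

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """TF-IDF weight: term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)
```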

✅ Testing your pipeline

  • Start with a tiny books.json (2–5 docs).
  • Verify intermediate files after Map and Reduce (spot-check a few terms).
  • Confirm the Merge produces one small inverted index the UI can load.
  • Search a few known terms and ensure the expected documents appear.

📈 Performance notes

  • Use a Combiner to reduce shuffle.
  • Shard the corpus evenly to avoid stragglers.
  • For FaaS, keep functions stateless and IO-bound work in object storage.
  • Tune batch size (number of docs per invocation) for concurrency and cold-start trade-offs.

Maintainer: Avinash Pawar
Demo: https://avinashpawar.dev/A5-ECC/index.html
