MapReduce-based Search Engine (FaaS + MapReduce)

A lightweight search engine that builds an inverted index over a corpus of documents using a MapReduce pipeline. It's designed to be deployed on Function-as-a-Service (FaaS) platforms for elastic, pay-per-use scaling, but can also be run locally for demo and testing.

Live Demo: https://avinashpawar.dev/A5-ECC/index.html

✨ What it does

- Crawls/ingests a list of documents (e.g., public domain books).
- Maps each document's tokens to (word, docId, count) records, then reduces them into aggregate term frequencies.
- Merges partial results to produce a compact inverted index.
- Serves search queries via a minimal web UI (demo) that returns matching document links.

📦 Repository contents

.
├─ A5-ECC.zip            # Front-end UI (open index.html after unzipping)
├─ cloudMapReduce.zip    # Cloud/FaaS pipeline code (packaged)
├─ Mapper.zip            # Mapper stage (tokenize/emit term counts)
├─ Reducer.zip           # Reducer stage (aggregate counts per term)
├─ MergeMapper.zip       # Merge stage - mapper helper
├─ MergeReducer.zip      # Merge stage - reducer helper
├─ books.json            # Sample corpus list (document URLs/metadata)
├─ downloadbooks.sh      # Helper script to download corpus locally
├─ Report.pdf            # Project write-up / design notes
└─ README.md             # (This file)

The code for each stage is provided as a zip archive, which you can deploy as-is or unzip locally, depending on your environment.

🚀 Quick start

1) Try it online (no setup)

Open the Live Demo and follow the on-screen instructions:
- Use Search to query the pre-indexed dataset.
- Use Advanced Search to provide your own links and build a small ad-hoc index in the browser-driven flow.

2) Run the UI locally

1. Unzip A5-ECC.zip.
2. Open index.html in a browser.
3. Use the same flow as the hosted demo.

3) Prepare a local corpus (optional)

Review/modify books.json to point to your desired text files (URLs or local references).
Run:
```
chmod +x downloadbooks.sh
./downloadbooks.sh
```
This will fetch the texts into a local folder (e.g., ./books/).
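
Before downloading, it can help to check what the corpus list will produce. The sketch below is a hypothetical Python helper, not part of the repository. It assumes books.json is a JSON array of objects with `id` and `url` keys (the Configuration section mentions URLs, titles, and ids, but the exact key names are an assumption) and that texts land in a `books` folder as `<id>.txt`.

```python
import json
from pathlib import Path


def plan_downloads(books_json_text, out_dir="books"):
    """Return (url, local_path) pairs for each corpus entry.

    Assumes a JSON array of objects with "id" and "url" keys; the real
    books.json may use different key names.
    """
    entries = json.loads(books_json_text)
    plan = []
    for entry in entries:
        if "id" not in entry or "url" not in entry:
            raise ValueError(f"entry is missing 'id' or 'url': {entry!r}")
        plan.append((entry["url"], Path(out_dir) / f"{entry['id']}.txt"))
    return plan


# Illustrative entry only; the URL is a placeholder, not a real corpus link.
sample = '[{"id": "alice", "title": "Alice", "url": "https://example.org/alice.txt"}]'
for url, path in plan_downloads(sample):
    print(url, "->", path)
```

The helper only plans the downloads; fetching is still left to downloadbooks.sh.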

4) Run the MapReduce pipeline

You have two options:

A) Local (simple run for demo/testing)

- Unzip Mapper.zip, Reducer.zip, MergeMapper.zip, MergeReducer.zip.
- Run the stages on your machine in sequence over the downloaded corpus:
  1. Map: tokenize each document → emit (word, docId, count)
  2. Reduce: aggregate by (word, docId) → term frequencies
  3. Merge: combine shard outputs into a single inverted index file (e.g., JSON/CSV)
- Place the final index where the UI expects it (e.g., a data/ or index/ folder referenced by the demo UI).
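
The local sequence of Map, Reduce, and Merge can be sketched in a few lines of Python. This is an illustration under assumptions, not the code inside the zip archives: it assumes the corpus is a folder of `.txt` files, uses the file name (without extension) as the docId, and writes a JSON index shaped as term → {docId: tf}. All function names are hypothetical.

```python
import json
import re
from collections import Counter
from pathlib import Path

TOKEN_RE = re.compile(r"[a-z0-9]+")


def map_file(path):
    """Map: tokenize one document and count terms, keyed by (term, docId)."""
    doc_id = path.stem  # assumption: file name without extension is the docId
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    return Counter((term, doc_id) for term in TOKEN_RE.findall(text))


def reduce_counts(partials):
    """Reduce: sum the per-document counters into one (term, docId) -> tf table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def merge_index(totals):
    """Merge: regroup by term to get term -> {docId: tf}."""
    index = {}
    for (term, doc_id), tf in totals.items():
        index.setdefault(term, {})[doc_id] = tf
    return index


def build_index(corpus_dir, out_file):
    """Run the three stages in sequence and write the index as JSON."""
    paths = sorted(Path(corpus_dir).glob("*.txt"))
    index = merge_index(reduce_counts(map_file(p) for p in paths))
    out = Path(out_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(index, sort_keys=True), encoding="utf-8")
    return index


# Example call (paths are assumptions): build_index("books", "index/index.json")
```

Keeping each stage as a separate function mirrors the separate archives, so a stage can later be swapped for its FaaS counterpart without changing the others.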

B) Cloud/FaaS (scalable run)

- Unzip cloudMapReduce.zip and follow your provider's steps (AWS Lambda / GCP Cloud Functions / Azure Functions):
  - Upload mapper/reducer/merge functions.
  - Wire them with a storage bucket (for inputs/outputs) and a coordinator (can be a simple script or Step Functions/Workflows).
  - Trigger the pipeline; verify index artifacts are written to storage and exposed to your UI.
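
A "simple script" coordinator could look roughly like the Python sketch below. It is provider-neutral on purpose: `invoke(function_name, payload)` is a hypothetical stand-in for whatever invocation call your platform offers, and the function names `mapper`, `reducer`, and `merge` as well as the payload shapes are assumptions, not names taken from cloudMapReduce.zip.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list of document keys into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_pipeline(doc_keys, invoke, batch_size=2, max_workers=8):
    """Fan out mapper calls in parallel, then run one reduce and one merge call.

    `invoke(function_name, payload)` is a placeholder for the provider's
    invocation call; it is an assumption, not an API from this repository.
    """
    batches = chunk(doc_keys, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        map_outputs = list(pool.map(lambda b: invoke("mapper", {"docs": b}), batches))
    reduced = invoke("reducer", {"inputs": map_outputs})
    return invoke("merge", {"inputs": [reduced]})


def fake_invoke(name, payload):
    """Local stand-in: pretend to run a function and return its output key."""
    inputs = payload.get("docs") or payload.get("inputs")
    return f"{name}/{len(inputs)}-inputs.json"


print(run_pipeline(["a.txt", "b.txt", "c.txt"], fake_invoke))
```

Passing `invoke` as a parameter lets you dry-run the fan-out locally with a fake before wiring in a real cloud client.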

Tip: Keep inputs/outputs in object storage (s3://... / gs://...) so each stage reads/writes cleanly.

🔎 Query syntax (UI)

- Single term: alice
- Multiple terms: alice wonderland
- Exact phrase (if supported by your indexer): "alice in wonderland"
- Basic ranking: simple TF or TF-IDF style scoring (depending on how you configure your reducer/merge logic).
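
To make the three query forms concrete, here is a minimal Python sketch of a parser that separates single terms from quoted phrases. It is an assumption about how such parsing could work, not the demo UI's actual code; `parse_query` is a hypothetical name, and normalization is limited to lowercasing and keeping ASCII letters and digits.

```python
import re

# A quoted phrase, or a run of non-space characters.
QUERY_RE = re.compile(r'"([^"]+)"|(\S+)')
WORD_RE = re.compile(r"[a-z0-9]+")


def parse_query(query):
    """Split a query into single terms and exact phrases (lists of terms)."""
    terms, phrases = [], []
    for phrase, word in QUERY_RE.findall(query.lower()):
        if phrase:
            words = WORD_RE.findall(phrase)
            if words:
                phrases.append(words)
        else:
            # An unbalanced quote falls through here and is treated as a term.
            terms.extend(WORD_RE.findall(word))
    return terms, phrases


print(parse_query('alice "in wonderland"'))
```

Note that exact-phrase matching also needs token positions in the index, which a plain term-frequency index does not store.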

🧠 Design overview

- Mapper: reads raw text, normalizes (lowercase, strip punctuation), tokenizes, emits (term, docId) → 1.
- Combiner (optional): local aggregation to cut shuffle volume.
- Reducer: sums counts per (term, docId) and may compute additional stats (doc length, df, idf).
- Merge: consolidates shard outputs into a single inverted index file (term → list of (docId, tf[, score])).
- UI: takes a query, looks up terms in the index, merges posting lists, and returns ranked document links.
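
The lookup step on the UI side can be sketched as follows. This Python example assumes the merged index has been loaded as a dictionary of term → {docId: tf} (a dict form of the posting list described above, chosen for brevity) and ranks by summed term frequency. The `search` function and its `require_all` flag are hypothetical, not the demo's implementation.

```python
def search(index, terms, require_all=True):
    """Return docIds ranked by summed term frequency.

    `index` maps term -> {docId: tf}. With require_all=True only documents
    containing every term are returned; otherwise any matching document is.
    """
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    doc_sets = [set(p) for p in postings]
    if require_all:
        candidates = set.intersection(*doc_sets)
    else:
        candidates = set.union(*doc_sets)
    scores = {doc: sum(p.get(doc, 0) for p in postings) for doc in candidates}
    # Sort by score (descending), then docId for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


index = {
    "alice": {"book1": 12, "book2": 1},
    "wonderland": {"book1": 3},
}
print(search(index, ["alice", "wonderland"]))                     # ['book1']
print(search(index, ["alice", "wonderland"], require_all=False))  # ['book1', 'book2']
```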

⚙️ Configuration

- Corpus list: edit books.json (URLs, titles, ids).
- Download script: tweak downloadbooks.sh for paths/timeouts.
- Normalization: update your mapper's tokenization/stopword rules (inside Mapper.zip).
- Scoring: change reducer/merge logic for TF-IDF/BM25 (inside reducer/merge zips).
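
As one example of a scoring change, the Python sketch below turns raw term frequencies into TF-IDF weights at merge time. It assumes an index shaped as term → {docId: tf} and uses idf = log(N / df), which is one common variant; the repository's own reducer/merge logic may compute scores differently, and `tf_idf_index` is a hypothetical name.

```python
import math


def tf_idf_index(index, num_docs):
    """Attach a TF-IDF weight to each posting.

    `index` maps term -> {docId: tf}; `num_docs` is the corpus size N.
    Uses idf = log(N / df), where df is the number of documents with the term.
    """
    weighted = {}
    for term, postings in index.items():
        if not postings:
            continue  # skip terms with no documents to avoid dividing by zero
        idf = math.log(num_docs / len(postings))
        weighted[term] = {doc: tf * idf for doc, tf in postings.items()}
    return weighted


index = {"alice": {"a": 2}, "the": {"a": 5, "b": 3}}
print(tf_idf_index(index, num_docs=2))
```

A term that appears in every document gets an idf of zero, so very common words stop dominating the ranking.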

✅ Testing your pipeline

- Start with a tiny books.json (2-5 docs).
- Verify intermediate files after Map and Reduce (spot-check a few terms).
- Confirm the Merge produces one small inverted index the UI can load.
- Search a few known terms and ensure the expected documents appear.
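
The last check can be automated with a small assertion helper. The Python sketch below assumes the merged index is loaded as term → {docId: tf}; `check_index` and the example terms are hypothetical and would need to match your own corpus.

```python
def check_index(index, expectations):
    """Return a list of problems; an empty list means every check passed.

    `expectations` maps a known term to the docIds that must appear for it.
    """
    problems = []
    for term, expected_docs in expectations.items():
        found = set(index.get(term, {}))
        missing = set(expected_docs) - found
        if missing:
            problems.append(f"{term!r}: missing {sorted(missing)}")
    return problems


index = {"alice": {"book1": 12}, "rabbit": {"book1": 4, "book2": 1}}
print(check_index(index, {"alice": ["book1"], "rabbit": ["book1", "book3"]}))
```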

📈 Performance notes

- Use a Combiner to reduce shuffle.
- Shard the corpus evenly to avoid stragglers.
- For FaaS, keep functions stateless and keep inputs and intermediate outputs in object storage.
- Tune batch size (number of docs per invocation) for concurrency and cold-start trade-offs.
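
Even sharding usually means balancing by document size rather than document count. The Python sketch below uses a greedy largest-first assignment, a common heuristic that is not necessarily what this project uses. `balance_shards` is a hypothetical name, and it assumes you already know each document's size in bytes.

```python
import heapq


def balance_shards(doc_sizes, num_shards):
    """Greedy largest-first assignment of documents to shards by size.

    `doc_sizes` maps docId -> size in bytes. Returns a list of docId lists,
    one per shard. Each document goes to the currently lightest shard.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards = [[] for _ in range(num_shards)]
    heap = [(0, i) for i in range(num_shards)]  # (total size, shard index)
    for doc, size in sorted(doc_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i = heapq.heappop(heap)
        shards[i].append(doc)
        heapq.heappush(heap, (total + size, i))
    return shards


print(balance_shards({"a": 10, "b": 9, "c": 2, "d": 1}, num_shards=2))
# [['a', 'd'], ['b', 'c']]
```

Each shard can then be passed to one mapper invocation, which ties the shard count directly to the batch-size trade-off above.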

Maintainer: Avinash Pawar
Demo: https://avinashpawar.dev/A5-ECC/index.html
