A lightweight search engine that builds an inverted index over a corpus of documents using a MapReduce pipeline. It's designed to be deployed on Function-as-a-Service (FaaS) platforms for elastic, pay-per-use scaling, but can also be run locally for demo and testing.
Live Demo: https://avinashpawar.dev/A5-ECC/index.html
- Crawls/ingests a list of documents (e.g., public domain books).
- Maps tokens → (word, docId, count), reduces to aggregate term frequencies.
- Merges partial results to produce a compact inverted index.
- Serves search queries via a minimal web UI (demo) that returns matching document links.
.
├─ A5-ECC.zip # Front-end UI (open index.html after unzipping)
├─ cloudMapReduce.zip # Cloud/FaaS pipeline code (packaged)
├─ Mapper.zip # Mapper stage (tokenize/emit term counts)
├─ Reducer.zip # Reducer stage (aggregate counts per term)
├─ MergeMapper.zip # Merge stage - mapper helper
├─ MergeReducer.zip # Merge stage - reducer helper
├─ books.json # Sample corpus list (document URLs/metadata)
├─ downloadbooks.sh # Helper script to download corpus locally
├─ Report.pdf # Project write-up / design notes
└─ README.md # (This file)
The code for each stage is provided as a zip archive; deploy the archives as-is or unzip them locally, depending on your environment.
- Open the Live Demo and follow the on-screen instructions:
  - Use Search to query the pre-indexed dataset.
  - Use Advanced Search to provide your own links and build a small ad-hoc index in the browser-driven flow.
- Unzip A5-ECC.zip.
- Open index.html in a browser.
- Use the same flow as the hosted demo.
- Review or modify books.json to point to your desired text files (URLs or local references).
- Run:
      chmod +x downloadbooks.sh
      ./downloadbooks.sh
  This fetches the texts into a local folder (e.g., ./books/).
You have two options:
A) Local (simple run for demo/testing)
- Unzip Mapper.zip, Reducer.zip, MergeMapper.zip, and MergeReducer.zip.
- Run the stages on your machine in sequence over the downloaded corpus:
  1. Map: tokenize each document → emit (word, docId, count)
  2. Reduce: aggregate by (word, docId) → term frequencies
  3. Merge: combine shard outputs into a single inverted index file (e.g., JSON/CSV)
- Place the final index where the UI expects it (e.g., a data/ or index/ folder referenced by the demo UI).
B) Cloud/FaaS (scalable run)
- Unzip cloudMapReduce.zip and follow your provider's steps (AWS Lambda / GCP Cloud Functions / Azure Functions):
  - Upload mapper/reducer/merge functions.
  - Wire them with a storage bucket (for inputs/outputs) and a coordinator (can be a simple script or Step Functions/Workflows).
  - Trigger the pipeline; verify index artifacts are written to storage and exposed to your UI.
Tip: Keep inputs/outputs in object storage (s3://... / gs://...) so each stage reads/writes cleanly.
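As a rough illustration of the Cloud/FaaS option, a stateless map function might look like the sketch below. Everything named here is hypothetical: `InMemoryStore` stands in for a real object store (a deployment would use the provider's SDK), and the event fields `doc_id` and `input_key` and the `map-output/` key prefix are assumptions, not the layout used in cloudMapReduce.zip.

```python
import json
import re
from collections import Counter


class InMemoryStore:
    """Hypothetical stand-in for an object store such as a bucket."""

    def __init__(self):
        self.objects = {}

    def get(self, key):
        return self.objects[key]

    def put(self, key, body):
        self.objects[key] = body


def mapper_handler(event, store):
    """Stateless map step: read one document, write its term counts.

    All state lives in the store, so any instance can handle any event.
    """
    doc_id = event["doc_id"]
    text = store.get(event["input_key"])
    # Normalize to lowercase and keep alphanumeric runs as tokens.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    output_key = f"map-output/{doc_id}.json"
    store.put(output_key, json.dumps({"doc_id": doc_id, "counts": counts}, sort_keys=True))
    return {"output_key": output_key}


store = InMemoryStore()
store.put("books/d1.txt", "Down the rabbit hole. The rabbit ran.")
result = mapper_handler({"doc_id": "d1", "input_key": "books/d1.txt"}, store)
payload = json.loads(store.get(result["output_key"]))
print(payload["counts"]["rabbit"])  # 2
```

Because the handler only takes an event and a store, a coordinator can fan out one invocation per document and collect the returned output keys for the reduce stage.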
- Single term: alice
- Multiple terms: alice wonderland
- Exact phrase (if supported by your indexer): "alice in wonderland"
- Basic ranking: simple TF or TF-IDF style scoring (depending on how you configure your reducer/merge logic).
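To make the ranking bullet concrete, here is a minimal sketch of TF-IDF style scoring over an index shaped as term → list of (docId, tf). The sample index, the `search` function, the smoothed IDF formula, and the choice to treat multiple terms as an OR query are all assumptions for illustration; the actual behavior depends on how the reducer/merge logic and UI are configured.

```python
import math
from collections import defaultdict

# Hypothetical index: term -> list of (docId, term frequency).
INDEX = {
    "alice": [("d1", 3), ("d2", 1)],
    "wonderland": [("d1", 1)],
    "rabbit": [("d2", 2), ("d3", 1)],
}
TOTAL_DOCS = 3


def search(query, index=INDEX, total_docs=TOTAL_DOCS):
    """Score documents by summing tf * idf over the query terms (OR semantics)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, [])
        if not postings:
            continue  # unknown terms contribute nothing
        # Smoothed IDF so a term present in every document still counts a little.
        idf = math.log(total_docs / len(postings)) + 1.0
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by docId for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print([doc_id for doc_id, _ in search("alice wonderland")])  # ['d1', 'd2']
```

Exact-phrase queries would need token positions in the postings, which this sketch does not store.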
- Mapper: reads raw text, normalizes (lowercase, strip punctuation), tokenizes, emits (term, docId) → 1.
- Combiner (optional): local aggregation to cut shuffle volume.
- Reducer: sums counts per (term, docId) and may compute additional stats (doc length, df, idf).
- Merge: consolidates shard outputs into a single inverted index file (term → list of (docId, tf[, score])).
- UI: takes a query, looks up terms in the index, merges posting lists, and returns ranked document links.
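The stages above can be sketched in a few lines of single-process Python. This is an illustrative sketch only, not the code shipped in the zip archives: the function names, the tokenization regex, and the two sample documents are assumptions.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def map_document(doc_id, text):
    """Mapper: lowercase, drop punctuation, tokenize, emit ((term, docId), 1)."""
    for term in TOKEN_RE.findall(text.lower()):
        yield (term, doc_id), 1


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output to shrink the shuffle."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts


def reduce_counts(partials):
    """Reducer: sum counts per (term, docId) across mapper outputs."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # Counter.update adds counts
    return totals


def merge_index(shards):
    """Merge: fold reducer shards into term -> sorted list of (docId, tf)."""
    postings = defaultdict(dict)
    for shard in shards:
        for (term, doc_id), tf in shard.items():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + tf
    return {term: sorted(docs.items()) for term, docs in postings.items()}


docs = {
    "d1": "Alice was beginning to get very tired.",
    "d2": "Alice! said the Rabbit. Alice?",
}
partials = [combine(map_document(doc_id, text)) for doc_id, text in docs.items()]
index = merge_index([reduce_counts(partials)])
print(index["alice"])  # [('d1', 1), ('d2', 2)]
```

In a distributed run, each call to `combine(map_document(...))` would be one mapper invocation, and several reducers would each produce one shard for `merge_index`.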
- Corpus list: edit books.json (URLs, titles, ids).
- Download script: tweak downloadbooks.sh for paths/timeouts.
- Normalization: update your mapper's tokenization/stopword rules (inside Mapper.zip).
- Scoring: change reducer/merge logic for TF-IDF/BM25 (inside reducer/merge zips).
- Start with a tiny books.json (2-5 docs).
- Verify intermediate files after Map and Reduce (spot-check a few terms).
- Confirm the Merge produces one small inverted index the UI can load.
- Search a few known terms and ensure the expected documents appear.
- Use a Combiner to reduce shuffle.
- Shard the corpus evenly to avoid stragglers.
- For FaaS, keep functions stateless and have each stage read its inputs from, and write its outputs to, object storage.
- Tune batch size (number of docs per invocation) for concurrency and cold-start trade-offs.
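One way to shard the corpus evenly is to balance by document size rather than document count. The sketch below uses a greedy largest-first assignment; `shard_by_size` and the sample sizes are hypothetical, and byte size is only a rough proxy for per-document processing time.

```python
import heapq


def shard_by_size(doc_sizes, num_shards):
    """Assign each document to the currently lightest shard, largest docs first."""
    # Min-heap of (total bytes assigned so far, shard id).
    heap = [(0, shard_id) for shard_id in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]
    for doc_id, size in sorted(doc_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, shard_id = heapq.heappop(heap)
        shards[shard_id].append(doc_id)
        heapq.heappush(heap, (load + size, shard_id))
    return shards


sizes = {"a": 900, "b": 500, "c": 400, "d": 300, "e": 100}
shards = shard_by_size(sizes, 2)
print(shards)  # [['a', 'd'], ['b', 'c', 'e']]
```

Here the two shards end up with 1200 and 1000 bytes, whereas splitting the same list in order by count (a, b, c versus d, e) would give 1800 and 400, leaving one invocation as the straggler.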
Maintainer: Avinash Pawar
Demo: https://avinashpawar.dev/A5-ECC/index.html