A fully functional search engine built from scratch using Python, demonstrating core information retrieval concepts.
This project implements the fundamental concepts that power all search engines:
- Inverted Index - Maps words to documents (the core data structure)
- TF-IDF Ranking - Calculates document relevance scores
- Tokenization - Breaks text into searchable words
- Query Processing - Handles search queries and finds matches
- RESTful API Design - Exposes search functionality via HTTP
Frontend (HTML/JS) → FastAPI Backend → Search Engine Core → Inverted Index
- search_engine.py - Core search logic (indexing, TF-IDF, ranking)
- main.py - FastAPI REST API endpoints
- index.html - Web interface for searching
- sample_documents.py - Pre-loaded sample documents
pip install -r requirements.txt
python main.py

Or using uvicorn directly:

uvicorn main:app --reload

Navigate to: http://localhost:8000
The server will automatically load 10 sample documents about programming topics.
GET /search?q=python&top_k=10
Returns ranked search results for the query.
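For example, assuming the server is running locally on port 8000, the endpoint can be called from Python with the requests library (the exact shape of the JSON response depends on main.py):

```python
import requests

# Query the local server for the top 5 results matching "python"
resp = requests.get("http://localhost:8000/search", params={"q": "python", "top_k": 5})
resp.raise_for_status()
print(resp.json())  # ranked results; exact response fields depend on main.py
```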
POST /documents
Content-Type: application/json
{
"title": "My Document",
"content": "Document content here..."
}
Adds a new document to the search index.
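The same request from Python might look like this (again assuming a local server on the default port; the payload mirrors the JSON body shown above):

```python
import requests

new_doc = {
    "title": "My Document",
    "content": "Document content here...",
}

# Send the document to the API so it gets tokenized and added to the inverted index
resp = requests.post("http://localhost:8000/documents", json=new_doc)
resp.raise_for_status()
print(resp.json())
```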
GET /documents
Returns all indexed documents.
GET /stats
Returns search engine statistics (total documents, words, etc.).
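Both read-only endpoints can be exercised the same way (illustrative only):

```python
import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/documents").json())  # every indexed document
print(requests.get(f"{base}/stats").json())      # totals: documents, words, etc.
```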
Visit http://localhost:8000/docs for automatic Swagger UI documentation.
When you add a document:
- Tokenization: Text is broken into words (lowercase, punctuation removed)
- Inverted Index Building: For each word, we record which documents contain it
- Statistics: We track word frequencies for TF-IDF calculations
Example:
Document 1: "Python is great"
Document 2: "Python web apps"
Inverted Index:
"python" → [Doc1, Doc2]
"great" → [Doc1]
"web" → [Doc2]
"apps" → [Doc2]
When you search for "python":
- Query Tokenization: "python" → ["python"]
- Find Candidates: Use inverted index to find documents containing "python"
- Calculate Scores: For each candidate, calculate TF-IDF score
- Rank Results: Sort by score (highest first)
- Return Top K: Return only the top_k highest-scoring results
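Continuing the sketch above (it reuses tokenize, inverted_index and term_counts from the indexing example, and is not the code in search_engine.py), the five steps map to a few lines of Python:

```python
import math

def search(query: str, top_k: int = 10) -> list[tuple[int, float]]:
    query_words = tokenize(query)                        # 1. tokenize the query
    candidates: set[int] = set()
    for word in query_words:                             # 2. find candidate documents
        candidates |= inverted_index.get(word, set())

    total_docs = len(term_counts)
    scores: dict[int, float] = {}
    for doc_id in candidates:                            # 3. score each candidate
        doc_length = sum(term_counts[doc_id].values())
        score = 0.0
        for word in query_words:
            tf = term_counts[doc_id].get(word, 0) / doc_length
            df = len(inverted_index.get(word, set()))
            idf = math.log(total_docs / df) if df else 0.0
            score += tf * idf
        scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)  # 4. rank
    return ranked[:top_k]                                # 5. return the top K

print(search("python web"))  # e.g. [(2, 0.231...), (1, 0.0)] with the two sample docs
```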
TF (Term Frequency): How often a word appears in a document
TF = count(word in document) / total words in document
IDF (Inverse Document Frequency): How rare/common a word is
IDF = log(total documents / documents containing word)
TF-IDF Score:
Score = TF × IDF
Why this works:
- Common words (like "the", "is") have low IDF → low scores
- Rare, relevant words have high IDF → high scores
- Documents with more occurrences of query words get higher TF → higher scores
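A quick worked example with the two documents from the indexing section shows both effects (natural log, numbers rounded):

```python
import math

# Doc1 = "Python is great", Doc2 = "Python web apps"
total_docs = 2

# "web": appears once in Doc2 (3 words total) and in 1 of the 2 documents
tf_web = 1 / 3                              # 0.333...
idf_web = math.log(total_docs / 1)          # log(2) ~ 0.693
print(round(tf_web * idf_web, 3))           # 0.231 -> "web" strongly favours Doc2

# "python": appears in both documents, so IDF = log(2/2) = 0 and it adds nothing
print((1 / 3) * math.log(total_docs / 2))   # 0.0
```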
Instead of storing: Document → Words
We store: Word → Documents
Why? Looking up a query term becomes a constant-time dictionary lookup instead of a scan over every document!
Breaking text into searchable units:
- "Python is great!" → ["python", "is", "great"]
- Handles: lowercase conversion, punctuation removal, whitespace splitting
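One possible implementation of exactly those three steps (the tokenizer in search_engine.py may differ, e.g. in how it strips punctuation):

```python
import string

def tokenize(text: str) -> list[str]:
    # 1. lowercase  2. strip punctuation  3. split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

print(tokenize("Python is great!"))  # ['python', 'is', 'great']
```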
Not all matches are equal. TF-IDF scores documents by:
- Relevance: How well does it match the query?
- Importance: How important are the matching words?
- Quality: Documents with more relevant content rank higher
Use the web interface or API:
POST /documents
{
"title": "Your Title",
"content": "Your content here..."
}

Modify the _calculate_tfidf method in search_engine.py to experiment with different scoring algorithms.
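The method's actual signature isn't shown here, so treat the following only as a sketch of one common variation: sublinear TF scaling (1 + log of the term count) damps terms that repeat many times in a single document:

```python
import math

def sublinear_tfidf(term_count: int, doc_freq: int, total_docs: int) -> float:
    """Illustrative alternative scoring; adapt to _calculate_tfidf's actual signature."""
    if term_count == 0 or doc_freq == 0:
        return 0.0
    tf = 1 + math.log(term_count)            # sublinear TF instead of a raw ratio
    idf = math.log(total_docs / doc_freq)
    return tf * idf
```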
Modify the top_k parameter in the search endpoint (default: 10).
Want to extend this? Here are ideas:
- BM25 Ranking - Improved ranking algorithm (better than TF-IDF)
- Vector Embeddings - Semantic search using word embeddings
- Database Storage - Persist index to database (SQLite, PostgreSQL)
- Autocomplete - Suggest queries as user types
- Faceted Search - Filter by categories/tags
- Multi-field Search - Search in title, content, tags separately
- Fuzzy Matching - Handle typos and misspellings
- Pagination - Handle large result sets
- Caching - Cache frequent queries
- Distributed Search - Scale across multiple servers
Port already in use?
uvicorn main:app --port 8001

CORS errors?
The frontend is configured to work with localhost:8000. If using a different port, update API_BASE in index.html.
No results? Make sure the sample documents loaded; check the server logs on startup.
Feel free to use and modify as needed!