A simple, scalable, and colorful command-line search engine for text stories using an inverted index with lemmatization and stopword removal. Designed for easy extensibility and local deployment.
- Inverted Index: Fast word and multi-word search using lemmatization.
- Stopword Removal: Ignores common English stopwords for smarter search.
- Scalable: Add new `.txt` files to the `documents/` folder and rerun to update the index.
- Colorful Terminal Output: Results and prompts are color-coded for clarity.
- Unique Results: Only unique document titles are shown, with a preview of the first line (up to 50 characters).
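To make the inverted-index idea concrete, here is a minimal, self-contained sketch. It is not the project's actual code: the function names, the tiny stopword set, and the crude plural-stripping `normalize` (a stand-in for the NLTK lemmatizer) are all assumptions made for illustration. It also assumes a multi-word query matches only documents containing every remaining term, which the project may or may not do.

```python
import re
from collections import defaultdict

# Assumption: a tiny hand-picked list standing in for NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "was"}

def normalize(word):
    """Crude stand-in for a lemmatizer: strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stopwords, normalize the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(w) for w in words if w not in STOPWORDS]

def build_index(documents):
    """Map each normalized term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in tokenize(text):
            index[term].add(name)
    return index

def search(index, query, max_results=2):
    """Return up to max_results documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)[:max_results]

# Hypothetical sample documents, keyed by file name.
docs = {
    "the_fox.txt": "The fox chased the rabbits in the forest.",
    "the_owl.txt": "An owl watched the forest at night.",
    "the_sea.txt": "Waves rolled across the sea.",
}
index = build_index(docs)
print(search(index, "forests"))     # ['the_fox.txt', 'the_owl.txt']
print(search(index, "the rabbit"))  # ['the_fox.txt']
```

Because both the documents and the query pass through the same `tokenize` step, "forests" and "forest" land on the same index entry, and a stopword such as "the" never reaches the lookup.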
- Install dependencies (in your virtual environment): `pip install -r requirements.txt`
- Run the search engine: `python src/main.py`
- Add your stories:
  - Place your `.txt` files in the `documents/` folder. Each file is a separate document.
- Search:
  - Enter a word or phrase at the prompt. The engine shows up to 2 unique matching documents, each with its title and a preview.
  - Type `exit` to quit.
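The prompt loop described above could look roughly like the sketch below. The names `colorize` and `run_prompt`, the color choices, and the prompt text are hypothetical; the real helpers in `src/utils/terminal_utils.py` may differ. The `read` and `write` parameters default to `input` and `print`, and are replaced with scripted stand-ins here so the example runs without a terminal.

```python
# ANSI escape sequences; most modern terminals render these as colors.
GREEN = "\033[32m"
CYAN = "\033[36m"
RED = "\033[31m"
RESET = "\033[0m"

def colorize(text, color):
    """Wrap text in a color code and reset the terminal style afterwards."""
    return f"{color}{text}{RESET}"

def run_prompt(search_fn, read=input, write=print):
    """Read queries until the user types 'exit', writing colored results."""
    while True:
        query = read(colorize("Search (or 'exit'): ", CYAN)).strip()
        if query.lower() == "exit":
            break
        results = search_fn(query)
        if not results:
            write(colorize("No matching documents.", RED))
            continue
        for title in results:
            write(colorize(title, GREEN))

# Scripted demo: a fake search function and canned input instead of a keyboard.
scripted = iter(["fox", "dragon", "exit"])
output = []
run_prompt(
    lambda q: ["THE FOX"] if "fox" in q else [],
    read=lambda prompt: next(scripted),
    write=output.append,
)
for line in output:
    print(line)
```

In interactive use, calling `run_prompt(search_fn)` with no other arguments would read from the keyboard and print straight to the terminal.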
Mini-Search-Engine/
├── documents/ # Your .txt story files go here
├── nltk_data/ # NLTK resources (auto-managed)
├── src/
│   ├── main.py # Entry point
│   ├── document_manager.py
│   ├── inverted_index.py
│   └── utils/
│       └── terminal_utils.py
├── .gitignore
├── requirements.txt
└── README.md
- Add more stories: Just drop new `.txt` files in `documents/` and rerun.
- Change preview length: Edit the preview length in `src/main.py` (default: 50 characters).
- Change number of results: Edit the return value in `DocumentManager.search()` (default: 2).
- All NLTK data is stored locally in `nltk_data/` (see `.gitignore`).
- Only English stopwords are used for filtering queries.
- Document titles are shown in uppercase, with underscores and `.txt` removed.
- Only the first line of each document is shown, up to 50 characters, with `...` if longer.
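The title and preview rules in these notes can be sketched as two small helpers. The names `format_title` and `make_preview` are hypothetical, and the sketch assumes underscores are replaced with spaces rather than dropped outright; check `src/main.py` for the actual behavior.

```python
def format_title(filename):
    """Turn a file name like 'the_lost_key.txt' into a display title."""
    name = filename[:-4] if filename.lower().endswith(".txt") else filename
    # Assumption: underscores become spaces so the words stay separated.
    return name.replace("_", " ").upper()

def make_preview(text, max_len=50):
    """Return the first line, cut to max_len characters plus '...' if longer."""
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if len(first_line) > max_len:
        return first_line[:max_len] + "..."
    return first_line

print(format_title("the_lost_key.txt"))           # THE LOST KEY
print(make_preview("Once upon a time.\nMore."))   # Once upon a time.
```

Keeping `max_len` as a parameter with a default of 50 means the preview length can be changed in one place.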
MIT License