Mini Search Engine

A simple command-line search engine for text stories, built on an inverted index with lemmatization and stopword removal. It runs locally, color-codes its terminal output, and is designed to be easy to extend.

Features

  • Inverted Index: Fast single-word and multi-word search over lemmatized terms.
  • Stopword Removal: Ignores common English stopwords so that queries match on meaningful terms.
  • Scalable: Add new .txt files to the documents/ folder and rerun the program to update the index.
  • Colorful Terminal Output: Results and prompts are color-coded for clarity.
  • Unique Results: Only unique document titles are shown, with a preview of the first line (up to 50 characters).
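The indexing idea behind these features can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the tiny `STOPWORDS` set and the naive `normalize` function are stand-ins for NLTK's stopword list and lemmatizer, all function names are hypothetical, and treating a multi-word query as "every term must match" is an assumption.

```python
from collections import defaultdict
import re

# Toy stand-ins (assumptions): the project uses NLTK's English stopwords and
# a lemmatizer; a tiny set and a plural-stripper keep this sketch dependency-free.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def normalize(word: str) -> str:
    """Rough stand-in for lemmatization: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic words, drop stopwords, normalize the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(w) for w in words if w not in STOPWORDS]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return documents that contain every non-stopword term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

Because queries pass through the same `tokenize` step as documents, a search for "dogs" finds documents that only contain "dog", and a query made up entirely of stopwords returns nothing.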

Usage

  1. Install dependencies (in your virtual environment):

    pip install -r requirements.txt

  2. Run the search engine:

    python src/main.py

  3. Add your stories:

    • Place your .txt files in the documents/ folder. Each file is a separate document.

  4. Search:

    • Enter a word or phrase at the prompt. The engine will show up to 2 unique matching documents, with the title and a preview.
    • Type exit to quit.
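Step 3 treats each .txt file in documents/ as one document. A minimal sketch of how such a folder might be read is shown here; `load_documents` is a hypothetical helper, and UTF-8 encoding is an assumption, so the project's own loader may differ.

```python
from pathlib import Path


def load_documents(folder: str = "documents") -> dict[str, str]:
    """Read every .txt file directly inside `folder` into {filename: text}."""
    docs = {}
    # sorted() gives a stable order regardless of the filesystem
    for path in sorted(Path(folder).glob("*.txt")):
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

Files with other extensions are skipped, so notes or images left in the folder would not end up in the index.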

Project Structure

Mini-Search-Engine/
├── documents/           # Your .txt story files go here
├── nltk_data/           # NLTK resources (auto-managed)
├── src/
│   ├── main.py          # Entry point
│   ├── document_manager.py
│   ├── inverted_index.py
│   └── utils/
│       └── terminal_utils.py
├── .gitignore
├── requirements.txt
└── README.md

Customization

  • Add more stories: Just drop new .txt files in documents/ and rerun.
  • Change preview length: Edit the preview length in src/main.py (default: 50 characters).
  • Change number of results: Adjust how many results DocumentManager.search() returns (default: 2).
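One way to make the result count configurable is to pass it as a parameter instead of hard-coding it. The sketch below is illustrative only: `top_unique` and `max_results` are hypothetical names, not part of the project's `DocumentManager` API.

```python
def top_unique(matches: list[str], max_results: int = 2) -> list[str]:
    """Return the first `max_results` distinct names, preserving order."""
    seen = set()
    unique = []
    for name in matches:
        if name in seen:
            continue  # skip documents that were already listed
        seen.add(name)
        unique.append(name)
        if len(unique) == max_results:
            break
    return unique
```

Calling `top_unique(matches, max_results=5)` would then show up to five documents without touching the search logic itself.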

Notes

  • All NLTK data is stored locally in nltk_data/ (see .gitignore).
  • Only English stopwords are used for filtering queries.
  • Document titles are shown in uppercase, with underscores and .txt removed.
  • The preview shows only the first line of each document, truncated to 50 characters and followed by ... if the line is longer.
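The title and preview rules above can be sketched as two small functions. Both names are hypothetical, and replacing underscores with spaces (rather than deleting them outright) is an assumption, since the notes only say that underscores are removed.

```python
def format_title(filename: str) -> str:
    """Turn 'the_little_prince.txt' into 'THE LITTLE PRINCE'."""
    name = filename[:-4] if filename.lower().endswith(".txt") else filename
    # Assumption: underscores become spaces so multi-word titles stay readable
    return name.replace("_", " ").upper()


def make_preview(text: str, max_len: int = 50) -> str:
    """Return the first line, cut to `max_len` characters plus '...' if longer."""
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if len(first_line) > max_len:
        return first_line[:max_len] + "..."
    return first_line
```

Keeping `max_len` as a parameter mirrors the "Change preview length" customization: the default of 50 matches the documented behavior, and a different value can be passed in without editing the function.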

License

MIT License