saqlain2204/Mini-Search-Engine

Mini Search Engine

A simple command-line search engine for text stories, built on an inverted index with lemmatization and stopword removal and using color-coded terminal output. It is designed to be easy to extend and to run locally.

Features

  • Inverted Index: Fast word and multi-word search using lemmatization.
  • Stopword Removal: Ignores common English stopwords for smarter search.
  • Scalable: Add new .txt files to the documents/ folder and rerun to update the index.
  • Colorful Terminal Output: Results and prompts are color-coded for clarity.
  • Unique Results: Only unique document titles are shown, with a preview of the first line (up to 50 characters).
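
To make the inverted-index idea concrete, here is a minimal sketch, not the project's actual code. It assumes a tiny hardcoded stopword set and a crude plural-stripping `normalize()` as stand-ins for the NLTK stopword list and lemmatizer the project relies on, and it assumes that a multi-word query matches documents containing all of its terms. Every name in it (`STOPWORDS`, `normalize`, `tokenize`, `build_index`, `search`) is hypothetical.

```python
import re
from collections import defaultdict

# Tiny stand-in stopword list (assumption); the project uses NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def normalize(word: str) -> str:
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, drop stopwords, normalize each term."""
    words = re.findall(r"[a-z]+", text.lower())
    return [normalize(w) for w in words if w not in STOPWORDS]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized term to the set of document names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return documents containing every non-stopword term of the query (AND semantics, assumed)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because queries pass through the same `tokenize()` as the documents, a query for "cat" would match a story that only contains "cats", and a query made up entirely of stopwords returns nothing.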

Usage

  1. Install dependencies (in your virtual environment):

    pip install -r requirements.txt
  2. Run the search engine:

    python src/main.py
  3. Add your stories:

    • Place your .txt files in the documents/ folder. Each file is a separate document.
  4. Search:

    • Enter a word or phrase at the prompt. The engine will show up to 2 unique matching documents, with the title and a preview.
    • Type exit to quit.
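
Step 3 relies on every .txt file in documents/ being picked up as its own document. A minimal sketch of how such loading could work, assuming UTF-8 files and a hypothetical helper named `load_documents` (the real logic lives in src/document_manager.py and may differ):

```python
from pathlib import Path


def load_documents(folder: str = "documents") -> dict[str, str]:
    """Read every .txt file in `folder` into a {filename: text} mapping."""
    docs = {}
    # sorted() keeps the load order stable between runs
    for path in sorted(Path(folder).glob("*.txt")):
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs
```

Files with other extensions are ignored, so notes or images left in the folder would not end up in the index under this approach.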

Project Structure

Mini-Search-Engine/
├── documents/           # Your .txt story files go here
├── nltk_data/           # NLTK resources (auto-managed)
├── src/
│   ├── main.py          # Entry point
│   ├── document_manager.py
│   ├── inverted_index.py
│   └── utils/
│       └── terminal_utils.py
├── .gitignore
├── requirements.txt
└── README.md

Customization

  • Add more stories: Just drop new .txt files in documents/ and rerun.
  • Change preview length: Edit the preview length in src/main.py (default: 50 characters).
  • Change number of results: Edit the number of results returned by DocumentManager.search() (default: 2).
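
As an illustration of the result limit, the sketch below deduplicates matches and caps them at a configurable count. The function name `top_unique` and its `max_results` parameter are hypothetical; the actual DocumentManager.search() may apply the limit differently.

```python
def top_unique(matches, max_results=2):
    """Keep the first `max_results` distinct names, preserving their order."""
    seen = set()
    unique = []
    for name in matches:
        if len(unique) >= max_results:
            break
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique
```

Exposing the limit as a parameter, rather than a hardcoded value, would let the number of results be changed in one place.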

Notes

  • All NLTK data is stored locally in nltk_data/ (see .gitignore).
  • Only English stopwords are used for filtering queries.
  • Document titles are shown in uppercase, with underscores and the .txt extension removed.
  • Only the first line of each document is shown, up to 50 characters, with ... if longer.
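
The title and preview rules above could be sketched as follows. This is an illustration only: `format_title` and `make_preview` are hypothetical names, and replacing underscores with spaces (rather than deleting them outright) is an assumption about how titles are cleaned up.

```python
def format_title(filename: str) -> str:
    """Drop a trailing .txt, turn underscores into spaces (assumed), uppercase."""
    name = filename[:-4] if filename.lower().endswith(".txt") else filename
    return name.replace("_", " ").upper()


def make_preview(text: str, limit: int = 50) -> str:
    """Return the first line of `text`, cut to `limit` characters plus '...' if longer."""
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    if len(first_line) > limit:
        return first_line[:limit] + "..."
    return first_line
```

For example, `format_title("the_lost_key.txt")` gives `THE LOST KEY`, and a first line of 50 characters or fewer is returned unchanged.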

License

MIT License
