Fast and accurate search engine across UCI's ICS domain. Combines modern information retrieval techniques with efficient data structures to deliver results in milliseconds.

jasperdoan/ics-search-engine

Search Engine Overview

Our search engine provides fast and accurate search capabilities across UCI's ICS domain. It combines modern information retrieval techniques with efficient data structures to deliver results in milliseconds.

Performance Metrics

| Query Type | Response Time |
| --- | --- |
| Single term | 10-100 ms |
| Multi-term | 100-200 ms |
| Complex (5+ terms) | 200-300 ms |

Core Architecture

The system is built on four main components working in harmony:

  1. Document Processing Pipeline

| Component | Function |
| --- | --- |
| HTML Parser | Extracts clean text from web pages using BeautifulSoup4 |
| Text Analyzer | Identifies important content from headers and titles |
| Duplicate Detector | Prevents index bloat using the SimHash algorithm |
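The duplicate-detection step can be illustrated with a minimal SimHash sketch. This is not the repository's code: the MD5-based token hash, the 64-bit width, the regex tokenizer, and the `max_distance` threshold of 3 are all assumptions chosen for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` as an integer."""
    weights = [0] * bits
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Stable per-token hash, truncated to `bits` bits (MD5 is an assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for that bit are positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two pages as duplicates when their fingerprints are close."""
    return hamming_distance(a, b) <= max_distance
```

The `Document` record stores its fingerprint as a string, so a fingerprint produced this way could be serialized with something like `format(fp, "016x")` before being saved.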
  2. Search Algorithm

| Feature | Description |
| --- | --- |
| TF-IDF Scoring | Measures term importance in documents |
| Cosine Similarity | Computes relevance between query and documents |
| PageRank & HITS | Incorporates web graph authority signals |
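As a rough illustration of the first two rows, the sketch below computes TF-IDF weights and cosine similarity over sparse dictionaries. The weighting variant `(1 + log tf) * log(N / df)` is an assumption; the README does not say which formula the project uses, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by (1 + log tf) * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Documents that share no terms score 0.0, and a document compared with itself scores 1.0 (up to floating-point error), which makes the measure convenient for ranking candidates against a query vector.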
  3. Index Management

| Strategy | Implementation |
| --- | --- |
| Storage | Hybrid Pickle/JSON for optimal speed/space tradeoff |
| Access | Peek-based retrieval to minimize memory usage |
| Caching | LRU cache for frequent terms and queries |
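One way to picture the access and caching rows is the sketch below. It reads a single posting list from disk by jumping to a stored byte offset, then keeps recent results in an LRU cache. Several things here are assumptions rather than the project's design: reading "peek-based" as offset-based file access, the JSON-lines layout (the README mentions a Pickle/JSON hybrid), and the names `write_index` and `IndexReader`.

```python
import json
from functools import lru_cache
from typing import Dict, List


def write_index(index: Dict[str, List[int]], path: str) -> Dict[str, int]:
    """Write one JSON line per term and return each term's byte offset."""
    offsets = {}
    with open(path, "wb") as f:
        for term in sorted(index):
            offsets[term] = f.tell()
            line = json.dumps({"term": term, "postings": index[term]}) + "\n"
            f.write(line.encode("utf-8"))
    return offsets


class IndexReader:
    """Reads single posting lists on demand instead of loading the whole index."""

    def __init__(self, path: str, offsets: Dict[str, int], cache_size: int = 1024):
        self.path = path
        self.offsets = offsets
        # Wrap the disk read in an LRU cache so frequent terms skip the file read.
        self.postings = lru_cache(maxsize=cache_size)(self._read_postings)

    def _read_postings(self, term: str) -> tuple:
        if term not in self.offsets:
            return ()
        with open(self.path, "rb") as f:
            f.seek(self.offsets[term])  # jump straight to this term's line
            return tuple(json.loads(f.readline())["postings"])
```

Only the small offset table has to stay in memory; posting lists are returned as tuples so that cached values cannot be mutated by callers.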
  4. Query Processing

| Stage | Operation |
| --- | --- |
| Tokenization | NLTK-based text normalization |
| Stemming | Porter stemming for word variations |
| Ranking | Multi-factor score combining relevance signals |
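The ranking stage could look something like the weighted sum below. The signal names and weights are hypothetical, since the README does not state which signals are combined or how; the sketch also assumes each signal has already been scaled to the range [0, 1].

```python
from typing import Dict, List

# Hypothetical weights; the README does not state the actual values.
DEFAULT_WEIGHTS = {"cosine": 0.6, "importance": 0.2, "pagerank": 0.2}


def combined_score(signals: Dict[str, float],
                   weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of relevance signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


def rank(candidates: Dict[int, Dict[str, float]], top_k: int = 10) -> List[int]:
    """Return doc IDs ordered by descending score, ties broken by doc ID."""
    scored = [(combined_score(signals), doc_id)
              for doc_id, signals in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

A linear combination keeps each signal's contribution easy to inspect and tune, at the cost of ignoring interactions between signals.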

Technical Implementation

The codebase is organized into focused modules:

Data Structures

Document

from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    url: str                    # Document URL
    content: str                # Processed text content
    doc_id: int                 # Unique document identifier
    simhash: str                # SimHash fingerprint for deduplication
    token_count: int            # Number of tokens in document
    outgoing_links: List[str]   # Outgoing URLs for link analysis

Posting

from dataclasses import dataclass
from typing import List

@dataclass
class Posting:
    doc_id: int            # Document identifier
    frequency: int         # Term frequency in document
    importance: float      # Combined weight from HTML tags
    tf_idf: float          # Term frequency-inverse document frequency score
    positions: List[int]   # Token positions for phrase queries
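To show why token positions are stored, here is a small sketch of phrase matching within a single document. The helper `phrase_match` is hypothetical and assumes positions are zero-based token offsets, one list per query term in phrase order.

```python
from typing import List


def phrase_match(position_lists: List[List[int]]) -> List[int]:
    """Return start positions where term i occurs at start + i for every i."""
    if not position_lists:
        return []
    starts = set(position_lists[0])
    for offset, positions in enumerate(position_lists[1:], start=1):
        # Shift each later term back by its offset and keep the common starts.
        starts &= {pos - offset for pos in positions}
    return sorted(starts)
```

For example, with positions `[0, 5, 9]` for the first term and `[1, 7, 10]` for the second, the two terms are adjacent at starts 0 and 9 but not at 5.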

Index Structure

{
    "term1": [Posting1, Posting2, ...],
    "term2": [Posting3, Posting4, ...],
    ...
}
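A mapping of this shape could be built roughly as follows. This is a sketch rather than the project's indexer: the `Posting` class here is a trimmed stand-in that omits `importance` and `tf_idf`, and `build_index` is a hypothetical helper that takes already-tokenized documents.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    # Trimmed stand-in for the README's Posting (no importance or tf_idf).
    doc_id: int
    frequency: int = 0
    positions: List[int] = field(default_factory=list)


def build_index(docs: Dict[int, List[str]]) -> Dict[str, List[Posting]]:
    """Map each term to its postings, ordered by ascending doc_id."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id in sorted(docs):
        by_term: Dict[str, Posting] = {}
        for position, term in enumerate(docs[doc_id]):
            posting = by_term.setdefault(term, Posting(doc_id))
            posting.frequency += 1
            posting.positions.append(position)
        for term, posting in by_term.items():
            index[term].append(posting)
    return dict(index)
```

Keeping each posting list sorted by `doc_id` makes it straightforward to intersect lists when a query contains several terms.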

Usage

  1. Build the index:
python3 indexer.py
  2. Start the search engine:
# For UI
streamlit run main.py

# For CLI
python3 search.py

Requirements

  • Python 3.7+
  • Streamlit
  • NLTK
  • BeautifulSoup4
  • NumPy
  • SciPy
  • scikit-learn
