Our search engine provides fast and accurate search capabilities across UCI's ICS domain. It combines modern information retrieval techniques with efficient data structures to deliver results in milliseconds.
| Query Type | Response Time |
|---|---|
| Single term | 10-100ms |
| Multi-term | 100-200ms |
| Complex (5+ terms) | 200-300ms |
The system is built on four main components working in harmony:
- Document Processing Pipeline
| Component | Function |
|---|---|
| HTML Parser | Extracts clean text from web pages using BeautifulSoup4 |
| Text Analyzer | Identifies important content from headers and titles |
| Duplicate Detector | Prevents index bloat using SimHash algorithm |
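The duplicate-detection step can be illustrated with a minimal sketch: a simplified 64-bit SimHash over whitespace tokens, not the project's actual implementation (token hashing via MD5 and the example pages are assumptions for illustration):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple SimHash fingerprint over whitespace tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        # Stable 64-bit hash of the token derived from MD5.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

page_a = "uci ics search engine project overview"
page_b = "uci ics search engine project summary"  # near-duplicate of page_a
# Near-duplicate pages share most token hashes, so their fingerprints
# differ in only a few bit positions.
distance = hamming_distance(simhash(page_a), simhash(page_b))
```

Pages whose fingerprints fall within a small Hamming-distance threshold can then be treated as duplicates and skipped during indexing.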
- Search Algorithm
| Feature | Description |
|---|---|
| TF-IDF Scoring | Measures term importance in documents |
| Cosine Similarity | Computes relevance between query and documents |
| PageRank & HITS | Incorporates web graph authority signals |
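The TF-IDF and cosine-similarity steps can be sketched over a toy corpus (the documents, unit-weighted query vector, and function names below are illustrative assumptions, not the project's code):

```python
import math
from collections import Counter

docs = {
    1: "machine learning course at uci ics",
    2: "ics search engine information retrieval",
    3: "campus dining hours and menus",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

def tf_idf(term: str, doc_id: int) -> float:
    """Term frequency in the document times inverse document frequency."""
    counts = Counter(tokenized[doc_id])
    tf = counts[term] / len(tokenized[doc_id])
    df = sum(1 for toks in tokenized.values() if term in toks)
    return tf * math.log(N / df) if df else 0.0

def cosine_score(query: str, doc_id: int) -> float:
    """Cosine similarity between a unit-weighted query vector and
    the document's TF-IDF vector, restricted to the query terms."""
    terms = query.split()
    d = [tf_idf(t, doc_id) for t in terms]
    dot = sum(d)                      # query weights are all 1.0
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(len(terms))
    return dot / (norm_d * norm_q) if norm_d else 0.0

ranked = sorted(docs, key=lambda d: cosine_score("ics search", d), reverse=True)
```

Here the document matching both query terms outranks the one matching only "ics", and the unrelated document scores zero.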
- Index Management
| Strategy | Implementation |
|---|---|
| Storage | Hybrid Pickle/JSON for optimal speed/space tradeoff |
| Access | Peek-based retrieval to minimize memory usage |
| Caching | LRU cache for frequent terms and queries |
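The caching strategy can be sketched with `functools.lru_cache` wrapped around a posting-list loader (the loader and the in-memory stand-in for the on-disk index are hypothetical; the real engine would seek into its index files):

```python
from functools import lru_cache

# Hypothetical stand-in for the on-disk index, as (doc_id, frequency) pairs.
POSTINGS_ON_DISK = {
    "search": [(1, 3), (7, 1)],
    "engine": [(7, 2)],
}

@lru_cache(maxsize=1024)
def load_postings(term: str):
    # In the real engine this would peek into the index file on disk;
    # here we read an in-memory dict to keep the sketch runnable.
    return tuple(POSTINGS_ON_DISK.get(term, ()))

load_postings("search")   # miss: loaded from "disk"
load_postings("search")   # hit: served from the LRU cache
info = load_postings.cache_info()
```

Repeated queries for frequent terms are then served from memory, while rarely used posting lists are evicted once `maxsize` is exceeded.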
- Query Processing
| Stage | Operation |
|---|---|
| Tokenization | NLTK-based text normalization |
| Stemming | Porter stemming for word variations |
| Ranking | Multi-factor score combining relevance signals |
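The tokenization and stemming stages can be sketched without external dependencies. The real pipeline uses NLTK and the Porter stemmer; the crude suffix stripper below is a toy stand-in, not Porter's algorithm:

```python
import re

def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token: str) -> str:
    """Toy suffix stripper standing in for NLTK's PorterStemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: len(token) - len(suffix)]
            break
    # Collapse a trailing double consonant left by stripping ("runn" -> "run").
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token

query = "Searching COURSES, graded labs!"
tokens = [crude_stem(t) for t in tokenize(query)]
```

Normalizing both documents and queries the same way ensures that "searching" and "searched" hit the same index entry.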
The codebase is organized into focused modules:
- `search.py`: Core search logic and ranking
- `indexer.py`: Document processing and index building
- `token_processor.py`: Text analysis and normalization
- `document_processor.py`: HTML handling and deduplication
```python
@dataclass
class Document:
    url: str                   # Document URL
    content: str               # Processed raw text content
    doc_id: int                # Unique document identifier
    simhash: str               # SimHash fingerprint for deduplication
    token_count: int           # Number of tokens in document
    outgoing_links: List[str]  # Outgoing URLs for link analysis

@dataclass
class Posting:
    doc_id: int           # Document identifier
    frequency: int        # Term frequency in document
    importance: float     # Combined weight from HTML tags
    tf_idf: float         # Term frequency-inverse document frequency score
    positions: List[int]  # Token positions for phrase queries
```

The inverted index maps each term to its list of postings:

```
{
    "term1": [Posting1, Posting2, ...],
    "term2": [Posting3, Posting4, ...],
    ...
}
```

- Build the index:
```
python3 indexer.py
```
- Start the search engine:
```
# For UI
streamlit run main.py

# For CLI
python3 search.py
```

Dependencies:
- Python 3.7+
- Streamlit
- NLTK
- BeautifulSoup4
- NumPy
- SciPy
- scikit-learn