LoayAhmed304/search-engine

Hola! Can you say "The best search engine ever" with me?

🔍 Search Engine Project

A full-stack search engine built with Java/Spring Boot backend and Vue.js frontend

Project Overview

This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.

Video Preview

Seo-Preview-V7.mp4

✨ Features

🕷️ Web Crawler

  • Multi-threaded Architecture: Configurable thread pool (default: 20 threads)
  • High Performance: Crawls 1000 documents in under 1 minute using 5 threads
  • Smart Batching: Prioritizes popular pages using frequency-based batching
  • Robots.txt Compliance: Respects web server policies with robust caching
  • Duplicate Detection: Content hashing prevents redundant processing
  • URL Normalization: Standardizes and filters invalid URLs
  • Compression: Stores crawled content efficiently
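
As a rough illustration of how URL normalization, content hashing, and a configurable thread pool can fit together, here is a minimal sketch. It is not the project's crawler: the class name `CrawlDedupSketch`, its methods, and the choice of SHA-256 are assumptions made for this example, and no network access is involved.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and hashing algorithm are assumptions.
class CrawlDedupSketch {
    // Hashes of page bodies already seen; safe to share across crawler threads.
    private final Set<String> seenHashes = ConcurrentHashMap.newKeySet();

    /** Lower-cases scheme and host, drops the fragment, resolves "." and ".." segments.
     *  Returns null for URLs that are malformed or not http/https. */
    static String normalize(String rawUrl) {
        try {
            URI uri = new URI(rawUrl.trim()).normalize();
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) return null;
            scheme = scheme.toLowerCase();
            if (!scheme.equals("http") && !scheme.equals("https")) return null;
            String rawPath = uri.getRawPath();
            String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
            String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
            String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
            return scheme + "://" + host.toLowerCase() + port + path + query;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Hex-encoded SHA-256 digest of the page body. */
    static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    /** Returns true only the first time a given body is seen. */
    boolean markIfNew(String body) {
        return seenHashes.add(sha256(body));
    }

    /** Processes bodies on a fixed-size pool and returns how many were unique. */
    int countUnique(List<String> bodies, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger unique = new AtomicInteger();
        for (String body : bodies) {
            pool.submit(() -> {
                if (markIfNew(body)) unique.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return unique.get();
    }
}
```

Because the set of hashes is concurrent, `markIfNew` can be called from any worker thread without extra locking; two pages with identical bodies are processed once, whichever thread reaches them first.
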

Optimization Techniques

  • Documents are compressed before being stored and decompressed on read, so the database holds much less data and its operations run faster.
  • The RobotsHandler implements a domain-based cache that maps hostnames to parsed robots.txt rules, so each domain's rules are fetched only once, regardless of how many URLs from that domain are crawled.
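
The two optimizations above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the README does not name the compression format, so GZIP is assumed here, and `CrawlerStorageSketch` and `HostRulesCache` are hypothetical names (the real `RobotsHandler` may be structured differently).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; GZIP and all names are assumptions.
class CrawlerStorageSketch {

    /** Compresses page content before it is written to the database. */
    static byte[] compress(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                gzip.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores the original content when a stored document is read back. */
    static String decompress(byte[] data) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Per-host cache: the loader (for example, a robots.txt fetch and parse)
     *  runs at most once per hostname. */
    static class HostRulesCache<R> {
        private final Map<String, R> rulesByHost = new ConcurrentHashMap<>();
        private final Function<String, R> loader;

        HostRulesCache(Function<String, R> loader) {
            this.loader = loader;
        }

        R rulesFor(String host) {
            // computeIfAbsent invokes the loader only when the host is not cached yet.
            return rulesByHost.computeIfAbsent(host.toLowerCase(), loader);
        }
    }
}
```

`ConcurrentHashMap.computeIfAbsent` applies the loader atomically per key, so concurrent crawler threads asking for the same host trigger a single fetch.
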

🧾 Indexer

Transforms HTML documents into inverted indices for fast search

  • Advanced Tokenization: Intelligent text processing and cleanup
  • Stop Word Filtering: Removes common words for better relevance
  • Stemming Support: Reduces words to their root forms
  • Field Extraction: Processes titles, headers, and content separately
  • Efficient Storage: Optimized database operations
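
To make the pipeline concrete, here is a minimal sketch of a positional inverted index with tokenization, stop-word filtering, and stemming. Everything in it is an assumption for illustration: the class `InvertedIndexSketch`, the tiny stop-word list, and the deliberately naive suffix-stripping stemmer stand in for whatever the indexer really uses, and field extraction (titles, headers) is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the project's indexer.
class InvertedIndexSketch {
    // A tiny illustrative stop-word list; a real list is much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "is", "in");

    // term -> (docId -> positions of the term among the document's kept tokens)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Lower-cases, splits on non-alphanumerics, drops stop words, stems. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            tokens.add(stem(raw));
        }
        return tokens;
    }

    /** Naive stemmer: strips one of "ing", "ed", "s". A real stemmer (e.g. Porter) does far more. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Adds a document's text to the index, recording each token's position. */
    void add(String docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns docId -> positions for a single word, after the same normalization. */
    Map<String, List<Integer>> lookup(String word) {
        List<String> tokens = tokenize(word);
        if (tokens.isEmpty()) return Map.of();
        return postings.getOrDefault(tokens.get(0), Map.of());
    }
}
```

Storing positions alongside document IDs costs extra space, but it is what lets a query processor check word adjacency without rescanning the original text.
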

📊 Ranking System

Ranks pages based on their PageRank, TF, and IDF scores

  • TF-IDF scoring for term relevance per page
  • Normalized PageRank contribution, used as a signal of domain authority
    • The PageRank computation takes ~10 ms on 6,000 documents
  • Structural field boosts for <title> and <h1> tag matches
  • Penalty applied for missing <h1> tags
  • Score capping to avoid overinflation
  • Computes PageRank as an offline process for all crawled URLs
  • Optimized database operations for fetching and saving ranks
    • Target: under 200 ms for 6,000 documents
    • Ranking logic runs in 8 to 50 ms, depending on the query
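
The signals listed above could be combined along the following lines. This is a sketch only: the README does not give the actual formula, so the class `RankingSketch`, the way the terms are combined, and every constant (boosts, PageRank weight, penalty, cap) are hypothetical values chosen for illustration. The PageRank part is the standard power-iteration method with a damping factor.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; formula shape and all constants are assumptions.
class RankingSketch {
    static final double TITLE_BOOST = 2.0;
    static final double H1_BOOST = 1.5;
    static final double PAGERANK_WEIGHT = 0.5;
    static final double MISSING_H1_PENALTY = 0.25;
    static final double SCORE_CAP = 10.0;

    /** Power-iteration PageRank. Every link target must also appear as a key of {@code links}. */
    static Map<String, Double> pageRank(Map<String, List<String>> links, double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : links.keySet()) rank.put(page, 1.0 / n);

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) next.put(page, (1 - damping) / n);

            double dangling = 0; // rank held by pages with no outgoing links
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> out = e.getValue();
                double share = rank.get(e.getKey());
                if (out.isEmpty()) {
                    dangling += share;
                    continue;
                }
                for (String target : out) {
                    next.merge(target, damping * share / out.size(), Double::sum);
                }
            }
            // Spread dangling rank evenly so the ranks keep summing to 1.
            double spread = damping * dangling / n;
            next.replaceAll((page, r) -> r + spread);
            rank = next;
        }
        return rank;
    }

    /** Inverse document frequency: rarer terms get a higher weight. */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    /** Combines TF-IDF, field boosts, normalized PageRank, the missing-h1 penalty, and the cap. */
    static double score(double tf, double idf, double normalizedPageRank,
                        boolean inTitle, boolean inH1, boolean pageHasH1) {
        double s = tf * idf;
        if (inTitle) s *= TITLE_BOOST;
        if (inH1) s *= H1_BOOST;
        s += PAGERANK_WEIGHT * normalizedPageRank;
        if (!pageHasH1) s -= MISSING_H1_PENALTY;
        return Math.max(0, Math.min(s, SCORE_CAP));
    }
}
```

Since PageRank depends only on the link graph, it can be computed offline and stored, leaving just the cheap `score` arithmetic for query time.
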

🔍 Query Processing & Phrase Searching

Processes user queries, supports phrase search, and generates result snippets.

  • Unified Tokenization: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency
  • Exact Phrase Search: Supports precise phrase matching, even when stop words are present
  • Multi-threading: Speeds up snippet generation and phrase matching
  • Fast Response Times:
    • General query: 0.01 – 0.2 seconds
    • Phrase search: < 0.3 seconds

Optimization Techniques

  • Snippet generation is triggered only when the corresponding result page is requested
  • Uses token position lookup from the inverted index — avoids full document scans (no regex!)
  • Early filtering (before stemming) to narrow down the result set for phrase-search queries
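
The position-lookup idea can be sketched as below: a document matches a phrase when its terms occur at consecutive positions in the index, so the document text itself is never scanned. This is an assumption-laden illustration, not the project's query processor. `PhraseMatchSketch` and the index shape (term to document to sorted positions) are hypothetical, and the early pre-stemming filter and stop-word handling are omitted.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; index layout and names are assumptions.
class PhraseMatchSketch {

    /**
     * @param phrase query terms in order, already normalized like the indexed terms
     * @param index  term -> (docId -> sorted positions of that term in the document)
     * @return IDs of documents containing the terms at consecutive positions
     */
    static Set<String> matchPhrase(List<String> phrase,
                                   Map<String, Map<String, List<Integer>>> index) {
        Set<String> matches = new TreeSet<>();
        if (phrase.isEmpty()) return matches;

        // Only documents containing the first term can match at all.
        Map<String, List<Integer>> first = index.getOrDefault(phrase.get(0), Map.of());
        for (Map.Entry<String, List<Integer>> entry : first.entrySet()) {
            String docId = entry.getKey();
            for (int start : entry.getValue()) {
                if (continuesAt(phrase, index, docId, start)) {
                    matches.add(docId);
                    break; // one occurrence is enough for this document
                }
            }
        }
        return matches;
    }

    /** Checks that term k of the phrase sits at position start + k in the document. */
    private static boolean continuesAt(List<String> phrase,
                                       Map<String, Map<String, List<Integer>>> index,
                                       String docId, int start) {
        for (int offset = 1; offset < phrase.size(); offset++) {
            List<Integer> positions = index
                    .getOrDefault(phrase.get(offset), Map.of())
                    .getOrDefault(docId, List.of());
            // Positions are sorted, so a binary search is sufficient.
            if (Collections.binarySearch(positions, start + offset) < 0) return false;
        }
        return true;
    }
}
```

The work per candidate document is a handful of binary searches over short position lists, which is why this approach scales better than running a regular expression over every candidate's full text.
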

Flow

📝 Development Guidelines

  • Follow the conventions outlined in Project-Guidelines.md

🛠️ How to Run

Prerequisites

  • MongoDB must be running locally
  • Maven should be installed and available in your terminal
  • Ensure your application.properties file is properly configured
  • Clone the repository and navigate to the project root directory

Backend Setup

  1. Open a terminal and navigate to the backend directory:

    cd engine
  2. Start the Spring Boot application:

    mvn spring-boot:run

To start each module in order:

  1. Run the Crawler (the default thread count is 20):

    make crawl THREADS=10
  2. Run the Indexer:

    make index
  3. Run the PageRank Module:

    make pagerank
  4. Run the Ranker (with an optional query):

    make rank QUERY="your search terms"

Frontend Setup

  1. Open a new terminal and go to the frontend directory:

    cd client
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm run dev
  4. Open your browser and visit: http://localhost:5173

👥 Contributors

Habiba Ayman

Tasneem Mohamed

Loay Ahmed

Helana Nady
