🔍 Search Engine Project

Hola! Can you say "The best search engine ever" with me?

A full-stack search engine built with a Java/Spring Boot backend and a Vue.js frontend

Project Overview

This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.

Video Preview

Seo-Preview-V7.mp4

✨ Features

🕷️ Web Crawler

Multi-threaded Architecture: Configurable thread pool (default: 20 threads)
High Performance: Crawls 1000 documents in under 1 minute using 5 threads
Smart Batching: Prioritizes popular pages using frequency-based batching
Robots.txt Compliance: Respects web server policies with robust caching
Duplicate Detection: Content hashing prevents redundant processing
URL Normalization: Standardizes and filters invalid URLs
Compression: Stores crawled content efficiently
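
As a rough illustration of the duplicate-detection idea above, the sketch below hashes page content with SHA-256 and keeps the digests in a thread-safe set. The class name `ContentDeduplicator` and its methods are hypothetical, and the choice of SHA-256 is an assumption; the project may use a different hash or store the hashes in MongoDB.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not the project's actual class.
final class ContentDeduplicator {
    // Concurrent set so several crawler threads can share one instance.
    private final Set<String> seenHashes = ConcurrentHashMap.newKeySet();

    /** Returns true the first time this content is seen, false for repeats. */
    boolean markIfNew(String content) {
        return seenHashes.add(sha256Hex(content));
    }

    /** Hex-encoded SHA-256 digest of the UTF-8 bytes of the content. */
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

A crawler thread would call `markIfNew(html)` after fetching a page and skip further processing when it returns `false`.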

Optimization Techniques

Compresses documents before storing them and decompresses them on read, so the database holds much less data and operations run faster.
The RobotsHandler implements a domain-based cache that maps hostnames to parsed robots.txt rules, so each domain's rules are fetched only once regardless of how many URLs from that domain are crawled.
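
The per-domain caching described above can be sketched with `ConcurrentHashMap.computeIfAbsent`, which runs the fetch at most once per hostname even when many threads ask at the same time. Everything here is an assumption made for illustration: the class name `RobotsCache`, the injected `fetcher` function, and the simplified rule format (a list of disallowed path prefixes) are not taken from the project's RobotsHandler.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a hostname -> robots rules cache.
final class RobotsCache {
    // host -> disallowed path prefixes (a simplified stand-in for parsed rules)
    private final Map<String, List<String>> rulesByHost = new ConcurrentHashMap<>();
    // Assumed to download and parse robots.txt for a host.
    private final Function<String, List<String>> fetcher;

    RobotsCache(Function<String, List<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** True if no cached disallow prefix matches the path. */
    boolean isAllowed(String host, String path) {
        // The fetcher runs only on the first lookup for each host.
        List<String> disallowed = rulesByHost.computeIfAbsent(host, fetcher);
        return disallowed.stream().noneMatch(path::startsWith);
    }
}
```

Injecting the fetcher keeps the cache testable without network access; a real handler would also need to deal with wildcards, `Allow` rules, and fetch failures.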

🧾 Indexer

Transforms HTML documents into inverted indices for fast search

Advanced Tokenization: Intelligent text processing and cleanup
Stop Word Filtering: Removes common words for better relevance
Stemming Support: Reduces words to their root forms
Field Extraction: Processes titles, headers, and content separately
Efficient Storage: Optimized database operations
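
To make the inverted-index idea concrete, here is a minimal in-memory sketch that tokenizes text, skips stop words, and records each term's positions per document. The class `TinyInvertedIndex`, its stop-word list, and the tokenization regex are assumptions; stemming and field extraction are left out, and the project persists its index in MongoDB rather than in a `HashMap`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index for illustration only.
final class TinyInvertedIndex {
    // Small sample list; a real indexer would use a fuller one.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // term -> (docId -> token positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Lowercases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    void add(String docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            String term = tokens.get(pos);
            if (STOP_WORDS.contains(term)) {
                continue; // not indexed, but the position counter still advances
            }
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns docId -> positions for the term, or an empty map. */
    Map<String, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }
}
```

Keeping the original token positions, even for skipped stop words, is one way to leave room for exact phrase matching later.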

📊 Ranking System

Ranks pages based on their PageRank, TF, and IDF scores

TF-IDF scoring for term relevance per page
Normalized PageRank influence for domain authority
- Uses the PageRank algorithm, which takes ~10ms on 6,000 documents
Structural field boosts for <title> and <h1> tag matches
Penalty applied for missing <h1> tags
Score capping to avoid overinflation
PageRank is computed as an offline process for all crawled URLs
Optimized database operations for fetching and saving ranks
- Target: <200ms for 6,000 documents
- Ranking logic runs in 8 to 50ms depending on the query
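
For readers unfamiliar with the offline PageRank step, the following is a minimal power-iteration sketch over an adjacency list. It is not the project's implementation: the class name `PageRankSketch`, the `int[][]` graph representation, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions chosen to keep the example short.

```java
import java.util.Arrays;

// Hypothetical power-iteration PageRank for illustration only.
final class PageRankSketch {
    /**
     * outLinks[i] lists the pages that page i links to.
     * damping is typically around 0.85.
     */
    static double[] compute(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outgoing links

            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) {
                    dangling += rank[page];
                } else {
                    double share = rank[page] / outLinks[page].length;
                    for (int target : outLinks[page]) {
                        next[target] += share;
                    }
                }
            }
            for (int page = 0; page < n; page++) {
                // Random-jump term plus damped link and dangling contributions.
                next[page] = (1 - damping) / n + damping * (next[page] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, `PageRankSketch.compute(new int[][] {{2}, {2}, {0}}, 0.85, 50)` ranks page 2 highest, because both other pages link to it. The ranks sum to 1, so they can be rescaled before being mixed with TF-IDF scores.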

🔍 Query processing & Phrase Searching

Processes user queries, supports phrase search, and generates result snippets.

Unified Tokenization: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency
Exact Phrase Search: Supports precise phrase matching, even when stop words are present
Multi-threading: Speeds up snippet generation and phrase matching
Fast Response Times:
- General query: 0.01 – 0.2 seconds
- Phrase search: < 0.3 seconds

Optimization Techniques

Snippet generation is triggered only when the corresponding result page is requested
Uses token position lookup from the inverted index, which avoids full document scans (no regex!)
Early filtering (before stemming) to narrow down the result set for phrase-search queries
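
The position-lookup approach to phrase search can be sketched as follows: given each term's positions in a single document, a phrase matches when some occurrence of the first term is followed by the remaining terms at consecutive positions. The class `PhraseMatcher` and the shape of its input map are assumptions for illustration; the project's actual data structures and its multi-threaded matching are not shown.

```java
import java.util.List;
import java.util.Map;

// Hypothetical phrase matcher over per-document token positions.
final class PhraseMatcher {
    /**
     * phrase: query tokens in order.
     * positionsByTerm: term -> positions of that term within ONE document.
     */
    static boolean containsPhrase(List<String> phrase,
                                  Map<String, List<Integer>> positionsByTerm) {
        if (phrase.isEmpty()) {
            return false;
        }
        List<Integer> starts = positionsByTerm.get(phrase.get(0));
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean match = true;
            // Each following term must sit exactly `offset` tokens after the start.
            for (int offset = 1; offset < phrase.size() && match; offset++) {
                List<Integer> positions = positionsByTerm.get(phrase.get(offset));
                match = positions != null && positions.contains(start + offset);
            }
            if (match) {
                return true;
            }
        }
        return false;
    }
}
```

Only position lists are consulted, so the document text is never rescanned. With long position lists, sorted arrays and binary search would be a faster alternative to `List.contains`.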

Flow

📝 Development Guidelines

Follow the conventions outlined in Project-Guidelines.md

🛠️ How to Run

Prerequisites

MongoDB must be running locally
Maven should be installed and available in your terminal
Ensure your application.properties file is properly configured
Clone the repository and navigate to the project root directory

Backend Setup

Open a terminal and navigate to the backend directory:
```
cd engine
```
Run the Spring Boot application:
```
mvn spring-boot:run
```

To start each module in order:

Run the Crawler (default thread count is 20)

make crawl THREADS=10

Run the Indexer

make index

Run the PageRank Module

make pagerank

Run the Ranker (with an optional query)

make rank QUERY="your search terms"

Frontend Setup

Open a new terminal and go to the frontend directory:
```
cd client
```
Install dependencies:
```
npm install
```
Start the development server:
```
npm run dev
```
Open your browser and visit: http://localhost:5173

👥 Contributors

Habiba Ayman

Tasneem Mohamed

Loay Ahmed

Helana Nady
