WebScraper and Content Analysis is a microservice-based system designed to scrape, analyze, and make web content searchable.
Users submit URLs to be scraped. The system then:
- Scrapes and cleans the content from webpages.
- Analyzes the text using AI (LLM API) to categorize articles, summarize them, and extract keywords.
- Stores enriched results in Elasticsearch for fast and powerful querying.
This architecture is designed with scalability, fault tolerance, and clean separation of responsibilities in mind.
The project is built around event-driven microservices connected by message queues and streaming platforms.
- Orchestrator (Go): Accepts scraping jobs, stores metadata in PostgreSQL, splits jobs into tasks, and queues them in RabbitMQ.
- Scraper Workers (Go): Consume tasks, fetch HTML, clean text, and publish raw content into Kafka.
- AI Analysis Worker (Python): Consumes Kafka messages, sends text to Groq LLM API for categorization, summarization, and keyword extraction, then stores results in Elasticsearch.
- Query Service (Go): Provides REST API endpoints to query the enriched content from Elasticsearch.
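To make the task flow concrete, here is a minimal sketch of the Scraper Worker's consume-fetch-publish loop. It is an illustration under assumptions, not the project's actual code: it uses the `amqp091-go` and `segmentio/kafka-go` clients and assumes each RabbitMQ task body is simply the URL to scrape; the real worker also cleans the HTML before publishing to Kafka.

```go
// Sketch of a scraper worker: consume a URL task from RabbitMQ, fetch the
// page, and forward the raw content to Kafka. Library choices and the
// plain-URL task format are assumptions for illustration.
package main

import (
	"context"
	"io"
	"log"
	"net/http"
	"os"

	amqp "github.com/rabbitmq/amqp091-go"
	"github.com/segmentio/kafka-go"
)

func main() {
	// Consume scraping tasks from RabbitMQ.
	conn, err := amqp.Dial(os.Getenv("RABBITMQ_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	tasks, err := ch.Consume(os.Getenv("QUEUE_NAME"), "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Publish raw content to Kafka for the AI Analysis Worker.
	writer := &kafka.Writer{
		Addr:  kafka.TCP(os.Getenv("KAFKA_BOOTSTRAP_SERVICE")),
		Topic: os.Getenv("KAFKA_RAW_CONTENT_TOPIC"),
	}
	defer writer.Close()

	for task := range tasks {
		url := string(task.Body) // assumption: the task body is just the URL

		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch %s: %v", url, err)
			task.Nack(false, true) // requeue on transient failure
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// The real worker strips markup/boilerplate here before publishing.
		err = writer.WriteMessages(context.Background(), kafka.Message{
			Key:   []byte(url),
			Value: body,
		})
		if err != nil {
			log.Printf("publish %s: %v", url, err)
			task.Nack(false, true)
			continue
		}
		task.Ack(false)
	}
}
```

Acknowledging the RabbitMQ delivery only after the Kafka write succeeds is what gives the pipeline its retry-friendly, at-least-once behavior.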
WebScraper and Content Analysis is built using the following technologies:
- Go (Golang) → For Orchestrator, Scraper, and Query services. Provides high performance and concurrency for distributed tasks.
- Python → For AI Analysis Worker. Simplifies integration with Groq LLM API.
- React → Modern frontend framework for building responsive user interfaces.
- PostgreSQL → Stores job metadata including job ID, URL, and status.
- RabbitMQ → A reliable message broker used to distribute scraping tasks to workers.
- Kafka → A streaming platform used for handling large-scale raw content pipelines.
- Elasticsearch → Full-text search engine storing enriched content and enabling fast queries.
- Groq LLM API → Provides AI-based summarization, categorization, and keyword extraction.
- Docker & Docker Compose → For containerized deployment of databases, brokers, and services.
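As an illustration of how the Orchestrator ties PostgreSQL and RabbitMQ together, the sketch below records a job row and fans its URLs out as individual tasks. The `jobs` table schema, the plain-URL task payload, and the `lib/pq` driver choice are assumptions made for this example, not code taken from the project.

```go
// Sketch: persist job metadata in PostgreSQL, then publish one RabbitMQ task
// per URL. Schema and payload format are illustrative assumptions.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq"
	amqp "github.com/rabbitmq/amqp091-go"
)

func enqueueJob(db *sql.DB, ch *amqp.Channel, urls []string) error {
	// 1. Record job metadata (ID, status) in PostgreSQL.
	var jobID int64
	err := db.QueryRow(
		`INSERT INTO jobs (status) VALUES ('pending') RETURNING id`,
	).Scan(&jobID)
	if err != nil {
		return err
	}

	// 2. Split the job into one task per URL and publish each to RabbitMQ.
	for _, url := range urls {
		if err := ch.Publish(
			"",                      // default exchange
			os.Getenv("QUEUE_NAME"), // routing key = queue name
			false, false,
			amqp.Publishing{ContentType: "text/plain", Body: []byte(url)},
		); err != nil {
			return err
		}
	}
	log.Printf("job %d: queued %d task(s)", jobID, len(urls))
	return nil
}

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	conn, err := amqp.Dial(os.Getenv("RABBITMQ_URL"))
	if err != nil {
		log.Fatal(err)
	}
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if err := enqueueJob(db, ch, []string{"https://example.com"}); err != nil {
		log.Fatal(err)
	}
}
```

Recording the job before publishing means a crash between the two steps leaves a `pending` row that can be retried, rather than tasks with no record of their parent job.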
- Modern Web Interface: Clean, responsive React frontend for easy interaction.
- Job Orchestration: Submit jobs with URLs, track their progress in PostgreSQL.
- Scalable Scraping: Multiple workers can scrape URLs concurrently.
- AI Enrichment: Articles are categorized, summarized into three bullet points, and tagged with extracted keywords.
- Full-Text Search: Search enriched data stored in Elasticsearch via a REST API (see the sketch after this list).
- Resilient & Modular: Each service can be scaled independently.
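The sketch below shows what the Full-Text Search feature can look like on the Query Service side: an HTTP handler that forwards a `q` parameter to Elasticsearch as a `multi_match` query. The `go-elasticsearch` client, the `/search` route, the `:8080` port, and the field names (`summary`, `keywords`, `content`) are assumptions for illustration, not the project's actual API.

```go
// Sketch of a full-text search endpoint backed by Elasticsearch.
// Route, port, and document field names are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"

	elasticsearch "github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: []string{os.Getenv("ELASTICSEARCH_HOST")},
	})
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")

		// Match the user's query against the enriched fields (names assumed).
		body := fmt.Sprintf(
			`{"query":{"multi_match":{"query":%q,"fields":["summary","keywords","content"]}}}`,
			q,
		)

		res, err := es.Search(
			es.Search.WithIndex(os.Getenv("ELASTICSEARCH_INDEX")),
			es.Search.WithBody(strings.NewReader(body)),
		)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer res.Body.Close()

		// Relay Elasticsearch's JSON response to the caller as-is.
		w.Header().Set("Content-Type", "application/json")
		io.Copy(w, res.Body)
	})

	log.Fatal(http.ListenAndServe(":8080", nil)) // port is an assumption
}
```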
Using Docker Compose is the easiest way to get all dependencies (PostgreSQL, RabbitMQ, Kafka, Elasticsearch) and services up and running.
- Clone the repository:

git clone https://github.com/your-username/webscraperandcontentanalysis.git
cd webscraperandcontentanalysis
- Set up environment variables:

cp .env.example .env

Edit the .env file and add your configuration:

    # Database
    DATABASE_URL=postgresql://user:password@postgres:5432/webscraper?sslmode=disable

    # RabbitMQ
    RABBITMQ_URL=amqp://user:password@rabbitmq:5672/
    QUEUE_NAME=scraper_jobs

    # Kafka
    KAFKA_BOOTSTRAP_SERVICE=kafka:9092
    KAFKA_RAW_CONTENT_TOPIC=raw_content

    # Groq API
    GROQ_API_KEY=your_groq_api_key_here

    # Elasticsearch
    ELASTICSEARCH_HOST=http://elasticsearch:9200
    ELASTICSEARCH_INDEX=web_content
- Run with Docker Compose:

docker-compose up -d
- Verify services are running:

docker-compose ps
For local development without Docker, export the required environment variables in each terminal before starting a service:

export DATABASE_URL="postgresql://user:password@localhost:5432/webscraper?sslmode=disable"
export RABBITMQ_URL="amqp://user:password@localhost:5672/"
export QUEUE_NAME="scraper_jobs"
export KAFKA_BOOTSTRAP_SERVICE="localhost:9092"
export KAFKA_RAW_CONTENT_TOPIC="raw_content"
export GROQ_API_KEY="your_groq_api_key_here"
export ELASTICSEARCH_HOST="http://localhost:9200"
export ELASTICSEARCH_INDEX="web_content"

Terminal 1 - Orchestrator:
go run -C orchestrator ./cmd

Terminal 2 - Scraper Worker:
go run -C scraper-worker ./cmd/worker

Terminal 3 - AI Analysis Worker:
cd ai-worker
pip install -r requirements.txt
python3 main.py

Terminal 4 - Query Service:
go run -C query-service cmd/main.go

Terminal 5 - React Frontend:
cd client
npm install
npm run dev

The frontend will be available at http://localhost:5173.
webscraperandcontentanalysis/
├── client/               # React + Vite frontend application
│   ├── src/              # Source code
│   ├── public/           # Static assets
│   └── package.json      # Frontend dependencies
├── orchestrator/         # Go service for job management
├── scraper-worker/       # Go service for web scraping
├── ai-worker/            # Python service for AI analysis
├── query-service/        # Go service for Elasticsearch queries
├── docker-compose.yml    # Docker configuration
├── .env.example          # Environment template
└── README.md             # This file
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | - |
| `RABBITMQ_URL` | RabbitMQ connection URL | - |
| `QUEUE_NAME` | RabbitMQ queue name | `scraper_jobs` |
| `KAFKA_BOOTSTRAP_SERVICE` | Kafka broker address | - |
| `KAFKA_RAW_CONTENT_TOPIC` | Kafka topic for raw content | `raw_content` |
| `GROQ_API_KEY` | Groq API key for AI analysis | - |
| `ELASTICSEARCH_HOST` | Elasticsearch connection URL | - |
| `ELASTICSEARCH_INDEX` | Elasticsearch index name | `web_content` |
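These variables are read from the environment by each service at startup. As a minimal sketch (not the project's actual configuration code), a Go service could resolve them with the documented defaults like this:

```go
// Sketch: read configuration from the environment, applying only the
// defaults documented in the table above; everything else is required.
package main

import (
	"fmt"
	"os"
)

// getenv returns the value of key, falling back to def when it is unset.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	queue := getenv("QUEUE_NAME", "scraper_jobs")
	topic := getenv("KAFKA_RAW_CONTENT_TOPIC", "raw_content")
	index := getenv("ELASTICSEARCH_INDEX", "web_content")

	// Variables without a documented default must be provided explicitly.
	required := []string{"DATABASE_URL", "RABBITMQ_URL", "KAFKA_BOOTSTRAP_SERVICE", "GROQ_API_KEY", "ELASTICSEARCH_HOST"}
	for _, key := range required {
		if os.Getenv(key) == "" {
			fmt.Printf("missing required variable %s\n", key)
		}
	}

	fmt.Println(queue, topic, index)
}
```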
- Open http://localhost:5173 in your browser
- Enter a URL in the submission form
- Use the search functionality to find analyzed content
- View detailed results with categories, summaries, and keywords
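The web UI is the intended entry point, but the same flow can be scripted directly against the services' REST APIs. The sketch below is heavily hypothetical: the endpoint paths (`/jobs`, `/search`), the ports, and the request body shape are placeholders I have assumed for illustration, so check the Orchestrator and Query Service routes for the real values.

```go
// Sketch: submit a scraping job and search the enriched content over HTTP.
// All endpoint paths, ports, and payload fields below are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// 1. Submit a URL for scraping (hypothetical Orchestrator endpoint).
	resp, err := http.Post(
		"http://localhost:8081/jobs", // assumed address and path
		"application/json",
		strings.NewReader(`{"url":"https://example.com/article"}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// 2. Later, search the enriched content (hypothetical Query Service endpoint).
	res, err := http.Get("http://localhost:8080/search?q=climate") // assumed address and path
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	fmt.Println(string(body))
}
```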