WebScraper and Content Analysis is a microservice-based system designed to scrape, analyze, and make web content searchable.
Users submit URLs to be scraped. The system then:
- Scrapes and cleans the content from webpages.
- Analyzes the text using AI (LLM API) to categorize articles, summarize them, and extract keywords.
- Stores enriched results in Elasticsearch for fast and powerful querying.
This architecture is designed with scalability, fault tolerance, and clean separation of responsibilities in mind.
The project is built around event-driven microservices connected by message queues and streaming platforms.
- Orchestrator (Go): Accepts scraping jobs, stores metadata in PostgreSQL, splits jobs into tasks, and queues them in RabbitMQ.
- Scraper Workers (Go): Consume tasks, fetch HTML, clean text, and publish raw content into Kafka.
- AI Analysis Worker (Python): Consumes Kafka messages, sends text to Groq LLM API for categorization, summarization, and keyword extraction, then stores results in Elasticsearch.
- Query Service (Go): Provides REST API endpoints to query the enriched content from Elasticsearch.
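To make the task flow concrete, here is a minimal sketch of the Scraper Worker's consume-fetch-publish loop. It is an illustration under assumptions, not the project's actual code: it uses the `amqp091-go` and `segmentio/kafka-go` clients and assumes each RabbitMQ task body is simply the URL to scrape; the real worker also cleans the HTML before publishing to Kafka.

```go
// Sketch of a scraper worker: consume a URL task from RabbitMQ, fetch the
// page, and forward the raw content to Kafka. Library choices and the
// plain-URL task format are assumptions for illustration.
package main

import (
	"context"
	"io"
	"log"
	"net/http"
	"os"

	amqp "github.com/rabbitmq/amqp091-go"
	"github.com/segmentio/kafka-go"
)

func main() {
	// Consume scraping tasks from RabbitMQ.
	conn, err := amqp.Dial(os.Getenv("RABBITMQ_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	tasks, err := ch.Consume(os.Getenv("QUEUE_NAME"), "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Publish raw content to Kafka for the AI Analysis Worker.
	writer := &kafka.Writer{
		Addr:  kafka.TCP(os.Getenv("KAFKA_BOOTSTRAP_SERVICE")),
		Topic: os.Getenv("KAFKA_RAW_CONTENT_TOPIC"),
	}
	defer writer.Close()

	for task := range tasks {
		url := string(task.Body) // assumption: the task body is just the URL

		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch %s: %v", url, err)
			task.Nack(false, true) // requeue on transient failure
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		// The real worker strips markup/boilerplate here before publishing.
		err = writer.WriteMessages(context.Background(), kafka.Message{
			Key:   []byte(url),
			Value: body,
		})
		if err != nil {
			log.Printf("publish %s: %v", url, err)
			task.Nack(false, true)
			continue
		}
		task.Ack(false)
	}
}
```

Acknowledging the RabbitMQ delivery only after the Kafka write succeeds is what gives the pipeline its retry-friendly, at-least-once behavior.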
WebScraper and Content Analysis is built using the following technologies:
- Go (Golang) → For Orchestrator, Scraper, and Query services. Provides high performance and concurrency for distributed tasks.
- Python → For AI Analysis Worker. Simplifies integration with Groq LLM API.
- React → Modern frontend framework for building responsive user interfaces.
- PostgreSQL → Stores job metadata including job ID, URL, and status.
- RabbitMQ → A reliable message broker used to distribute scraping tasks to workers.
- Kafka → A streaming platform used for handling large-scale raw content pipelines.
- Elasticsearch → Full-text search engine storing enriched content and enabling fast queries.
- Groq LLM API → Provides AI-based summarization, categorization, and keyword extraction.
- Docker & Docker Compose → For containerized deployment of databases, brokers, and services.
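As an illustration of how the Orchestrator ties PostgreSQL and RabbitMQ together, the sketch below records a job row and fans its URLs out as individual tasks. The `jobs` table schema, the plain-URL task payload, and the `lib/pq` driver choice are assumptions made for this example, not code taken from the project.

```go
// Sketch: persist job metadata in PostgreSQL, then publish one RabbitMQ task
// per URL. Schema and payload format are illustrative assumptions.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/lib/pq"
	amqp "github.com/rabbitmq/amqp091-go"
)

func enqueueJob(db *sql.DB, ch *amqp.Channel, urls []string) error {
	// 1. Record job metadata (ID, status) in PostgreSQL.
	var jobID int64
	err := db.QueryRow(
		`INSERT INTO jobs (status) VALUES ('pending') RETURNING id`,
	).Scan(&jobID)
	if err != nil {
		return err
	}

	// 2. Split the job into one task per URL and publish each to RabbitMQ.
	for _, url := range urls {
		if err := ch.Publish(
			"",                      // default exchange
			os.Getenv("QUEUE_NAME"), // routing key = queue name
			false, false,
			amqp.Publishing{ContentType: "text/plain", Body: []byte(url)},
		); err != nil {
			return err
		}
	}
	log.Printf("job %d: queued %d task(s)", jobID, len(urls))
	return nil
}

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	conn, err := amqp.Dial(os.Getenv("RABBITMQ_URL"))
	if err != nil {
		log.Fatal(err)
	}
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if err := enqueueJob(db, ch, []string{"https://example.com"}); err != nil {
		log.Fatal(err)
	}
}
```

Recording the job before publishing means a crash between the two steps leaves a `pending` row that can be retried, rather than tasks with no record of their parent job.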
- Modern Web Interface: Clean, responsive React frontend for easy interaction.
- Job Orchestration: Submit jobs with URLs, track their progress in PostgreSQL.
- Scalable Scraping: Multiple workers can scrape URLs concurrently.
- AI Enrichment: Articles are categorized, summarized into three bullet points, and tagged with extracted keywords.
- Full-Text Search: Search enriched data stored in Elasticsearch via a REST API (see the sketch after this list).
- Resilient & Modular: Each service can be scaled independently.
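The sketch below shows what the Full-Text Search feature can look like on the Query Service side: an HTTP handler that forwards a `q` parameter to Elasticsearch as a `multi_match` query. The `go-elasticsearch` client, the `/search` route, the `:8080` port, and the field names (`summary`, `keywords`, `content`) are assumptions for illustration, not the project's actual API.

```go
// Sketch of a full-text search endpoint backed by Elasticsearch.
// Route, port, and document field names are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"strings"

	elasticsearch "github.com/elastic/go-elasticsearch/v8"
)

func main() {
	es, err := elasticsearch.NewClient(elasticsearch.Config{
		Addresses: []string{os.Getenv("ELASTICSEARCH_HOST")},
	})
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query().Get("q")

		// Match the user's query against the enriched fields (names assumed).
		body := fmt.Sprintf(
			`{"query":{"multi_match":{"query":%q,"fields":["summary","keywords","content"]}}}`,
			q,
		)

		res, err := es.Search(
			es.Search.WithIndex(os.Getenv("ELASTICSEARCH_INDEX")),
			es.Search.WithBody(strings.NewReader(body)),
		)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer res.Body.Close()

		// Relay Elasticsearch's JSON response to the caller as-is.
		w.Header().Set("Content-Type", "application/json")
		io.Copy(w, res.Body)
	})

	log.Fatal(http.ListenAndServe(":8080", nil)) // port is an assumption
}
```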
Using Docker Compose is the easiest way to get all dependencies (PostgreSQL, RabbitMQ, Kafka, Elasticsearch) and services up and running.
- Clone the repository:

git clone https://github.com/your-username/webscraperandcontentanalysis.git
cd webscraperandcontentanalysis
- Set up environment variables:

cp .env.example .env

Edit the .env file and add your configuration:

    # Database
    DATABASE_URL=postgresql://user:password@postgres:5432/webscraper?sslmode=disable

    # RabbitMQ
    RABBITMQ_URL=amqp://user:password@rabbitmq:5672/
    QUEUE_NAME=scraper_jobs

    # Kafka
    KAFKA_BOOTSTRAP_SERVICE=kafka:9092
    KAFKA_RAW_CONTENT_TOPIC=raw_content

    # Groq API
    GROQ_API_KEY=your_groq_api_key_here

    # Elasticsearch
    ELASTICSEARCH_HOST=http://elasticsearch:9200
    ELASTICSEARCH_INDEX=web_content
- Run with Docker Compose:

docker-compose up -d
- Verify services are running:

docker-compose ps
For local development without Docker, export the required environment variables in each terminal before starting a service:

export DATABASE_URL="postgresql://user:password@localhost:5432/webscraper?sslmode=disable"
export RABBITMQ_URL="amqp://user:password@localhost:5672/"
export QUEUE_NAME="scraper_jobs"
export KAFKA_BOOTSTRAP_SERVICE="localhost:9092"
export KAFKA_RAW_CONTENT_TOPIC="raw_content"
export GROQ_API_KEY="your_groq_api_key_here"
export ELASTICSEARCH_HOST="http://localhost:9200"
export ELASTICSEARCH_INDEX="web_content"

Terminal 1 - Orchestrator:
go run -C orchestrator ./cmd

Terminal 2 - Scraper Worker:
go run -C scraper-worker ./cmd/worker

Terminal 3 - AI Analysis Worker:
cd ai-worker
pip install -r requirements.txt
python3 main.py

Terminal 4 - Query Service:
go run -C query-service cmd/main.go

Terminal 5 - React Frontend:
cd client
npm install
npm run dev

The frontend will be available at http://localhost:5173.
webscraperandcontentanalysis/
├── client/               # React + Vite frontend application
│   ├── src/              # Source code
│   ├── public/           # Static assets
│   └── package.json      # Frontend dependencies
├── orchestrator/         # Go service for job management
├── scraper-worker/       # Go service for web scraping
├── ai-worker/            # Python service for AI analysis
├── query-service/        # Go service for Elasticsearch queries
├── docker-compose.yml    # Docker configuration
├── .env.example          # Environment template
└── README.md             # This file
| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | - |
| `RABBITMQ_URL` | RabbitMQ connection URL | - |
| `QUEUE_NAME` | RabbitMQ queue name | `scraper_jobs` |
| `KAFKA_BOOTSTRAP_SERVICE` | Kafka broker address | - |
| `KAFKA_RAW_CONTENT_TOPIC` | Kafka topic for raw content | `raw_content` |
| `GROQ_API_KEY` | Groq API key for AI analysis | - |
| `ELASTICSEARCH_HOST` | Elasticsearch connection URL | - |
| `ELASTICSEARCH_INDEX` | Elasticsearch index name | `web_content` |
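These variables are read from the environment by each service at startup. As a minimal sketch (not the project's actual configuration code), a Go service could resolve them with the documented defaults like this:

```go
// Sketch: read configuration from the environment, applying only the
// defaults documented in the table above; everything else is required.
package main

import (
	"fmt"
	"os"
)

// getenv returns the value of key, falling back to def when it is unset.
func getenv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	queue := getenv("QUEUE_NAME", "scraper_jobs")
	topic := getenv("KAFKA_RAW_CONTENT_TOPIC", "raw_content")
	index := getenv("ELASTICSEARCH_INDEX", "web_content")

	// Variables without a documented default must be provided explicitly.
	required := []string{"DATABASE_URL", "RABBITMQ_URL", "KAFKA_BOOTSTRAP_SERVICE", "GROQ_API_KEY", "ELASTICSEARCH_HOST"}
	for _, key := range required {
		if os.Getenv(key) == "" {
			fmt.Printf("missing required variable %s\n", key)
		}
	}

	fmt.Println(queue, topic, index)
}
```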
- Open http://localhost:5173 in your browser
- Enter a URL in the submission form
- Use the search functionality to find analyzed content
- View detailed results with categories, summaries, and keywords
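The web UI is the intended entry point, but the same flow can be scripted directly against the services' REST APIs. The sketch below is heavily hypothetical: the endpoint paths (`/jobs`, `/search`), the ports, and the request body shape are placeholders I have assumed for illustration, so check the Orchestrator and Query Service routes for the real values.

```go
// Sketch: submit a scraping job and search the enriched content over HTTP.
// All endpoint paths, ports, and payload fields below are assumptions.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// 1. Submit a URL for scraping (hypothetical Orchestrator endpoint).
	resp, err := http.Post(
		"http://localhost:8081/jobs", // assumed address and path
		"application/json",
		strings.NewReader(`{"url":"https://example.com/article"}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// 2. Later, search the enriched content (hypothetical Query Service endpoint).
	res, err := http.Get("http://localhost:8080/search?q=climate") // assumed address and path
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	fmt.Println(string(body))
}
```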