# Web Scraping and Analysis Service

This project implements a distributed web scraping and analysis service using RabbitMQ, TypeScript (for the producer), and Python (for the consumer). It captures screenshots of websites, analyzes them using OpenAI's GPT-4 Vision model, and stores the results in a PostgreSQL database.
## Table of Contents

- [Project Overview](#project-overview)
- [Components](#components)
- [Prerequisites](#prerequisites)
- [Setup](#setup)
- [Usage](#usage)
- [Architecture](#architecture)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
- [Getting Started](#getting-started)
- [Example Output](#example-output)
- [Scaling for Production](#scaling-for-production)
- [Limitations and Future Work](#limitations-and-future-work)
## Project Overview

This service automates the process of capturing website screenshots, extracting information from these images using AI, and storing the analysis results. It's designed to be scalable and can handle a large number of URLs efficiently.
## Components

- Producer (TypeScript): Captures screenshots of websites and sends them to a RabbitMQ queue.
- Consumer (Python): Processes screenshots from the queue, analyzes them using OpenAI's GPT-4 Vision API, and stores results in a PostgreSQL database.
- RabbitMQ: Message broker for coordinating work between the producer and consumer (a sketch of a possible message format follows this list).
- PostgreSQL: Database for storing analysis results.
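The actual message format is defined in the producer and consumer code. As a rough sketch, a queue message might carry the target URL plus a base64-encoded screenshot; the field names below are assumptions for illustration, not the project's real schema:

```python
import base64
import json

def build_message(url: str, screenshot_path: str) -> bytes:
    """Package a URL and its screenshot as a JSON queue message.

    The {"url": ..., "screenshot": ...} shape is an assumption;
    check the producer/consumer code for the real schema.
    """
    with open(screenshot_path, "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"url": url, "screenshot": screenshot_b64}).encode("utf-8")
```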
## Prerequisites

- Node.js and npm
- Python 3.7+
- RabbitMQ
- PostgreSQL
- OpenAI API key
## Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/web-scraping-analysis-service.git
   cd web-scraping-analysis-service
   ```

2. Set up the producer:

   ```bash
   cd producer
   npm install
   ```

3. Set up the consumer:

   ```bash
   cd ../consumer
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   pip install -r requirements.txt
   ```

4. Set up the database:

   - Create a PostgreSQL database.
   - Run the SQL script to create the necessary table:

     ```sql
     CREATE TABLE website_analyses (
         id SERIAL PRIMARY KEY,
         url TEXT NOT NULL,
         analysis JSONB NOT NULL,
         created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
     );
     ```

5. Configure environment variables: Create a `.env` file in both the producer and consumer directories with the following contents:

   ```
   DB_NAME=your_database_name
   DB_USER=your_database_user
   DB_PASSWORD=your_database_password
   DB_HOST=localhost
   DB_PORT=5432
   OPENAI_API_KEY=your_openai_api_key
   ```
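As a minimal sketch of how the consumer might read these settings, assuming `python-dotenv` and `psycopg2` are among the dependencies in `requirements.txt` (check the file for the actual libraries used):

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Connect to PostgreSQL using the variables defined above.
conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)
```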
## Usage

1. Start RabbitMQ: Ensure RabbitMQ is running on your system.

2. Run the producer:

   ```bash
   cd producer
   npm start
   ```

3. Run the consumer:

   ```bash
   cd consumer
   python consumer.py
   ```

4. Add URLs to process: Use the provided script to add URLs to the queue:

   ```bash
   python add_urls.py
   ```
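The real logic lives in `add_urls.py`; a minimal publisher along these lines, assuming `pika` as the RabbitMQ client and a queue named `urls` (both assumptions), could look like:

```python
import pika

# Queue name is illustrative; use the name the producer and
# consumer actually agree on.
QUEUE = "urls"

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

for url in ["https://example.com", "https://example.org"]:
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=url,
        properties=pika.BasicProperties(delivery_mode=2),  # persist messages
    )
    print(f"Queued {url}")

connection.close()
```

Marking messages persistent (`delivery_mode=2`) together with a durable queue keeps queued URLs across broker restarts.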
## Architecture

```
[Producer (TS)] --> [RabbitMQ] --> [Consumer (Python)] --> [PostgreSQL]
       |                                    |
       v                                    v
[Website Screenshots]           [OpenAI GPT-4 Vision API]
```
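To make the diagram concrete, here is a rough sketch of the consumer's receive-analyze-store loop. The queue name, message shape, prompt, and model identifier are all assumptions; the actual implementation is in `consumer.py`:

```python
import json
import os

import pika
import psycopg2
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"], user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"], host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)

def handle_message(ch, method, properties, body):
    # Assumed message shape: {"url": ..., "screenshot": <base64 PNG>}.
    msg = json.loads(body)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # model name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this website screenshot."},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{msg['screenshot']}"}},
            ],
        }],
    )
    analysis = {"text": response.choices[0].message.content}
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO website_analyses (url, analysis) VALUES (%s, %s)",
            (msg["url"], json.dumps(analysis)),
        )
    conn.commit()
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the row is stored

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.queue_declare(queue="urls", durable=True)
channel.basic_consume(queue="urls", on_message_callback=handle_message)
channel.start_consuming()
```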
## Troubleshooting

- Ensure all environment variables are correctly set (a quick check script follows this list).
- Check RabbitMQ and PostgreSQL logs for any connection issues.
- Verify that the OpenAI API key is valid and has sufficient credits.
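For the first point, a quick sanity check (variable names taken from the Setup section; `python-dotenv` assumed) is:

```python
import os

from dotenv import load_dotenv

load_dotenv()

REQUIRED = ["DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "OPENAI_API_KEY"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```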
## Contributing

Contributions are welcome! Please feel free to submit a pull request; detailed steps are listed at the end of this document.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Getting Started

For first-time users, follow these steps to get the project up and running:

1. Install Dependencies:

   - Ensure you have Node.js, Python, RabbitMQ, and PostgreSQL installed on your system.
   - Install the required Node.js packages:

     ```bash
     cd producer
     npm install
     ```

   - Install the required Python packages:

     ```bash
     cd consumer
     pip install -r requirements.txt
     ```

2. Set Up the Database:

   - Create a new PostgreSQL database for the project.
   - Use the provided SQL script to create the necessary table.

3. Configure Environment:

   - Create `.env` files in both the producer and consumer directories.
   - Add all required environment variables as listed in the Setup section.

4. Start the Services:

   - Start RabbitMQ (the method depends on your installation).
   - Run the producer: `npm start` in the producer directory.
   - Run the consumer: `python consumer.py` in the consumer directory.

5. Add URLs:

   - Use the `add_urls.py` script to add some test URLs to the queue.

6. Monitor the Process:

   - Watch the console output of both the producer and consumer.
   - Check the PostgreSQL database for incoming results (a query sketch follows this list).
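For the last step, a simple way to peek at incoming rows, reusing the connection settings from the Setup section, is the following sketch:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)
with conn.cursor() as cur:
    # Show the five most recently analyzed URLs.
    cur.execute(
        "SELECT url, created_at FROM website_analyses ORDER BY created_at DESC LIMIT 5"
    )
    for url, created_at in cur.fetchall():
        print(created_at, url)
conn.close()
```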
## Example Output

Here's an example of what the analysis output might look like:

```json
{
  "url": "https://example.com",
  "analysis": {
    "main_content": "Welcome to Example.com",
    "key_features": ["Simple design", "Clear navigation", "Informative content"],
    "improvement_suggestions": [
      "Add a call-to-action button",
      "Implement a responsive design",
      "Include more visual elements"
    ]
  }
}
```

## Scaling for Production

To scale this system for production use:
- Containerization: Use Docker to containerize each component.
- Load Balancing: Implement a load balancer for multiple producer and consumer instances.
- Database Optimization: Consider database sharding or read replicas for high traffic.
- Monitoring: Implement comprehensive monitoring and alerting systems.
- Error Handling: Enhance error handling and implement a dead-letter queue for failed jobs.
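For the dead-letter queue mentioned in the last point, RabbitMQ supports dead-lettering through queue arguments. A minimal `pika` sketch with illustrative queue names:

```python
import pika

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

# Declare a queue to hold messages that fail processing.
channel.queue_declare(queue="urls_dead", durable=True)

# Route rejected/expired messages from the main queue to the dead-letter queue.
channel.queue_declare(
    queue="urls",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",  # default exchange
        "x-dead-letter-routing-key": "urls_dead",
    },
)
```

A queue's arguments are fixed at declaration time, so the main queue must be declared with these arguments from the start; the consumer can then reject failed messages with `basic_nack(requeue=False)` to route them to the dead-letter queue.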
## Limitations and Future Work

- Currently limited by OpenAI API rate limits and costs.
- Could implement caching to avoid re-analyzing recently processed URLs (a sketch follows this list).
- Potential for adding more diverse analysis models or services.
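The caching idea above could start as a simple lookup against the existing `website_analyses` table before a URL is queued; a sketch, with an illustrative time window:

```python
def recently_analyzed(conn, url: str, hours: int = 24) -> bool:
    """Return True if this URL was analyzed within the last `hours` hours.

    `conn` is an open psycopg2 connection; the 24-hour window is an
    illustrative default, not a project setting.
    """
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT 1 FROM website_analyses
            WHERE url = %s
              AND created_at > NOW() - make_interval(hours => %s)
            LIMIT 1
            """,
            (url, hours),
        )
        return cur.fetchone() is not None
```

A producer-side check like this avoids paying for duplicate GPT-4 Vision calls within the window.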
To contribute:

1. Fork the repository.
2. Create a new branch for your feature: `git checkout -b feature-name`.
3. Make your changes and commit them: `git commit -m 'Add some feature'`.
4. Push to the branch: `git push origin feature-name`.
5. Submit a pull request.

Please adhere to the project's coding standards and include tests for new features.