# Web Scraping and Analysis Service

This project implements a distributed web scraping and analysis service using RabbitMQ, TypeScript (for the producer), and Python (for the consumer). It captures screenshots of websites, analyzes them using OpenAI's GPT-4 Vision model, and stores the results in a PostgreSQL database.
## Table of Contents

- [Project Overview](#project-overview)
- [Components](#components)
- [Prerequisites](#prerequisites)
- [Setup](#setup)
- [Usage](#usage)
- [Architecture](#architecture)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
- [Getting Started](#getting-started)
- [Example Output](#example-output)
- [Scaling for Production](#scaling-for-production)
- [Limitations and Future Work](#limitations-and-future-work)
## Project Overview

This service automates the process of capturing website screenshots, extracting information from these images using AI, and storing the analysis results. It's designed to be scalable and can handle a large number of URLs efficiently.
## Components

- Producer (TypeScript): Captures screenshots of websites and sends them to a RabbitMQ queue.
- Consumer (Python): Processes screenshots from the queue, analyzes them using OpenAI's GPT-4 Vision API, and stores results in a PostgreSQL database.
- RabbitMQ: Message broker for coordinating work between the producer and consumer (a sketch of a possible message format follows this list).
- PostgreSQL: Database for storing analysis results.
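The actual message format is defined in the producer and consumer code. As a rough sketch, a queue message might carry the target URL plus a base64-encoded screenshot; the field names below are assumptions for illustration, not the project's real schema:

```python
import base64
import json

def build_message(url: str, screenshot_path: str) -> bytes:
    """Package a URL and its screenshot as a JSON queue message.

    The {"url": ..., "screenshot": ...} shape is an assumption;
    check the producer/consumer code for the real schema.
    """
    with open(screenshot_path, "rb") as f:
        screenshot_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"url": url, "screenshot": screenshot_b64}).encode("utf-8")
```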
## Prerequisites

- Node.js and npm
- Python 3.7+
- RabbitMQ
- PostgreSQL
- OpenAI API key
## Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/web-scraping-analysis-service.git
   cd web-scraping-analysis-service
   ```

2. Set up the producer:

   ```bash
   cd producer
   npm install
   ```

3. Set up the consumer:

   ```bash
   cd ../consumer
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   pip install -r requirements.txt
   ```

4. Set up the database:

   - Create a PostgreSQL database.
   - Run the SQL script to create the necessary table:

     ```sql
     CREATE TABLE website_analyses (
         id SERIAL PRIMARY KEY,
         url TEXT NOT NULL,
         analysis JSONB NOT NULL,
         created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
     );
     ```

5. Configure environment variables: Create a `.env` file in both the producer and consumer directories with the following contents:

   ```
   DB_NAME=your_database_name
   DB_USER=your_database_user
   DB_PASSWORD=your_database_password
   DB_HOST=localhost
   DB_PORT=5432
   OPENAI_API_KEY=your_openai_api_key
   ```
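As a minimal sketch of how the consumer might read these settings, assuming `python-dotenv` and `psycopg2` are among the dependencies in `requirements.txt` (check the file for the actual libraries used):

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Connect to PostgreSQL using the variables defined above.
conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)
```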
## Usage

1. Start RabbitMQ: Ensure RabbitMQ is running on your system.

2. Run the producer:

   ```bash
   cd producer
   npm start
   ```

3. Run the consumer:

   ```bash
   cd consumer
   python consumer.py
   ```

4. Add URLs to process: Use the provided script to add URLs to the queue:

   ```bash
   python add_urls.py
   ```
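The real logic lives in `add_urls.py`; a minimal publisher along these lines, assuming `pika` as the RabbitMQ client and a queue named `urls` (both assumptions), could look like:

```python
import pika

# Queue name is illustrative; use the name the producer and
# consumer actually agree on.
QUEUE = "urls"

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

for url in ["https://example.com", "https://example.org"]:
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=url,
        properties=pika.BasicProperties(delivery_mode=2),  # persist messages
    )
    print(f"Queued {url}")

connection.close()
```

Marking messages persistent (`delivery_mode=2`) together with a durable queue keeps queued URLs across broker restarts.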
## Architecture

```
[Producer (TS)] --> [RabbitMQ] --> [Consumer (Python)] --> [PostgreSQL]
       |                                    |
       v                                    v
[Website Screenshots]           [OpenAI GPT-4 Vision API]
```
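To make the diagram concrete, here is a rough sketch of the consumer's receive-analyze-store loop. The queue name, message shape, prompt, and model identifier are all assumptions; the actual implementation is in `consumer.py`:

```python
import json
import os

import pika
import psycopg2
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"], user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"], host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)

def handle_message(ch, method, properties, body):
    # Assumed message shape: {"url": ..., "screenshot": <base64 PNG>}.
    msg = json.loads(body)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # model name is an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze this website screenshot."},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{msg['screenshot']}"}},
            ],
        }],
    )
    analysis = {"text": response.choices[0].message.content}
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO website_analyses (url, analysis) VALUES (%s, %s)",
            (msg["url"], json.dumps(analysis)),
        )
    conn.commit()
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the row is stored

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
channel.queue_declare(queue="urls", durable=True)
channel.basic_consume(queue="urls", on_message_callback=handle_message)
channel.start_consuming()
```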
## Troubleshooting

- Ensure all environment variables are correctly set (a quick check script follows this list).
- Check RabbitMQ and PostgreSQL logs for any connection issues.
- Verify that the OpenAI API key is valid and has sufficient credits.
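For the first point, a quick sanity check (variable names taken from the Setup section; `python-dotenv` assumed) is:

```python
import os

from dotenv import load_dotenv

load_dotenv()

REQUIRED = ["DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST", "DB_PORT", "OPENAI_API_KEY"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```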
## Contributing

Contributions are welcome! Please feel free to submit a pull request; detailed steps are listed at the end of this document.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Getting Started

For first-time users, follow these steps to get the project up and running:

1. Install Dependencies:

   - Ensure you have Node.js, Python, RabbitMQ, and PostgreSQL installed on your system.
   - Install the required Node.js packages:

     ```bash
     cd producer
     npm install
     ```

   - Install the required Python packages:

     ```bash
     cd consumer
     pip install -r requirements.txt
     ```

2. Set Up the Database:

   - Create a new PostgreSQL database for the project.
   - Use the provided SQL script to create the necessary table.

3. Configure Environment:

   - Create `.env` files in both the producer and consumer directories.
   - Add all required environment variables as listed in the Setup section.

4. Start the Services:

   - Start RabbitMQ (the method depends on your installation).
   - Run the producer: `npm start` in the producer directory.
   - Run the consumer: `python consumer.py` in the consumer directory.

5. Add URLs:

   - Use the `add_urls.py` script to add some test URLs to the queue.

6. Monitor the Process:

   - Watch the console output of both the producer and consumer.
   - Check the PostgreSQL database for incoming results (a query sketch follows this list).
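For the last step, a simple way to peek at incoming rows, reusing the connection settings from the Setup section, is the following sketch:

```python
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()

conn = psycopg2.connect(
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
)
with conn.cursor() as cur:
    # Show the five most recently analyzed URLs.
    cur.execute(
        "SELECT url, created_at FROM website_analyses ORDER BY created_at DESC LIMIT 5"
    )
    for url, created_at in cur.fetchall():
        print(created_at, url)
conn.close()
```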
## Example Output

Here's an example of what the analysis output might look like:

```json
{
  "url": "https://example.com",
  "analysis": {
    "main_content": "Welcome to Example.com",
    "key_features": ["Simple design", "Clear navigation", "Informative content"],
    "improvement_suggestions": [
      "Add a call-to-action button",
      "Implement a responsive design",
      "Include more visual elements"
    ]
  }
}
```

## Scaling for Production

To scale this system for production use:
- Containerization: Use Docker to containerize each component.
- Load Balancing: Implement a load balancer for multiple producer and consumer instances.
- Database Optimization: Consider database sharding or read replicas for high traffic.
- Monitoring: Implement comprehensive monitoring and alerting systems.
- Error Handling: Enhance error handling and implement a dead-letter queue for failed jobs.
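For the dead-letter queue mentioned in the last point, RabbitMQ supports dead-lettering through queue arguments. A minimal `pika` sketch with illustrative queue names:

```python
import pika

channel = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

# Declare a queue to hold messages that fail processing.
channel.queue_declare(queue="urls_dead", durable=True)

# Route rejected/expired messages from the main queue to the dead-letter queue.
channel.queue_declare(
    queue="urls",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "",  # default exchange
        "x-dead-letter-routing-key": "urls_dead",
    },
)
```

A queue's arguments are fixed at declaration time, so the main queue must be declared with these arguments from the start; the consumer can then reject failed messages with `basic_nack(requeue=False)` to route them to the dead-letter queue.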
## Limitations and Future Work

- Currently limited by OpenAI API rate limits and costs.
- Could implement caching to avoid re-analyzing recently processed URLs (a sketch follows this list).
- Potential for adding more diverse analysis models or services.
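The caching idea above could start as a simple lookup against the existing `website_analyses` table before a URL is queued; a sketch, with an illustrative time window:

```python
def recently_analyzed(conn, url: str, hours: int = 24) -> bool:
    """Return True if this URL was analyzed within the last `hours` hours.

    `conn` is an open psycopg2 connection; the 24-hour window is an
    illustrative default, not a project setting.
    """
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT 1 FROM website_analyses
            WHERE url = %s
              AND created_at > NOW() - make_interval(hours => %s)
            LIMIT 1
            """,
            (url, hours),
        )
        return cur.fetchone() is not None
```

A producer-side check like this avoids paying for duplicate GPT-4 Vision calls within the window.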
To contribute:

1. Fork the repository.
2. Create a new branch for your feature: `git checkout -b feature-name`.
3. Make your changes and commit them: `git commit -m 'Add some feature'`.
4. Push to the branch: `git push origin feature-name`.
5. Submit a pull request.

Please adhere to the project's coding standards and include tests for new features.