Ibrahimghali/Spacenets.tn-Scraper-Cluster
Spacenets.tn Distributed Scraper

A high-performance, distributed web scraping system for Spacenets.tn, built with Scrapy, Redis, and Docker. This system leverages multiple scraper nodes working in parallel to efficiently extract product data.

System Architecture

The system is built around modular Scrapy nodes that coordinate through a central Redis server, which manages the shared URL queue and collects scraped items. Each node independently fetches pages and stores its results, so the crawl scales horizontally by adding nodes.

Key Components

Scrapy Project

  • Includes items.py, middlewares.py, item_spider.py, and pipelines.py.
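The shape of a scraped product is defined in items.py. As a rough sketch of what such an item might carry (the field names here are illustrative assumptions, not the repo's actual schema; a plain dataclass stands in for the real scrapy.Item):

```python
from dataclasses import dataclass, asdict


@dataclass
class ProductItem:
    """Illustrative stand-in for the scrapy.Item defined in items.py.

    The field names (name, price, url) are assumptions, not the
    repository's actual schema.
    """
    name: str
    price: float
    url: str

    def to_dict(self) -> dict:
        # Pipelines typically serialize items to a dict before
        # pushing them to Redis or writing them to disk.
        return asdict(self)
```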

Redis Server

  • Central queue and item store for scraper coordination.

Redis Commander

  • GUI for inspecting Redis contents.

CLI Utilities

  • push_urls_to_redis.py: Seeds the Redis queue with initial URLs.
  • check_redis.py: Monitors scraper progress and stats.
  • export_data.py: Extracts scraped data from Redis to files.
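A seeding script like push_urls_to_redis.py might do little more than push start URLs onto a Redis list. A minimal sketch, assuming the redis-py client and a made-up queue key (`spacenets:start_urls` is a guess, not the repo's actual key):

```python
def push_urls(client, urls, key="spacenets:start_urls"):
    """Seed the shared queue; each scraper node pops from this key.

    `client` is any object with an lpush(key, *values) method, e.g. a
    redis.Redis instance. The key name is an assumption.
    """
    if not urls:
        return 0
    return client.lpush(key, *urls)


if __name__ == "__main__":
    import redis  # requires a running Redis server on localhost:6379

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    seeded = push_urls(r, ["https://spacenets.tn/ordinateurs-portables"])
    print(f"queue length after seeding: {seeded}")
```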

Host Volumes

  • Persistent storage for logs and scraped data.

Architecture Diagram

The following diagram visualizes the internal workflow:

[Diagram: Spacenets Scraper Architecture]

Diagram Highlights

  • item_spider.py (Scrapy spider) uses items.py definitions and middlewares.py hooks to fetch data.
  • Scraped items are processed via pipelines.py, and data is:
    • Sent to Redis
    • Stored in host-mounted logs/ and data/ directories
  • settings.py includes both Scrapy and Redis integration config.
  • CLI scripts interact directly with Redis to:
    • Seed URLs (push_urls_to_redis.py)
    • Poll stats (check_redis.py)
    • Export data (export_data.py)
  • Redis Commander allows visual inspection of data queues and stored items.
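The pipeline step in the diagram can be sketched as follows. This is a minimal illustration, not the repo's pipelines.py: the key name and JSON serialization are assumptions, and the Redis client is injected so the class stays testable:

```python
import json


class RedisItemsPipeline:
    """Sketch of a pipeline that pushes each scraped item onto a Redis list.

    `client` needs only an rpush(key, value) method, e.g. a redis.Redis
    instance. The key name "spacenets:items" is assumed.
    """

    def __init__(self, client, key="spacenets:items"):
        self.client = client
        self.key = key

    def process_item(self, item, spider=None):
        # Serialize to JSON so any consumer (export script, Redis
        # Commander) can read the payload without Scrapy installed.
        self.client.rpush(self.key, json.dumps(dict(item)))
        return item
```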

Prerequisites

  • Docker and Docker Compose
  • Python 3.9+
  • Git

Installation

  1. Clone the repository:

    git clone https://github.com/Ibrahimghali/Spacenets.tn-Scraper.git
    cd Spacenets.tn-Scraper
  2. Create a virtual environment and install dependencies:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  3. Create necessary directories:

    mkdir -p logs data

Deployment Architecture

The system employs a distributed architecture with the following components:

  • Redis Server: Central coordination and data storage
  • Scraper Nodes: Independent workers sharing the crawling workload
  • Redis Commander: Web-based UI for monitoring Redis data
  • Docker Compose: Orchestrates all components
The node topology, in Mermaid notation:

    graph TD
        A[Redis Server] <--> B[Scraper Node 1]
        A <--> C[Scraper Node 2]
        A <--> D[Scraper Node 3]
        E[Redis Commander] --> A
        F[User] --> E

Usage

Starting the Distributed Scraping System

  1. Start the Docker containers:

    cd docker
    docker-compose up -d
  2. Push initial URLs to Redis:

    python -m utils.push_urls_to_redis
  3. Monitor scraping progress:

    python -m utils.check_redis --continuous

Stopping the System

To stop the system:

cd docker
docker-compose down

Monitoring Script

The system includes a monitoring script:

python -m utils.check_redis --continuous --debug

This provides real-time statistics, including:

  • Pending URLs
  • Visited URLs
  • Processed Items
  • Queued Requests
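A monitoring loop like check_redis.py could gather those four counters roughly as below. All key names and Redis types are guesses (a list for pending URLs and items, a set for visited URLs), not the repository's actual layout:

```python
def collect_stats(client, prefix="spacenets"):
    """Return the four counters the monitoring script reports.

    Assumptions: pending URLs, items, and queued requests live in Redis
    lists; visited URLs live in a set. `client` needs only llen() and
    scard(), as provided by redis.Redis.
    """
    return {
        "pending_urls": client.llen(f"{prefix}:start_urls"),
        "visited_urls": client.scard(f"{prefix}:visited"),
        "processed_items": client.llen(f"{prefix}:items"),
        "queued_requests": client.llen(f"{prefix}:requests"),
    }
```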

Exporting Data

When scraping is complete, export the data:

python -m utils.export_data
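An export step along the lines of export_data.py might drain the items list into a JSON Lines file. A sketch under the same assumptions as above (key name `spacenets:items` and JSON payloads are guesses; the real script may use different keys or formats):

```python
import json


def export_items(client, path, key="spacenets:items"):
    """Drain the scraped-items list from Redis into a JSON Lines file.

    Sketch only. `client` needs an lpop(key) method returning a string
    or None when the list is empty, e.g. redis.Redis(decode_responses=True).
    """
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        while True:
            raw = client.lpop(key)
            if raw is None:
                break  # list drained
            fh.write(json.dumps(json.loads(raw)) + "\n")
            count += 1
    return count
```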

Project Structure

Spacenets.tn-Scraper/
├── docker/                     # Docker configuration
│   ├── Dockerfile
│   └── docker-compose.yml
├── logs/                       # Log files directory
├── data/                       # Data output directory
├── utils/                      # Utility scripts
│   ├── push_urls_to_redis.py   # Seeds Redis with URLs
│   ├── check_redis.py          # Monitoring script
│   └── export_data.py          # Data export script
├── spacenets/                  # Scrapy project
│   ├── spiders/
│   │   └── item_spider.py      # Main spider
│   ├── items.py                # Item definitions
│   ├── pipelines.py            # Processing pipelines
│   └── settings.py             # Scrapy settings
├── requirements.txt            # Python dependencies
└── scrapy.cfg                  # Scrapy configuration

Performance Tuning

Optimize scraper performance by adjusting these parameters in settings.py:

  • CONCURRENT_REQUESTS: Number of concurrent requests per node
  • DOWNLOAD_DELAY: Delay between requests (in seconds)
  • CONCURRENT_REQUESTS_PER_DOMAIN: Limit concurrent requests to the same domain
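As a conservative starting point, the corresponding settings.py fragment might look like this (the values are illustrative, not the repo's defaults; since every node targets the same site, the per-domain cap is the setting that matters most):

```python
# Illustrative tuning values -- adjust per node and per target-site policy.
CONCURRENT_REQUESTS = 16            # total concurrent requests per node
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain to stay polite
```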

About

Spacenets.tn-Scraper is a web scraping project built using the Scrapy framework. The purpose of this project is to extract detailed information from the Spacenets.tn website, which specializes in selling electronic products such as air conditioners, laptops, servers, and more.
