A high-performance, distributed web scraping system for Spacenets.tn, built with Scrapy, Redis, and Docker. This system leverages multiple scraper nodes working in parallel to efficiently extract product data.
The scraping system is designed around modular, autoscaling Scrapy nodes that interact with a Redis server for distributed queue management and data sharing. Each scraper node independently fetches data and stores results.
- Scraper Nodes: Include `items.py`, `middlewares.py`, `item_spider.py`, and `pipelines.py`.
- Redis Server: Central queue and item store for scraper coordination.
- Redis Commander: GUI for inspecting Redis contents.
- Utility Scripts:
  - `push_urls_to_redis.py`: Seeds the Redis queue with initial URLs.
  - `check_redis.py`: Monitors scraper progress and stats.
  - `export_data.py`: Exports scraped data from Redis to files.
- Host-Mounted Volumes: Persistent storage for logs and scraped data.
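To make the coordination concrete, here is a hypothetical sketch of what a seeding script like `utils/push_urls_to_redis.py` might do. A real run would use a `redis.Redis` client; a tiny in-memory stand-in keeps the sketch self-contained, and the key name `item_spider:start_urls` follows the scrapy-redis convention (both are assumptions, not the project's confirmed internals).

```python
# Sketch of seeding the shared URL queue that scraper nodes consume.
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for redis.Redis (lpush/llen only)."""

    def __init__(self):
        self._lists = defaultdict(list)

    def lpush(self, key, *values):
        # LPUSH prepends, matching Redis semantics.
        for v in values:
            self._lists[key].insert(0, v)
        return len(self._lists[key])

    def llen(self, key):
        return len(self._lists[key])


def seed_urls(client, urls, key="item_spider:start_urls"):
    """Push each start URL onto the shared list the spiders consume."""
    for url in urls:
        client.lpush(key, url)
    return client.llen(key)


client = FakeRedis()  # real code: redis.Redis(host="localhost", port=6379)
print(seed_urls(client, ["https://spacenets.tn/"]))  # → 1
```

Because every node pops from the same list, adding capacity is just a matter of starting more scraper containers.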
The internal workflow:
- `item_spider.py` (Scrapy spider) uses `items.py` definitions and `middlewares.py` hooks to fetch data.
- Scraped items are processed via `pipelines.py`, and data is:
  - sent to Redis
  - stored in host-mounted `logs/` and `data/` directories
- `settings.py` includes both Scrapy and Redis integration config.
- CLI scripts interact directly with Redis to:
  - seed URLs (`push_urls_to_redis.py`)
  - poll stats (`check_redis.py`)
  - export data (`export_data.py`)
- Redis Commander allows visual inspection of data queues and stored items.
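If the Scrapy–Redis wiring is done through the `scrapy-redis` library (an assumption; the project could also use a custom integration), the relevant glue in `settings.py` would look roughly like this sketch:

```python
# settings.py excerpt (illustrative; assumes scrapy-redis is the glue layer)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # requests queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup fingerprint set
SCHEDULER_PERSIST = True                                    # keep the queue across restarts
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}  # items pushed to Redis
REDIS_URL = "redis://redis:6379"  # "redis" = assumed Docker Compose service name
```

With the scheduler and dupefilter in Redis, every node shares one frontier and one visited set, which is what lets nodes be added or removed freely.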
- Docker and Docker Compose
- Python 3.9+
- Git
1. Clone the repository:

   ```bash
   git clone https://github.com/Ibrahimghali/Spacenets.tn-Scraper.git
   cd Spacenets.tn-Scraper
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Create the necessary directories:

   ```bash
   mkdir -p logs data
   ```
The system employs a distributed architecture with the following components:
- Redis Server: Central coordination and data storage
- Scraper Nodes: Independent workers sharing the crawling workload
- Redis Commander: Web-based UI for monitoring Redis data
- Docker Compose: Orchestrates all components
```mermaid
graph TD
    A[Redis Server] <--> B[Scraper Node 1]
    A <--> C[Scraper Node 2]
    A <--> D[Scraper Node 3]
    E[Redis Commander] --> A
    F[User] --> E
```
1. Start the Docker containers:

   ```bash
   cd docker
   docker-compose up -d
   ```

2. Push initial URLs to Redis:

   ```bash
   python -m utils.push_urls_to_redis
   ```

3. Monitor scraping progress:

   ```bash
   python -m utils.check_redis --continuous
   ```
To stop the system:

```bash
cd docker
docker-compose down
```

The system includes a monitoring script:

```bash
python -m utils.check_redis --continuous --debug
```

This provides real-time statistics, including:
- Pending URLs
- Visited URLs
- Processed Items
- Queued Requests
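Each of these counters maps naturally onto one Redis data type. As a hypothetical sketch of how a script like `check_redis.py` could gather them (key names follow scrapy-redis conventions and are assumptions; a minimal in-memory stub stands in for `redis.Redis` so the sketch runs standalone):

```python
class StubRedis:
    """Stand-in exposing only the commands the stats sketch needs."""

    def __init__(self, lists=None, sets=None, zsets=None):
        self.lists = lists or {}
        self.sets = sets or {}
        self.zsets = zsets or {}

    def llen(self, key):   # list length (pending URLs, processed items)
        return len(self.lists.get(key, []))

    def scard(self, key):  # set cardinality (dupefilter / visited URLs)
        return len(self.sets.get(key, set()))

    def zcard(self, key):  # sorted-set cardinality (scheduler queue)
        return len(self.zsets.get(key, {}))


def collect_stats(client):
    # Key names are assumptions based on scrapy-redis defaults.
    return {
        "pending_urls": client.llen("item_spider:start_urls"),
        "visited_urls": client.scard("item_spider:dupefilter"),
        "processed_items": client.llen("item_spider:items"),
        "queued_requests": client.zcard("item_spider:requests"),
    }


stub = StubRedis(lists={"item_spider:items": ["{}", "{}"]})
print(collect_stats(stub))
```

A `--continuous` mode would simply call `collect_stats` in a loop with a short sleep and reprint the dict.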
When scraping is complete, export the data:

```bash
python -m utils.export_data
```

Project structure:

```
Spacenets.tn-Scraper/
├── docker/                    # Docker configuration
│   ├── Dockerfile
│   └── docker-compose.yml
├── logs/                      # Log files directory
├── data/                      # Data output directory
├── utils/                     # Utility scripts
│   ├── push_urls_to_redis.py  # Seeds Redis with URLs
│   ├── check_redis.py         # Monitoring script
│   └── export_data.py         # Data export script
├── spacenets/                 # Scrapy project
│   ├── spiders/
│   │   └── item_spider.py     # Main spider
│   ├── items.py               # Item definitions
│   ├── pipelines.py           # Processing pipelines
│   └── settings.py            # Scrapy settings
├── requirements.txt           # Python dependencies
└── scrapy.cfg                 # Scrapy configuration
```
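As an illustration of the export step, here is a hypothetical sketch of what `utils/export_data.py` might do: drain serialized items from a Redis list and write them as JSON Lines into the `data/` directory. The key name and output format are assumptions, and a stub client stands in for `redis.Redis`.

```python
import json


def export_items(client, path, key="item_spider:items"):
    """Pop every serialized item off the Redis list and write JSON Lines."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        while (raw := client.rpop(key)) is not None:
            f.write(json.dumps(json.loads(raw)) + "\n")
            count += 1
    return count


class StubRedis:
    """In-memory stand-in so the sketch runs without a server."""

    def __init__(self, items):
        self.items = list(items)

    def rpop(self, key):
        return self.items.pop() if self.items else None


n = export_items(StubRedis(['{"name": "demo"}']), "export_demo.jsonl")
print(f"Exported {n} item(s)")  # → Exported 1 item(s)
```

Popping with `rpop` (rather than reading and then deleting the list) keeps the export safe to re-run: a second invocation simply finds the list empty and exports nothing.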
Optimize scraper performance by adjusting these parameters in `settings.py`:

- `CONCURRENT_REQUESTS`: Number of concurrent requests per node
- `DOWNLOAD_DELAY`: Delay between requests (in seconds)
- `CONCURRENT_REQUESTS_PER_DOMAIN`: Limit on concurrent requests to the same domain
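For example, a conservative starting point might look like the following (illustrative values, not the project's actual defaults; remember the per-node limits multiply across scraper nodes):

```python
# settings.py — example throttle values; tune to the target site's tolerance
CONCURRENT_REQUESTS = 16            # total in-flight requests per node
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap, to stay polite
```

With three nodes, these settings allow up to 48 concurrent requests system-wide, so lower the per-node numbers as you scale out.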
