A robust, scalable web scraper with comprehensive error handling, multiple output formats, and both synchronous and asynchronous scraping capabilities.
Features:

- HTTP Requests: Robust request handling with retries and timeouts
- HTML Parsing: BeautifulSoup with CSS selectors and XPath support
- Error Handling: Comprehensive error handling and logging
- Rate Limiting: Configurable delays between requests
- Pagination Support: Automatic pagination handling
- Multiple Output Formats: CSV, JSON, and Excel support
- Async Scraping: High-performance concurrent scraping
- Robots.txt Compliance: Automatic robots.txt checking
- User-Agent Rotation: Fake user agent generation
- Proxy Support: Ready for proxy integration
- Configuration-Driven: YAML-based configuration
- Modular Architecture: Clean, maintainable code structure
Installation:

- Clone the repository:

```bash
git clone <repository-url>
cd web-scraper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```
Edit `config.yaml` to configure your scraping targets:
```yaml
scraper:
  timeout: 30
  max_retries: 3
  rate_limit: 1  # seconds between requests

targets:
  my_target:
    base_url: "https://example.com"
    selectors:
      title: "h1.title"
      content: "div.content"
    pagination:
      enabled: true
      next_button: "a.next"
      max_pages: 10
```
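For reference, this file can be read with PyYAML; below is a minimal sketch of pulling a target out of it (an illustration only; the actual loader inside `WebScraper` may work differently):

```python
# Sketch: load config.yaml with PyYAML (assumes PyYAML is installed).
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

# Look up one configured target and its CSS selectors.
target = config['targets']['my_target']
print(target['base_url'], target['selectors']['title'])
```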
Run the scraper from the command line:

```bash
# Scrape a configured target
python main.py --target my_target

# Scrape a custom URL with inline selectors
python main.py --url "https://example.com" --selectors '{"title": "h1", "price": ".price"}'

# Scrape a list of URLs concurrently
python main.py --urls-file urls.txt --selectors '{"title": "h1"}' --async

# Run with the defaults from config.yaml
python main.py
```
Or drive it from Python:

```python
from src.scraper import WebScraper

# Initialize scraper
scraper = WebScraper('config.yaml')

# Scrape a configured target
data = scraper.scrape_target('my_target')

# Scrape a custom URL
selectors = {'title': 'h1', 'price': '.price'}
data = scraper.scrape_custom_url('https://example.com', selectors)

# Async scraping
urls = ['https://example1.com', 'https://example2.com']
data = scraper.run_async_scrape(urls, selectors)

# Save data and release resources
scraper.save_data(data, 'my_output')
scraper.close()
```
Configure the output format in `config.yaml`:

CSV:

```yaml
output:
  format: "csv"
  filename: "scraped_data"
  include_timestamp: true
```

JSON:

```yaml
output:
  format: "json"
  filename: "scraped_data"
```

Excel:

```yaml
output:
  format: "xlsx"
  filename: "scraped_data"
```
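For illustration, a storage handler could dispatch on the configured format using pandas; this is a sketch under that assumption, not the actual code in `src/storage.py`:

```python
import pandas as pd

def save_records(records, cfg):
    """Write a list of dicts using the format from the output config (illustrative only)."""
    df = pd.DataFrame(records)
    path = f"{cfg['filename']}.{cfg['format']}"
    if cfg['format'] == 'csv':
        df.to_csv(path, index=False)
    elif cfg['format'] == 'json':
        df.to_json(path, orient='records')
    elif cfg['format'] == 'xlsx':
        df.to_excel(path, index=False)  # requires openpyxl
    else:
        raise ValueError(f"Unsupported format: {cfg['format']}")
    return path
```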
The scraper includes comprehensive error handling:
- Request Failures: Automatic retries with exponential backoff
- Parsing Errors: Graceful handling of missing elements
- Network Issues: Timeout handling and connection error recovery
- Data Validation: URL validation and content type checking
- Logging: Detailed logging to file and console
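The retry-with-exponential-backoff pattern mentioned above looks roughly like this (a sketch with `requests`; the shipped implementation may differ):

```python
import time
import requests

def fetch_with_retries(url, max_retries=3, timeout=30):
    """Retry transient failures with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of retries; propagate the last error
            time.sleep(2 ** attempt)
```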
Rate limiting and politeness features:

- Configurable Delays: Set delays between requests
- Robots.txt Compliance: Automatic robots.txt checking
- User-Agent Rotation: Avoid detection with rotating user agents
- Respectful Scraping: Built-in safeguards for ethical scraping
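The robots.txt check can be implemented with the standard library's `urllib.robotparser`; here is a sketch of the idea, combined with a fixed rate-limit delay (illustrative, not necessarily the scraper's own code):

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent='*'):
    """Check the site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

if allowed_by_robots('https://example.com/page'):
    time.sleep(1)  # rate_limit delay between requests
    ...  # fetch the page
```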
For high-performance scraping of multiple URLs:
```python
# Scrape 100 URLs concurrently
urls = [f'https://example.com/page{i}' for i in range(1, 101)]
data = scraper.run_async_scrape(urls, selectors)
```
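Concurrency of this kind is typically built on `aiohttp` and `asyncio.gather`; below is a self-contained sketch of the underlying pattern (not necessarily how `run_async_scrape` is implemented):

```python
import asyncio
import aiohttp

async def fetch_all(urls, timeout_s=30):
    """Fetch many URLs concurrently and return their HTML bodies."""
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async def fetch(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        # return_exceptions=True keeps one failed URL from aborting the batch
        return await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)

pages = asyncio.run(fetch_all(['https://example1.com', 'https://example2.com']))
```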
Extend the scraper with custom processing:
```python
import time

from src.scraper import WebScraper

class CustomScraper(WebScraper):
    def custom_parse_data(self, html, selectors):
        data = super().parse_data(html, selectors)
        # Add custom processing: stamp each record with the parse time
        data['processed_at'] = time.time()
        return data
```
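A hypothetical usage, feeding the custom parser HTML fetched with `requests` (adapt to the actual `WebScraper` API):

```python
import requests

scraper = CustomScraper('config.yaml')
html = requests.get('https://example.com', timeout=30).text
data = scraper.custom_parse_data(html, {'title': 'h1'})
```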
Automatic pagination support:
```yaml
pagination:
  enabled: true
  next_button: "a.next-page"
  max_pages: 50
```
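Conceptually, pagination follows the `next_button` selector until no link is found or `max_pages` is reached; a sketch with BeautifulSoup (illustrative only; `fetch_html` stands in for whatever page-retrieval function you use):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def paginate(fetch_html, start_url, next_selector='a.next-page', max_pages=50):
    """Yield page HTML, following the 'next' link up to max_pages times."""
    url, page = start_url, 0
    while url and page < max_pages:
        html = fetch_html(url)
        yield html
        page += 1
        # Resolve the next page's relative href against the current URL
        link = BeautifulSoup(html, 'html.parser').select_one(next_selector)
        url = urljoin(url, link['href']) if link and link.get('href') else None
```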
Logs are written to `scraper.log` and to the console:

```
2024-01-15 10:30:15 - WebScraper - INFO - Fetching: https://example.com
2024-01-15 10:30:16 - WebScraper - INFO - Successfully scraped 25 items
```
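The dual file-and-console output above corresponds to a standard `logging` setup; a minimal sketch that reproduces the same log format:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('scraper.log'), logging.StreamHandler()],
)
logger = logging.getLogger('WebScraper')
logger.info('Fetching: https://example.com')
```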
Project layout:

```
web-scraper/
├── src/
│   ├── __init__.py
│   ├── scraper.py          # Main scraper class
│   ├── async_scraper.py    # Async scraping functionality
│   ├── storage.py          # Data storage handlers
│   └── utils.py            # Utility functions
├── config.yaml             # Configuration file
├── main.py                 # CLI entry point
├── requirements.txt        # Dependencies
└── README.md               # This file
```
Ethical scraping guidelines:

- Respect robots.txt: The scraper automatically checks robots.txt
- Rate Limiting: Always use appropriate delays between requests
- Terms of Service: Review website terms before scraping
- Copyright: Respect intellectual property rights
- Personal Data: Handle personal data according to privacy laws
To contribute:

- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues and questions:
- Check the logs in `scraper.log`
- Review the configuration in `config.yaml`
- Open an issue on GitHub
Example target configurations:

E-commerce products:

```yaml
targets:
  ecommerce:
    base_url: "https://shop.example.com/products"
    selectors:
      name: "h2.product-name"
      price: "span.price"
      rating: "div.rating"
      availability: "span.stock"
```

News articles:

```yaml
targets:
  news:
    base_url: "https://news.example.com"
    selectors:
      headline: "h1.headline"
      author: "span.author"
      date: "time.publish-date"
      content: "div.article-body"
```

Job listings:

```yaml
targets:
  jobs:
    base_url: "https://jobs.example.com"
    selectors:
      title: "h3.job-title"
      company: "span.company-name"
      location: "span.location"
      salary: "span.salary"
```
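After adding one of these targets to `config.yaml`, run it by name:

```bash
python main.py --target ecommerce
```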