This project is a Python-based web crawler designed for scalability, flexibility, and seamless integration with a Supabase database. It features an asynchronous manager-worker architecture, advanced content-based tagging, and a wide range of configuration options to control the crawling process.
- Asynchronous Crawling: Uses `asyncio` and `httpx` for high-performance, concurrent crawling.
- Supabase Integration: Connects to a Supabase database to persist crawled URLs, extracted data, and their associated tags.
- Advanced Content-Based Tagging:
- URL Regex: Tag pages based on regular expression patterns in the URL.
- Keyword Matching: Tag pages based on the presence of keywords in the title or body text.
- CSS Selectors: Tag pages based on the presence of CSS selectors.
- Scored Tagging: A simple scoring mechanism to weigh the relevance of tags from different sources (a minimal sketch follows this feature list).
- Targeted Crawling:
- Depth Limiting: Limit the crawl depth with `--max-depth`.
- Path Filtering: Include or exclude specific URL paths from the crawl.
- Subdomain Control: Control whether the crawler is allowed to visit subdomains.
- Flexible Configuration:
- YAML Configuration File: Configure the crawler using a YAML file for easy management of settings.
- Command-Line Overrides: Override settings from the configuration file with command-line arguments.
- Automation:
- Scheduled Crawls: Schedule crawls to run at regular intervals using cron-like expressions or simple intervals.
- Continuous Crawling: Run the crawler in a continuous mode to constantly monitor for new and updated content.
- RSS Feed Discovery: Automatically discover new seed URLs from RSS feeds.
- Monitoring and Health Checks:
- Health Check Endpoint: A `/health` endpoint to monitor the status of the crawler.
- Structured Logging: Structured logging with `structlog` for easier parsing and analysis of logs.
- ML-Powered Tagging (Optional): Uses a local, zero-shot classification model to automatically tag webpages based on their content.
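To make the rule types and the scoring mechanism concrete, here is a minimal sketch of how such rules could be evaluated against a fetched page, assuming BeautifulSoup for the CSS-selector rules. The `Rule` dataclass, the `apply_rules` function, and the per-source weights are illustrative assumptions, not the project's actual `Tagger` API.

```python
import re
from dataclasses import dataclass

# Hypothetical per-source weights; the project's scoring mechanism may differ.
WEIGHTS = {"url_regex": 1.0, "css_selector": 0.8, "keyword": 0.5}

@dataclass
class Rule:
    type: str     # "url_regex", "keyword", or "css_selector"
    pattern: str  # regex, keyword, or CSS selector
    tag: str

def apply_rules(rules, url, title, body_text, soup):
    """Return {tag: score} for every rule that matches the page.

    `soup` is assumed to be a BeautifulSoup document, used only by
    CSS-selector rules.
    """
    scores = {}
    for rule in rules:
        if rule.type == "url_regex":
            matched = re.search(rule.pattern, url) is not None
        elif rule.type == "keyword":
            matched = rule.pattern.lower() in f"{title} {body_text}".lower()
        elif rule.type == "css_selector":
            matched = bool(soup.select(rule.pattern))
        else:
            matched = False
        if matched:
            scores[rule.tag] = scores.get(rule.tag, 0.0) + WEIGHTS[rule.type]
    return scores
```

Tags whose accumulated score clears a chosen threshold would then be persisted with the page.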
- Python 3.10+
- Pip
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd spider
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables:

  Create a `.env` file in the root directory of the project by copying the `.env.example` file. The following variables are available:

  - `SUPABASE_URL`: The URL of your Supabase project.
  - `SUPABASE_KEY`: The API key for your Supabase project.
  - `USE_SUPABASE`: Set to `true` to enable Supabase integration.
  - `USE_ML_TAGGING`: Set to `true` to enable ML-based tagging.
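For orientation, the snippet below shows roughly how these variables could be consumed at startup, assuming the `python-dotenv` and `supabase` (supabase-py) packages are installed. The variable handling is a sketch, not the project's actual startup code.

```python
import os

from dotenv import load_dotenv      # python-dotenv
from supabase import create_client  # supabase-py

load_dotenv()  # read variables from the .env file into the environment

USE_SUPABASE = os.getenv("USE_SUPABASE", "false").lower() == "true"
USE_ML_TAGGING = os.getenv("USE_ML_TAGGING", "false").lower() == "true"

supabase = None
if USE_SUPABASE:
    # Both values must be set when Supabase integration is enabled.
    supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
```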
The crawler can be run in several modes:
To run a single crawl, provide a starting URL:

```bash
python main.py http://example.com
```

To schedule a crawl, use the `--schedule` argument with a cron-like expression or a simple interval:
```bash
# Run once a day at midnight
python main.py --schedule "0 0 * * *" --start-url http://example.com

# Run every 6 hours
python main.py --schedule "6h" --start-url http://example.com
```

To run the crawler in continuous mode, use the `--continuous` flag:
```bash
python main.py --continuous --start-url http://example.com
```

The crawler can be configured using a YAML file. Create a `config.yml` file and use the `--config` argument to specify its path. Command-line arguments override the settings from the file.
Example `config.yml`:
```yaml
max_workers: 8
max_depth: 5
crawl_subdomains: true
log_level: DEBUG
tag_rules:
  - type: keyword
    pattern: python
    tag: python
  - type: css_selector
    pattern: ".post-tag"
    tag: blog-post
```

The following command-line arguments are available:

- `--config`: Path to a YAML configuration file.
- `--max-workers`: The maximum number of concurrent crawl workers to use.
- `--max-depth`: The maximum crawl depth.
- `--include-path`: Path prefix to include in the crawl.
- `--exclude-path`: Path prefix to exclude from the crawl.
- `--crawl-subdomains`: Allow crawling of subdomains.
- `--tag-rule`: A JSON string representing a tagging rule, e.g. `'{"type": "keyword", "pattern": "python", "tag": "python"}'`.
- `--rss-feed`: URL of an RSS feed to use for seed URLs.
- `--schedule`: Cron-like expression or simple interval for scheduling crawls.
- `--continuous`: Run the crawler in continuous mode.
- `--continuous-interval`: Interval in seconds for continuous crawling.
- `--health-check-port`: Port for the health check server.
- `--autoscale`: Enable auto-scaling logic.
- `--log-level`: Set the log level (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL).
- `--log-file`: Path to a file to log to.
- `--export-csv`: Export crawled data to a CSV file.
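Because command-line values take precedence over the YAML file, the effective settings could be assembled along the following lines. `load_settings` and the subset of flags it parses are illustrative; the project's real argument parser covers the full list above.

```python
import argparse
import json

import yaml  # PyYAML


def load_settings(argv=None):
    """Merge a YAML config with CLI overrides (CLI wins); only a subset of flags shown."""
    parser = argparse.ArgumentParser(description="Web crawler")
    parser.add_argument("--config", help="Path to a YAML configuration file")
    parser.add_argument("--max-workers", type=int)
    parser.add_argument("--max-depth", type=int)
    parser.add_argument("--crawl-subdomains", action="store_true", default=None)
    parser.add_argument("--tag-rule", action="append",
                        help="JSON string describing a tagging rule")
    args = parser.parse_args(argv)

    # Start from the YAML file (if any), then let explicitly passed CLI values win.
    settings = {}
    if args.config:
        with open(args.config) as fh:
            settings.update(yaml.safe_load(fh) or {})
    for key, value in vars(args).items():
        if key != "config" and value is not None:
            settings[key] = value

    # --tag-rule values arrive as JSON strings; parse them into rule dicts.
    if settings.get("tag_rule"):
        settings["tag_rules"] = [json.loads(rule) for rule in settings.pop("tag_rule")]
    return settings


# Example: `python main.py --config config.yml --max-depth 2` would return the
# YAML values with max_depth replaced by 2.
```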
The crawler follows an asynchronous manager-worker architecture:
- `CrawlManager`: The central component responsible for managing the crawl. It maintains a queue of URLs to be crawled, distributes them to the worker tasks, and handles the overall crawl logic.
- `CrawlWorker`: An asynchronous worker that receives a URL from the manager, fetches its content, extracts new links, and extracts data and tags.
- `SupabaseClient`: A client for interacting with the Supabase database.
- `Tagger`: A class responsible for applying tagging rules to the crawled content.
- `Scheduler`: A module for scheduling and running crawls.
- `HealthCheck`: A FastAPI application for monitoring the crawler's status.
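The manager-worker hand-off can be pictured with a few lines of `asyncio` and `httpx`. The queue-based sketch below is an assumption about the design, not a copy of `CrawlManager` and `CrawlWorker`; link extraction, tagging, and persistence are omitted.

```python
import asyncio

import httpx


async def worker(name, queue, client):
    """Pull URLs off the shared queue and fetch them; link extraction is omitted."""
    while True:
        url, depth = await queue.get()
        try:
            response = await client.get(url, follow_redirects=True)
            print(f"[{name}] {response.status_code} {url} (depth={depth})")
            # A real worker would parse the response, apply the tag rules, and
            # enqueue newly discovered links that have not been seen yet.
        except httpx.HTTPError as exc:
            print(f"[{name}] failed {url}: {exc}")
        finally:
            queue.task_done()


async def crawl(start_url, max_workers=4):
    queue = asyncio.Queue()
    await queue.put((start_url, 0))  # (url, depth); a `seen` set would guard re-queuing

    async with httpx.AsyncClient(timeout=10) as client:
        workers = [asyncio.create_task(worker(f"w{i}", queue, client))
                   for i in range(max_workers)]
        await queue.join()  # returns once every queued URL has been processed
        for task in workers:
            task.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(crawl("http://example.com"))
```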
- Coding Style: The project follows the PEP 8 style guide for Python code.
- Testing: The project uses `pytest` for testing. Tests are located in the `tests` directory.
- Database: The project uses Supabase for database storage.
- Dependencies: Project dependencies are managed in the `requirements.txt` file.
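As an illustration of the test layout, a module under `tests/` might look like the sketch below. The `is_allowed` helper is defined inline purely so the example runs on its own; the project's real tests exercise its own modules.

```python
# tests/test_path_filter.py -- self-contained illustration of the pytest layout.
from urllib.parse import urlparse

import pytest


def is_allowed(url, include_prefixes=(), exclude_prefixes=()):
    """Hypothetical helper mirroring --include-path / --exclude-path filtering."""
    path = urlparse(url).path or "/"
    if any(path.startswith(prefix) for prefix in exclude_prefixes):
        return False
    if include_prefixes:
        return any(path.startswith(prefix) for prefix in include_prefixes)
    return True


@pytest.mark.parametrize("url,expected", [
    ("http://example.com/blog/post-1", True),
    ("http://example.com/admin/login", False),
])
def test_path_filtering(url, expected):
    assert is_allowed(url, include_prefixes=("/blog",), exclude_prefixes=("/admin",)) == expected
```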