
Spider: A Scalable and Feature-Rich Web Crawler

This project is a Python-based web crawler designed for scalability, flexibility, and seamless integration with a Supabase database. It features an asynchronous manager-worker architecture, advanced content-based tagging, and a wide range of configuration options to control the crawling process.

Features

  • Asynchronous Crawling: Utilizes asyncio and httpx for high-performance, concurrent crawling.
  • Supabase Integration: Seamlessly connects to a Supabase database to persist crawled URLs, extracted data, and their associated tags.
  • Advanced Content-Based Tagging:
    • URL Regex: Tag pages based on regular expression patterns in the URL.
    • Keyword Matching: Tag pages based on the presence of keywords in the title or body text.
    • CSS Selectors: Tag pages based on the presence of CSS selectors.
    • Scored Tagging: A simple scoring mechanism that weighs the relevance of tags from different sources (see the sketch after this list).
  • Targeted Crawling:
    • Depth Limiting: Limit the crawl depth with --max-depth.
    • Path Filtering: Include or exclude specific URL paths from the crawl.
    • Subdomain Control: Control whether the crawler is allowed to visit subdomains.
  • Flexible Configuration:
    • YAML Configuration File: Configure the crawler using a YAML file for easy management of settings.
    • Command-Line Overrides: Override settings from the configuration file with command-line arguments.
  • Automation:
    • Scheduled Crawls: Schedule crawls to run at regular intervals using cron-like expressions or simple intervals.
    • Continuous Crawling: Run the crawler in a continuous mode to constantly monitor for new and updated content.
    • RSS Feed Discovery: Automatically discover new seed URLs from RSS feeds.
  • Monitoring and Health Checks:
    • Health Check Endpoint: A /health endpoint to monitor the status of the crawler.
    • Structured Logging: Uses structlog to produce logs that are easy to parse and analyze.
  • ML-Powered Tagging (Optional): Uses a local, zero-shot classification model to automatically tag webpages based on their content.
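
To make the rule types concrete, here is a minimal sketch of rule-based tagging with per-source scores. It illustrates the idea only and is not the project's Tagger implementation: the rule fields mirror the tag_rules shown in the configuration example below, the url_regex type name and the weights are assumptions of this sketch, and BeautifulSoup is assumed for CSS-selector matching.

# Illustrative only: score tags from URL-regex, keyword, and CSS-selector rules.
import re
from bs4 import BeautifulSoup

RULE_WEIGHTS = {"url_regex": 1.0, "css_selector": 0.8, "keyword": 0.5}  # hypothetical weights

def apply_rules(url, html, rules):
    """Return a {tag: score} mapping for every rule that matches the page."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True).lower()
    scores = {}
    for rule in rules:
        kind, pattern, tag = rule["type"], rule["pattern"], rule["tag"]
        matched = (
            (kind == "url_regex" and re.search(pattern, url) is not None)
            or (kind == "keyword" and pattern.lower() in text)
            or (kind == "css_selector" and bool(soup.select(pattern)))
        )
        if matched:
            scores[tag] = scores.get(tag, 0.0) + RULE_WEIGHTS.get(kind, 1.0)
    return scores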

Getting Started

Prerequisites

  • Python 3.10+
  • Pip

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd spider
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure environment variables:

    Create a .env file in the root directory of the project by copying the .env.example file. The following variables are available (a minimal example follows this list):

    • SUPABASE_URL: The URL of your Supabase project.
    • SUPABASE_KEY: The API key for your Supabase project.
    • USE_SUPABASE: Set to true to enable Supabase integration.
    • USE_ML_TAGGING: Set to true to enable ML-based tagging.
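
    A minimal .env based on these variables might look like the following; the values are placeholders:

    SUPABASE_URL=https://<your-project>.supabase.co
    SUPABASE_KEY=<your-supabase-api-key>
    USE_SUPABASE=true
    USE_ML_TAGGING=false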

Usage

The crawler can be run in several modes:

One-off Crawl

To run a single crawl, provide a starting URL:

python main.py http://example.com

Scheduled Crawl

To schedule a crawl, use the --schedule argument with a cron-like expression or a simple interval:

# Run once a day at midnight
python main.py --schedule "0 0 * * *" --start-url http://example.com

# Run every 6 hours
python main.py --schedule "6h" --start-url http://example.com

Continuous Crawling

To run the crawler in continuous mode, use the --continuous flag:

python main.py --continuous --start-url http://example.com

Configuration

The crawler can be configured using a YAML file. Create a config.yml file and use the --config argument to specify its path. Command-line arguments will override the settings from the file.

Example config.yml:

max_workers: 8
max_depth: 5
crawl_subdomains: true
log_level: DEBUG
tag_rules:
  - type: keyword
    pattern: python
    tag: python
  - type: css_selector
    pattern: ".post-tag"
    tag: blog-post
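
For example, to use this file while overriding max_depth from the command line (flags as described under Optional Arguments below):

python main.py --config config.yml --max-depth 3 http://example.com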

Optional Arguments

  • --config: Path to a YAML configuration file.
  • --max-workers: The maximum number of concurrent crawl workers to run.
  • --max-depth: The maximum crawl depth.
  • --include-path: Path prefix to include in the crawl.
  • --exclude-path: Path prefix to exclude from the crawl.
  • --crawl-subdomains: Allow crawling of subdomains.
  • --tag-rule: A JSON string representing a tagging rule, e.g. '{"type": "keyword", "pattern": "python", "tag": "python"}'.
  • --rss-feed: URL of an RSS feed to use for seed URLs.
  • --schedule: Cron-like expression or simple interval for scheduling crawls.
  • --continuous: Run the crawler in continuous mode.
  • --continuous-interval: Interval in seconds for continuous crawling.
  • --health-check-port: Port for the health check server.
  • --autoscale: Enable auto-scaling logic.
  • --log-level: Set the log level (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL).
  • --log-file: Path to a file to log to.
  • --export-csv: Export crawled data to a CSV file.
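
Several of these options can be combined in a single invocation. For example, a scheduled crawl seeded from an RSS feed, with a depth limit and verbose logging, might look like this (the URLs and values are placeholders):

python main.py --schedule "6h" --rss-feed http://example.com/feed.xml --max-depth 3 --log-level DEBUG --start-url http://example.com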

Development

Architecture

The crawler follows an asynchronous manager-worker architecture (a minimal sketch follows the component list below):

  • CrawlManager: The central component responsible for managing the crawl. It maintains a queue of URLs to be crawled, dispatches them to the worker tasks, and handles the overall crawl logic.
  • CrawlWorker: An asynchronous worker that receives a URL from the manager, fetches its content, extracts new links, and applies data extraction and tagging.
  • SupabaseClient: A client for interacting with the Supabase database.
  • Tagger: A class responsible for applying tagging rules to the crawled content.
  • Scheduler: A module for scheduling and running crawls.
  • HealthCheck: A FastAPI application for monitoring the crawler's status.
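
The following is a minimal sketch of this manager-worker pattern using asyncio and httpx, the libraries named above. It illustrates the control flow only; it is not the project's CrawlManager/CrawlWorker code, which additionally extracts links, data, and tags, enforces depth and path limits, and persists results through the SupabaseClient.

# Illustrative only: a queue of URLs shared between a manager and N worker tasks.
import asyncio
import httpx

async def worker(name, queue, client):
    # CrawlWorker analogue: take a URL off the shared queue and fetch it.
    while True:
        url, depth = await queue.get()
        try:
            response = await client.get(url, follow_redirects=True)
            print(f"{name}: {response.status_code} {url} (depth={depth})")
        except httpx.HTTPError as exc:
            print(f"{name}: failed {url}: {exc}")
        finally:
            queue.task_done()

async def crawl(start_url, max_workers=4):
    # CrawlManager analogue: own the queue, start the workers, and wait
    # until every queued URL has been processed.
    queue = asyncio.Queue()
    await queue.put((start_url, 0))
    async with httpx.AsyncClient(timeout=10) as client:
        workers = [
            asyncio.create_task(worker(f"worker-{i}", queue, client))
            for i in range(max_workers)
        ]
        await queue.join()
        for task in workers:
            task.cancel()

if __name__ == "__main__":
    asyncio.run(crawl("http://example.com"))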

Conventions

  • Coding Style: The project follows the PEP 8 style guide for Python code.
  • Testing: The project uses pytest; tests are located in the tests directory and can be run as shown below.
  • Database: The project uses Supabase for database storage.
  • Dependencies: Project dependencies are managed in the requirements.txt file.
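
Assuming the standard pytest workflow, the test suite can be run from the repository root:

python -m pytest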
