This project is a Python-based web crawler designed for scalability, flexibility, and seamless integration with a Supabase database. It features an asynchronous manager-worker architecture, advanced content-based tagging, and a wide range of configuration options to control the crawling process.
- Asynchronous Crawling: Uses `asyncio` and `httpx` for high-performance, concurrent crawling.
- Supabase Integration: Connects to a Supabase database to persist crawled URLs, extracted data, and their associated tags.
- Advanced Content-Based Tagging:
- URL Regex: Tag pages based on regular expression patterns in the URL.
- Keyword Matching: Tag pages based on the presence of keywords in the title or body text.
- CSS Selectors: Tag pages based on the presence of CSS selectors.
- Scored Tagging: A simple scoring mechanism to weigh the relevance of tags from different sources (a minimal sketch follows this feature list).
- Targeted Crawling:
- Depth Limiting: Limit the crawl depth with `--max-depth`.
- Path Filtering: Include or exclude specific URL paths from the crawl.
- Subdomain Control: Control whether the crawler is allowed to visit subdomains.
- Flexible Configuration:
- YAML Configuration File: Configure the crawler using a YAML file for easy management of settings.
- Command-Line Overrides: Override settings from the configuration file with command-line arguments.
- Automation:
- Scheduled Crawls: Schedule crawls to run at regular intervals using cron-like expressions or simple intervals.
- Continuous Crawling: Run the crawler in a continuous mode to constantly monitor for new and updated content.
- RSS Feed Discovery: Automatically discover new seed URLs from RSS feeds.
- Monitoring and Health Checks:
- Health Check Endpoint: A `/health` endpoint to monitor the status of the crawler.
- Structured Logging: Structured logging with `structlog` for easier parsing and analysis of logs.
- ML-Powered Tagging (Optional): Uses a local, zero-shot classification model to automatically tag webpages based on their content.
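To make the rule types and the scoring mechanism concrete, here is a minimal sketch of how such rules could be evaluated against a fetched page, assuming BeautifulSoup for the CSS-selector rules. The `Rule` dataclass, the `apply_rules` function, and the per-source weights are illustrative assumptions, not the project's actual `Tagger` API.

```python
import re
from dataclasses import dataclass

# Hypothetical per-source weights; the project's scoring mechanism may differ.
WEIGHTS = {"url_regex": 1.0, "css_selector": 0.8, "keyword": 0.5}

@dataclass
class Rule:
    type: str     # "url_regex", "keyword", or "css_selector"
    pattern: str  # regex, keyword, or CSS selector
    tag: str

def apply_rules(rules, url, title, body_text, soup):
    """Return {tag: score} for every rule that matches the page.

    `soup` is assumed to be a BeautifulSoup document, used only by
    CSS-selector rules.
    """
    scores = {}
    for rule in rules:
        if rule.type == "url_regex":
            matched = re.search(rule.pattern, url) is not None
        elif rule.type == "keyword":
            matched = rule.pattern.lower() in f"{title} {body_text}".lower()
        elif rule.type == "css_selector":
            matched = bool(soup.select(rule.pattern))
        else:
            matched = False
        if matched:
            scores[rule.tag] = scores.get(rule.tag, 0.0) + WEIGHTS[rule.type]
    return scores
```

Tags whose accumulated score clears a chosen threshold would then be persisted with the page.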
- Python 3.10+
- Pip
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd spider
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables:

  Create a `.env` file in the root directory of the project by copying the `.env.example` file. The following variables are available:

  - `SUPABASE_URL`: The URL of your Supabase project.
  - `SUPABASE_KEY`: The API key for your Supabase project.
  - `USE_SUPABASE`: Set to `true` to enable Supabase integration.
  - `USE_ML_TAGGING`: Set to `true` to enable ML-based tagging.
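For orientation, the snippet below shows roughly how these variables could be consumed at startup, assuming the `python-dotenv` and `supabase` (supabase-py) packages are installed. The variable handling is a sketch, not the project's actual startup code.

```python
import os

from dotenv import load_dotenv      # python-dotenv
from supabase import create_client  # supabase-py

load_dotenv()  # read variables from the .env file into the environment

USE_SUPABASE = os.getenv("USE_SUPABASE", "false").lower() == "true"
USE_ML_TAGGING = os.getenv("USE_ML_TAGGING", "false").lower() == "true"

supabase = None
if USE_SUPABASE:
    # Both values must be set when Supabase integration is enabled.
    supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
```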
The crawler can be run in several modes:
To run a single crawl, provide a starting URL:

```bash
python main.py http://example.com
```

To schedule a crawl, use the `--schedule` argument with a cron-like expression or a simple interval:
```bash
# Run once a day at midnight
python main.py --schedule "0 0 * * *" --start-url http://example.com

# Run every 6 hours
python main.py --schedule "6h" --start-url http://example.com
```

To run the crawler in continuous mode, use the `--continuous` flag:
```bash
python main.py --continuous --start-url http://example.com
```

The crawler can be configured using a YAML file. Create a `config.yml` file and use the `--config` argument to specify its path. Command-line arguments override the settings from the file.
Example `config.yml`:
```yaml
max_workers: 8
max_depth: 5
crawl_subdomains: true
log_level: DEBUG
tag_rules:
  - type: keyword
    pattern: python
    tag: python
  - type: css_selector
    pattern: ".post-tag"
    tag: blog-post
```

The following command-line arguments are available:

- `--config`: Path to a YAML configuration file.
- `--max-workers`: The maximum number of concurrent crawl workers to use.
- `--max-depth`: The maximum crawl depth.
- `--include-path`: Path prefix to include in the crawl.
- `--exclude-path`: Path prefix to exclude from the crawl.
- `--crawl-subdomains`: Allow crawling of subdomains.
- `--tag-rule`: A JSON string representing a tagging rule, e.g. `'{"type": "keyword", "pattern": "python", "tag": "python"}'`.
- `--rss-feed`: URL of an RSS feed to use for seed URLs.
- `--schedule`: Cron-like expression or simple interval for scheduling crawls.
- `--continuous`: Run the crawler in continuous mode.
- `--continuous-interval`: Interval in seconds for continuous crawling.
- `--health-check-port`: Port for the health check server.
- `--autoscale`: Enable auto-scaling logic.
- `--log-level`: Set the log level (e.g., DEBUG, INFO, WARNING, ERROR, CRITICAL).
- `--log-file`: Path to a file to log to.
- `--export-csv`: Export crawled data to a CSV file.
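Because command-line values take precedence over the YAML file, the effective settings could be assembled along the following lines. `load_settings` and the subset of flags it parses are illustrative; the project's real argument parser covers the full list above.

```python
import argparse
import json

import yaml  # PyYAML


def load_settings(argv=None):
    """Merge a YAML config with CLI overrides (CLI wins); only a subset of flags shown."""
    parser = argparse.ArgumentParser(description="Web crawler")
    parser.add_argument("--config", help="Path to a YAML configuration file")
    parser.add_argument("--max-workers", type=int)
    parser.add_argument("--max-depth", type=int)
    parser.add_argument("--crawl-subdomains", action="store_true", default=None)
    parser.add_argument("--tag-rule", action="append",
                        help="JSON string describing a tagging rule")
    args = parser.parse_args(argv)

    # Start from the YAML file (if any), then let explicitly passed CLI values win.
    settings = {}
    if args.config:
        with open(args.config) as fh:
            settings.update(yaml.safe_load(fh) or {})
    for key, value in vars(args).items():
        if key != "config" and value is not None:
            settings[key] = value

    # --tag-rule values arrive as JSON strings; parse them into rule dicts.
    if settings.get("tag_rule"):
        settings["tag_rules"] = [json.loads(rule) for rule in settings.pop("tag_rule")]
    return settings


# Example: `python main.py --config config.yml --max-depth 2` would return the
# YAML values with max_depth replaced by 2.
```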
The crawler follows an asynchronous manager-worker architecture:
- `CrawlManager`: The central component responsible for managing the crawl. It maintains a queue of URLs to be crawled, distributes them to the worker tasks, and handles the overall crawl logic.
- `CrawlWorker`: An asynchronous worker that receives a URL from the manager, fetches its content, extracts new links, and extracts data and tags.
- `SupabaseClient`: A client for interacting with the Supabase database.
- `Tagger`: A class responsible for applying tagging rules to the crawled content.
- `Scheduler`: A module for scheduling and running crawls.
- `HealthCheck`: A FastAPI application for monitoring the crawler's status.
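The manager-worker hand-off can be pictured with a few lines of `asyncio` and `httpx`. The queue-based sketch below is an assumption about the design, not a copy of `CrawlManager` and `CrawlWorker`; link extraction, tagging, and persistence are omitted.

```python
import asyncio

import httpx


async def worker(name, queue, client):
    """Pull URLs off the shared queue and fetch them; link extraction is omitted."""
    while True:
        url, depth = await queue.get()
        try:
            response = await client.get(url, follow_redirects=True)
            print(f"[{name}] {response.status_code} {url} (depth={depth})")
            # A real worker would parse the response, apply the tag rules, and
            # enqueue newly discovered links that have not been seen yet.
        except httpx.HTTPError as exc:
            print(f"[{name}] failed {url}: {exc}")
        finally:
            queue.task_done()


async def crawl(start_url, max_workers=4):
    queue = asyncio.Queue()
    await queue.put((start_url, 0))  # (url, depth); a `seen` set would guard re-queuing

    async with httpx.AsyncClient(timeout=10) as client:
        workers = [asyncio.create_task(worker(f"w{i}", queue, client))
                   for i in range(max_workers)]
        await queue.join()  # returns once every queued URL has been processed
        for task in workers:
            task.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(crawl("http://example.com"))
```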
- Coding Style: The project follows the PEP 8 style guide for Python code.
- Testing: The project uses `pytest` for testing. Tests are located in the `tests` directory.
- Database: The project uses Supabase for database storage.
- Dependencies: Project dependencies are managed in the `requirements.txt` file.
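As an illustration of the test layout, a module under `tests/` might look like the sketch below. The `is_allowed` helper is defined inline purely so the example runs on its own; the project's real tests exercise its own modules.

```python
# tests/test_path_filter.py -- self-contained illustration of the pytest layout.
from urllib.parse import urlparse

import pytest


def is_allowed(url, include_prefixes=(), exclude_prefixes=()):
    """Hypothetical helper mirroring --include-path / --exclude-path filtering."""
    path = urlparse(url).path or "/"
    if any(path.startswith(prefix) for prefix in exclude_prefixes):
        return False
    if include_prefixes:
        return any(path.startswith(prefix) for prefix in include_prefixes)
    return True


@pytest.mark.parametrize("url,expected", [
    ("http://example.com/blog/post-1", True),
    ("http://example.com/admin/login", False),
])
def test_path_filtering(url, expected):
    assert is_allowed(url, include_prefixes=("/blog",), exclude_prefixes=("/admin",)) == expected
```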