stygian

High-performance web scraping toolkit for Rust — graph-based execution engine + anti-detection browser automation.



What is stygian?

Stygian is a monorepo containing two complementary Rust crates for building robust, scalable web scraping systems:

stygian-graph

Graph-based scraping engine that treats pipelines as DAGs, with pluggable service modules:

  • Hexagonal architecture — domain core isolated from infrastructure
  • Extreme concurrency — Tokio for I/O, Rayon for CPU-bound tasks
  • AI extraction — Claude, GPT, Gemini, GitHub Copilot, Ollama support
  • Multi-modal — images, PDFs, videos via LLM vision APIs
  • Distributed execution — Redis/Valkey-backed work queues
  • Circuit breaker — graceful degradation when services fail
  • Idempotency — safe retries with deduplication keys
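To make the circuit breaker bullet concrete, here is a minimal std-only sketch of the pattern. It is not stygian-graph's implementation, which this README does not show; the `CircuitBreaker` type, its `allow`/`record` methods, and the threshold and cooldown parameters are all hypothetical names chosen for illustration.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,                  // calls flow normally
    Open { since: Instant }, // calls are rejected until the cooldown elapses
    HalfOpen,                // one trial call is allowed through
}

// Hypothetical breaker: opens after `threshold` consecutive failures.
struct CircuitBreaker {
    state: State,
    failures: u32,
    threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { state: State::Closed, failures: 0, threshold, cooldown }
    }

    /// Returns true if a call may be attempted at time `now`.
    fn allow(&mut self, now: Instant) -> bool {
        if let State::Open { since } = self.state {
            if now.duration_since(since) >= self.cooldown {
                self.state = State::HalfOpen;
            } else {
                return false;
            }
        }
        true
    }

    /// Records the outcome of a call that was allowed through.
    fn record(&mut self, success: bool, now: Instant) {
        if success {
            self.failures = 0;
            self.state = State::Closed;
        } else {
            self.failures += 1;
            // A failed trial call, or too many failures in a row, opens the circuit.
            if self.state == State::HalfOpen || self.failures >= self.threshold {
                self.state = State::Open { since: now };
            }
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut breaker = CircuitBreaker::new(2, Duration::from_secs(10));
    breaker.record(false, t0);
    breaker.record(false, t0);
    println!("allowed right after two failures: {}", breaker.allow(t0));
}
```

Passing `now` in explicitly keeps the sketch deterministic to test; a real service wrapper would more likely read the clock itself.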

stygian-browser

Anti-detection browser automation library for bypassing modern bot protection:

  • Browser pooling — warm pool, sub-100ms acquisition
  • CDP-based — Chrome DevTools Protocol via chromiumoxide
  • Stealth features — navigator spoofing, canvas noise, WebGL randomization
  • Human behavior — Bézier mouse paths, realistic typing
  • Cloudflare/DataDome/PerimeterX — bypass detection layers
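As an illustration of the "Bézier mouse paths" idea, the sketch below samples points along a cubic Bézier curve between two cursor positions. It assumes nothing about stygian-browser's internals: `Point`, `cubic_bezier`, and `mouse_path` are hypothetical names, and how control points are chosen (typically with some randomness) is left to the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Evaluates a cubic Bézier curve with control points p0..p3 at parameter t in [0, 1].
fn cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: f64) -> Point {
    let u = 1.0 - t;
    // Bernstein basis weights; they sum to 1 for any t.
    let (b0, b1, b2, b3) = (u * u * u, 3.0 * u * u * t, 3.0 * u * t * t, t * t * t);
    Point {
        x: b0 * p0.x + b1 * p1.x + b2 * p2.x + b3 * p3.x,
        y: b0 * p0.y + b1 * p1.y + b2 * p2.y + b3 * p3.y,
    }
}

/// Samples `steps + 1` evenly spaced parameter values from `start` to `end`,
/// bending the path through the two control points `c1` and `c2`.
fn mouse_path(start: Point, end: Point, c1: Point, c2: Point, steps: usize) -> Vec<Point> {
    let steps = steps.max(1); // avoid dividing by zero
    (0..=steps)
        .map(|i| cubic_bezier(start, c1, c2, end, i as f64 / steps as f64))
        .collect()
}

fn main() {
    let start = Point { x: 0.0, y: 0.0 };
    let end = Point { x: 100.0, y: 50.0 };
    let path = mouse_path(start, end, Point { x: 30.0, y: -20.0 }, Point { x: 80.0, y: 70.0 }, 10);
    for p in &path {
        println!("{:.1}, {:.1}", p.x, p.y);
    }
}
```

The curve starts exactly at `start` and ends exactly at `end`; the control points only shape the arc in between, which is what makes the motion look less mechanical than a straight line.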

Quick Start

Graph Scraping Pipeline

use stygian_graph::{PipelineBuilder, adapters::HttpAdapter};
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pipeline = PipelineBuilder::new()
        .node("fetch", HttpAdapter::new())
        // `MyParserAdapter` is a placeholder for your own adapter type, defined elsewhere.
        .node("parse", MyParserAdapter)
        .edge("fetch", "parse")
        .build()?;

    let results = pipeline
        .execute(json!({"url": "https://example.com"}))
        .await?;
    
    println!("Results: {:?}", results);
    Ok(())
}
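The pipeline above is declared as named nodes plus directed edges. As a rough illustration of how such a graph can be put into a valid execution order, the std-only sketch below applies Kahn's algorithm. It is not stygian-graph's scheduler, which this README does not describe; `execution_order` is a hypothetical function name.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Returns the nodes in an order where each node follows all of its upstream
/// nodes, or `None` if an edge names an unknown node or the graph has a cycle.
fn execution_order(nodes: &[&str], edges: &[(&str, &str)]) -> Option<Vec<String>> {
    // Number of incoming edges per node; BTreeMap keeps iteration deterministic.
    let mut indegree: BTreeMap<&str, usize> = nodes.iter().map(|&n| (n, 0)).collect();
    let mut children: BTreeMap<&str, Vec<&str>> = BTreeMap::new();

    for &(from, to) in edges {
        if !indegree.contains_key(from) {
            return None;
        }
        *indegree.get_mut(to)? += 1;
        children.entry(from).or_default().push(to);
    }

    // Start with every node that has no upstream dependency.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(n, _)| *n)
        .collect();

    let mut order = Vec::with_capacity(nodes.len());
    while let Some(node) = ready.pop_front() {
        order.push(node.to_string());
        for &child in children.get(node).into_iter().flatten() {
            let d = indegree.get_mut(child)?;
            *d -= 1;
            if *d == 0 {
                ready.push_back(child);
            }
        }
    }

    // If a cycle exists, its nodes never reach in-degree zero and are missing here.
    if order.len() == nodes.len() { Some(order) } else { None }
}

fn main() {
    let order = execution_order(&["fetch", "parse"], &[("fetch", "parse")]);
    println!("{:?}", order);
}
```

Nodes that become ready at the same time have no dependency on each other, which is what would let an engine run them concurrently.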

Browser Automation

// `WaitUntil` is used below; it is assumed here to be exported at the crate root.
use stygian_browser::{BrowserConfig, BrowserPool, WaitUntil};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pool = BrowserPool::new(BrowserConfig::default()).await?;
    let handle = pool.acquire().await?;
    
    let mut page = handle.browser().new_page().await?;
    page.navigate(
        "https://example.com",
        WaitUntil::Selector("body".to_string()),
        Duration::from_secs(30),
    ).await?;
    
    let html = page.content().await?;
    println!("Page loaded: {} bytes", html.len());
    
    handle.release().await;
    Ok(())
}

Installation

Add to your Cargo.toml:

[dependencies]
stygian-graph = "0.2"
stygian-browser = "0.2"  # optional, for JavaScript rendering
tokio = { version = "1", features = ["full"] }

Architecture

stygian-graph: Hexagonal (Ports & Adapters)

Domain Layer (business logic)
    ↑
Ports (trait definitions)
    ↑
Adapters (HTTP, browser, AI providers, storage)

  • Zero I/O dependencies in domain layer
  • Dependency inversion — adapters depend on ports, not vice versa
  • Extreme testability — mock any external system
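The layering above can be sketched in a few lines. This is a simplified, synchronous illustration of ports and adapters in general, not stygian-graph's actual traits (which are not shown in this README and are likely async); `FetchPort`, `title_of`, and `MockFetcher` are hypothetical names.

```rust
use std::collections::HashMap;

/// Port: the only thing the domain layer knows about fetching pages.
trait FetchPort {
    fn fetch(&self, url: &str) -> Result<String, String>;
}

/// Domain logic: depends on the port, never on a concrete HTTP client.
/// Returns the trimmed contents of the first <title> element, if any.
fn title_of(fetcher: &dyn FetchPort, url: &str) -> Result<Option<String>, String> {
    let html = fetcher.fetch(url)?;
    let start = match html.find("<title>") {
        Some(i) => i + "<title>".len(),
        None => return Ok(None),
    };
    let end = match html[start..].find("</title>") {
        Some(i) => start + i,
        None => return Ok(None),
    };
    Ok(Some(html[start..end].trim().to_string()))
}

/// Test adapter: serves canned pages from memory, so no network is needed.
struct MockFetcher {
    pages: HashMap<String, String>,
}

impl FetchPort for MockFetcher {
    fn fetch(&self, url: &str) -> Result<String, String> {
        self.pages
            .get(url)
            .cloned()
            .ok_or_else(|| format!("no page for {}", url))
    }
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(
        "https://example.com".to_string(),
        "<html><title> Example </title></html>".to_string(),
    );
    let mock = MockFetcher { pages };
    println!("{:?}", title_of(&mock, "https://example.com"));
}
```

Because `title_of` only sees the trait, swapping the mock for a real HTTP adapter requires no change to the domain code, which is the point of the dependency inversion bullet.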

stygian-browser: Modular

  • Self-contained modules with clear interfaces
  • Pool management with resource limits
  • Graceful degradation on browser unavailability
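A pool with a hard resource limit can be sketched as follows. This is a deliberately simplified, single-threaded illustration; the real `BrowserPool` is async and its internals are not described here. `BoundedPool` and its methods are hypothetical names.

```rust
/// Hypothetical pool that never hands out more than `max_size` items at once.
struct BoundedPool<T> {
    idle: Vec<T>,    // warm items ready for reuse
    in_use: usize,   // items currently handed out
    max_size: usize, // hard cap on items handed out at once
}

impl<T> BoundedPool<T> {
    fn new(max_size: usize) -> Self {
        Self { idle: Vec::new(), in_use: 0, max_size }
    }

    /// Reuses a warm item if one is idle, creates a new one if under the limit,
    /// and otherwise returns `None` so the caller can degrade gracefully.
    fn acquire(&mut self, create: impl FnOnce() -> T) -> Option<T> {
        let item = match self.idle.pop() {
            Some(warm) => warm,
            None if self.in_use < self.max_size => create(),
            None => return None,
        };
        self.in_use += 1;
        Some(item)
    }

    /// Returns an item to the pool so a later `acquire` can reuse it.
    fn release(&mut self, item: T) {
        self.in_use = self.in_use.saturating_sub(1);
        self.idle.push(item);
    }
}

fn main() {
    let mut pool = BoundedPool::new(1);
    let first = pool.acquire(|| String::from("browser-1"));
    let second = pool.acquire(|| String::from("browser-2"));
    println!("first: {:?}, second: {:?}", first, second);
}
```

Returning `None` at the limit is one possible policy; an async pool would more likely make the caller wait until an item is released.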

Project Structure

stygian/
├── crates/
│   ├── stygian-graph/      # Scraping engine
│   └── stygian-browser/    # Browser automation
├── examples/                # Example pipelines
├── docs/                    # Architecture docs
└── assets/                  # Diagrams, images

Development

Setup

# Install Rust 1.94.0+
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build workspace
cargo build --workspace

# Run tests
cargo test --workspace

# Run clippy
cargo clippy --workspace -- -D warnings

Testing

# Unit tests
cargo test --lib

# Integration tests
cargo test --test '*'

# All tests (browser integration tests require Chrome)
cargo test --all-features

# Measure coverage (requires cargo-tarpaulin)
cargo tarpaulin --workspace --all-features --ignore-tests --out Lcov

stygian-graph has strong unit coverage across its domain, ports, and adapter layers. stygian-browser coverage is limited by the Chrome CDP requirement: every test that launches a real browser is marked #[ignore = "requires Chrome"], while the pure-logic tests are fully covered.


Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Commit Convention

Use Conventional Commits:

  • feat: — new feature
  • fix: — bug fix
  • refactor: — code restructuring
  • test: — test additions/changes
  • docs: — documentation updates

License

Licensed under the GNU Affero General Public License v3.0 (AGPL-3.0-only).

This means any modifications or derivative works must also be released under the AGPL-3.0, including when the software is used to provide a network service.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you shall be licensed under the AGPL-3.0-only, without any additional terms or conditions.


Acknowledgments

Built with:


Status: Active development | Version 0.2.0 | Rust 2024 edition | Linux + macOS

For detailed documentation, see the project docs site.