Commit 42aee17

feat(graph): add RestApiAdapter and CloudflareCrawlAdapter (v0.1.15) (#16)
* feat(graph): add RestApiAdapter and CloudflareCrawlAdapter
  - RestApiAdapter: flexible REST JSON API scraping with 5 auth schemes (Bearer, Basic, API key header/query, none), 4 pagination strategies (none, offset, cursor, RFC 8288 Link header), dot-path response extraction, and configurable exponential-backoff retry; 24 unit tests
  - CloudflareCrawlAdapter: managed whole-site crawling via Cloudflare Browser Rendering /crawl endpoint; feature-gated behind cloudflare-crawl
  - examples/rest-api-scrape.toml: unauthenticated GET, Bearer + link_header pagination, and api_key_header + cursor pagination patterns
  - fix: resolve all clippy -D warnings in rest_api.rs and cloudflare_crawl.rs
  - docs: update adapters.md, graphql-plugins.md, configuration.md, env-vars.md
  - chore: bump version to 0.1.15, update CHANGELOG and README
* fix: update quinn-proto to 0.11.14 (RUSTSEC-2026-0037), fix rustdoc links
1 parent 281ea82 commit 42aee17

16 files changed, +2397 -49 lines changed


CHANGELOG.md

Lines changed: 12 additions & 0 deletions
@@ -7,6 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.1.15] - 2026-03-13
11+
12+
### Added
13+
14+
- `stygian-graph`: `RestApiAdapter` — flexible REST JSON API adapter with 5 auth schemes (Bearer, Basic, API key header/query, none), 4 pagination strategies (none, offset, cursor, RFC 8288 Link header), dot-path JSON response extraction, configurable retries with exponential backoff, and 24 unit tests; registered as `"rest-api"`
15+
- `stygian-graph`: `CloudflareCrawlAdapter` — delegates whole-site crawling to the Cloudflare Browser Rendering `/crawl` endpoint (open beta); polls until complete, aggregates page results, configurable poll interval and job timeout; gated behind `cloudflare-crawl` feature flag
16+
- `examples/rest-api-scrape.toml` — example pipeline demonstrating unauthenticated GET, Bearer-auth + Link-header pagination, and API-key + cursor pagination patterns
17+
18+
### Fixed
19+
20+
- `stygian-graph`: resolved all `clippy -D warnings` lint failures in `rest_api.rs` and `cloudflare_crawl.rs`: `indexing_slicing`, `map_unwrap_or`, `manual_map`, `if_not_else`, `option_if_let_else`, `unnecessary_map_or`, `cast_possible_truncation`, `ignore_without_reason`, and `panic` in tests
21+
1022
## [0.1.14] - 2026-03-04
1123

1224
### Fixed

Cargo.lock

Lines changed: 4 additions & 4 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ members = [
66
]
77

88
[workspace.package]
9-
version = "0.1.14"
9+
version = "0.1.15"
1010
edition = "2024"
1111
rust-version = "1.93.1"
1212
authors = ["Nick Campbell <s0ma@protonmail.com>"]

README.md

Lines changed: 1 addition & 1 deletion
@@ -230,6 +230,6 @@ Built with:
230230

231231
---
232232

233-
**Status**: Active development | Version 0.1.1 | Rust 2024 edition | 842 tests | Linux + macOS
233+
**Status**: Active development | Version 0.1.15 | Rust 2024 edition | 694 tests | Linux + macOS
234234

235235
For detailed documentation, see the [project docs site](https://greysquirr3l.github.io/stygian).

book/src/browser/configuration.md

Lines changed: 11 additions & 3 deletions
@@ -50,7 +50,7 @@ let config = BrowserConfig::builder()
5050
| Field | Type | Default | Description |
5151
|---|---|---|---|
5252
| `headless` | `bool` | `true` | Run without visible window |
53-
| `headless_mode` | `HeadlessMode` | `New` | `New` = `--headless=new` (same renderer as headed Chrome); `Legacy` = classic `--headless` (Chromium < 112 only) |
53+
| `headless_mode` | `HeadlessMode` | `New` | `New` = `--headless=new` (full Chromium rendering, default since Chrome 112, **only mode since Chrome 132**); `Legacy` = `chrome-headless-shell` / pre-112 `--headless` |
5454
| `window_size` | `Option<(u32, u32)>` | `(1920, 1080)` | Browser viewport dimensions |
5555
| `chrome_path` | `Option<PathBuf>` | auto-detect | Path to Chrome/Chromium binary |
5656
| `stealth_level` | `StealthLevel` | `Advanced` | Anti-detection level |
@@ -92,7 +92,7 @@ All config values can be overridden without touching source code:
9292
|---|---|---|
9393
| `STYGIAN_CHROME_PATH` | auto-detect | Path to Chrome/Chromium binary |
9494
| `STYGIAN_HEADLESS` | `true` | Set `false` for headed mode |
95-
| `STYGIAN_HEADLESS_MODE` | `new` | `new` (`--headless=new`) or `legacy` (classic `--headless`) |
95+
| `STYGIAN_HEADLESS_MODE` | `new` | `new` (`--headless=new`) or `legacy` (`chrome-headless-shell`; old `--headless` removed in Chrome 132) |
9696
| `STYGIAN_STEALTH_LEVEL` | `advanced` | `none`, `basic`, `advanced` |
9797
| `STYGIAN_POOL_MIN` | `2` | Minimum warm browsers |
9898
| `STYGIAN_POOL_MAX` | `10` | Maximum concurrent browsers |
@@ -160,9 +160,17 @@ let config = BrowserConfig::builder()
160160
.build();
161161
```
162162

163-
For Chromium < 112 (rare), fall back to the legacy mode:
163+
For Chromium ≥ 112 (all modern Chrome / Chromium builds), `New` is the right
164+
choice. `Legacy` targets are rare: pre-112 Chromium or the separately distributed
165+
`chrome-headless-shell` binary for lightweight CI workloads where full rendering
166+
fidelity is not required.
167+
168+
> **Note:** As of Chrome 132 the old `--headless` flag is removed entirely.
169+
> `HeadlessMode::Legacy` now maps to `chrome-headless-shell` semantics — avoid it
170+
> unless you are explicitly targeting that binary.
164171
165172
```rust,no_run
173+
// Only needed for Chromium < 112 or chrome-headless-shell
166174
let config = BrowserConfig::builder()
167175
.headless_mode(HeadlessMode::Legacy)
168176
.build();

book/src/graph/adapters.md

Lines changed: 231 additions & 0 deletions
@@ -36,6 +36,136 @@ let adapter = HttpAdapter::with_config(HttpConfig {
3636

3737
---
3838

39+
## REST API Adapter
40+
41+
Purpose-built for structured JSON REST APIs. Handles authentication, automatic
42+
multi-strategy pagination, JSON response extraction, and retry — without the caller
43+
needing to manage any of that manually.
44+
45+
```rust
46+
use stygian_graph::adapters::rest_api::{RestApiAdapter, RestApiConfig};
47+
use stygian_graph::ports::{ScrapingService, ServiceInput};
48+
use serde_json::json;
49+
use std::time::Duration;
50+
51+
let adapter = RestApiAdapter::with_config(RestApiConfig {
52+
timeout: Duration::from_secs(20),
53+
max_retries: 3,
54+
..Default::default()
55+
});
56+
57+
let input = ServiceInput {
58+
url: "https://api.github.com/repos/rust-lang/rust/issues".to_string(),
59+
params: json!({
60+
"auth": { "type": "bearer", "token": "${env:GITHUB_TOKEN}" },
61+
"query": { "state": "open", "per_page": "100" },
62+
"pagination": { "strategy": "link_header", "max_pages": 10 },
63+
"response": { "data_path": "" }
64+
}),
65+
};
66+
// let output = adapter.execute(input).await?;
67+
```
68+
69+
**Registered service name**: `"rest-api"`
70+
71+
### Config fields
72+
73+
| Field | Default | Description |
74+
|---|---|---|
75+
| `timeout` | 30 s | Per-request timeout |
76+
| `max_retries` | 3 | Retry attempts on transient errors (`429`, `5xx`, network) |
77+
| `retry_base_delay` | 1 s | Base for exponential backoff |
78+
| `proxy_url` | `None` | HTTP/HTTPS/SOCKS5 proxy URL |
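
The `max_retries` and `retry_base_delay` fields above describe an exponential backoff. The following is a minimal sketch of how such a schedule could be computed, not the adapter's actual implementation: the helper names `is_transient_status` and `backoff_delay`, the doubling factor, and the 30 s cap are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical helper: true for the statuses the table calls transient
/// (HTTP 429 and any 5xx).
fn is_transient_status(status: u16) -> bool {
    status == 429 || (500..=599).contains(&status)
}

/// Hypothetical helper: delay before retry number `attempt` (0-based).
/// Assumes the delay doubles on every attempt and is capped at `max_delay`;
/// the source only says "exponential backoff", so both are assumptions.
fn backoff_delay(base: Duration, attempt: u32, max_delay: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).unwrap_or(max_delay).min(max_delay)
}

fn main() {
    let base = Duration::from_secs(1); // mirrors the documented 1 s default
    let cap = Duration::from_secs(30); // assumed cap, not from the docs
    for attempt in 0..3 {
        println!("retry {attempt}: wait {:?}", backoff_delay(base, attempt, cap));
    }
    println!("503 transient? {}", is_transient_status(503));
}
```

A real client would usually add random jitter to each delay so that many callers do not retry in lockstep.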
79+
80+
### `ServiceInput.params` contract
81+
82+
| Param | Required | Default | Description |
83+
|---|---|---|---|
84+
| `method` | no | `"GET"` | `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `HEAD` |
85+
| `body` | no | none | JSON body for `POST`/`PUT`/`PATCH` |
86+
| `body_raw` | no | none | Raw string body (takes precedence over `body`) |
87+
| `headers` | no | none | Extra request headers object |
88+
| `query` | no | none | Extra query string parameters object |
89+
| `accept` | no | `"application/json"` | `Accept` header |
90+
| `auth` | no | none | Authentication object (see below) |
91+
| `response.data_path` | no | full body | Dot path into the JSON response to extract |
92+
| `response.collect_as_array` | no | `false` | Force multi-page results into a JSON array |
93+
| `pagination.strategy` | no | `"none"` | `"none"`, `"offset"`, `"cursor"`, `"link_header"` |
94+
| `pagination.max_pages` | no | `1` | Maximum pages to fetch |
95+
96+
### Authentication
97+
98+
```toml
99+
# Bearer token
100+
[nodes.params.auth]
101+
type = "bearer"
102+
token = "${env:API_TOKEN}"
103+
104+
# HTTP Basic
105+
[nodes.params.auth]
106+
type = "basic"
107+
username = "${env:API_USER}"
108+
password = "${env:API_PASS}"
109+
110+
# API key in header
111+
[nodes.params.auth]
112+
type = "api_key_header"
113+
header = "X-Api-Key"
114+
key = "${env:API_KEY}"
115+
116+
# API key in query string
117+
[nodes.params.auth]
118+
type = "api_key_query"
119+
param = "api_key"
120+
key = "${env:API_KEY}"
121+
```
122+
123+
### Pagination strategies
124+
125+
| Strategy | How it works | Best for |
126+
|---|---|---|
127+
| `"none"` | Single request | Simple endpoints |
128+
| `"offset"` | Increments `page_param` from `start_page` | REST APIs with `?page=N` |
129+
| `"cursor"` | Extracts next cursor from `cursor_field` (dot path), sends as `cursor_param` | GraphQL-REST hybrids, Stripe-style |
130+
| `"link_header"` | Follows RFC 8288 `Link: <url>; rel="next"` | GitHub API, GitLab API |
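
The `"link_header"` strategy in the table above depends on finding the `rel="next"` target in a `Link` response header. A rough sketch of that lookup follows; `next_link` is a hypothetical helper written for this illustration, not the adapter's code. It is deliberately simplified: it splits on `,` and `;`, so it would mishandle URLs that contain those characters.

```rust
/// Hypothetical helper: returns the URL of the `rel="next"` entry in an
/// RFC 8288 `Link` header value, or `None` when there is no next page.
fn next_link(header_value: &str) -> Option<String> {
    for entry in header_value.split(',') {
        let mut parts = entry.split(';');
        // The first part is the link target, written as `<url>`.
        let Some(target) = parts.next().map(str::trim) else { continue };
        let Some(url) = target.strip_prefix('<').and_then(|t| t.strip_suffix('>')) else {
            continue;
        };
        // The remaining parts are parameters such as `rel="next"`.
        let is_next = parts.any(|param| {
            let Some((name, value)) = param.split_once('=') else { return false };
            name.trim().eq_ignore_ascii_case("rel")
                && value
                    .trim()
                    .trim_matches('"')
                    .split_whitespace()
                    .any(|rel| rel.eq_ignore_ascii_case("next"))
        });
        if is_next {
            return Some(url.to_string());
        }
    }
    None
}

fn main() {
    // Example header shape; the host and paths are placeholders.
    let header = r#"<https://api.example.com/items?page=2>; rel="next", <https://api.example.com/items?page=5>; rel="last""#;
    println!("{:?}", next_link(header));
}
```

A pagination loop would call this after each response and stop when it returns `None` or when `max_pages` is reached.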
131+
132+
#### Offset example
133+
134+
```toml
135+
[nodes.params.pagination]
136+
strategy = "offset"
137+
page_param = "page"
138+
page_size_param = "per_page"
139+
page_size = 100
140+
start_page = 1
141+
max_pages = 20
142+
```
143+
144+
#### Cursor example
145+
146+
```toml
147+
[nodes.params.pagination]
148+
strategy = "cursor"
149+
cursor_param = "after"
150+
cursor_field = "meta.next_cursor"
151+
max_pages = 50
152+
```
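
Both `cursor_field` (for example `"meta.next_cursor"`) and `response.data_path` are dot paths into the JSON response. The sketch below shows one way such a path could be resolved. It uses a small hand-written `Json` enum as a stand-in for a real JSON value type so that it runs without dependencies; `Json`, `extract`, and `sample_body` are hypothetical names, and treating numeric segments as array indices is an assumption the source does not confirm.

```rust
use std::collections::BTreeMap;

/// Just enough of a JSON tree for the illustration (stand-in type).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Hypothetical helper: walks `path` segment by segment.
/// An empty path returns the whole document, matching the documented
/// "full body" default for `response.data_path`.
fn extract<'a>(root: &'a Json, path: &str) -> Option<&'a Json> {
    if path.is_empty() {
        return Some(root);
    }
    path.split('.').try_fold(root, |node, segment| match node {
        Json::Object(map) => map.get(segment),
        // Assumption: a numeric segment indexes into an array.
        Json::Array(items) => segment.parse::<usize>().ok().and_then(|i| items.get(i)),
        Json::Str(_) => None,
    })
}

/// Builds `{"items": ["first"], "meta": {"next_cursor": "abc123"}}`.
fn sample_body() -> Json {
    let meta = Json::Object(BTreeMap::from([(
        "next_cursor".to_string(),
        Json::Str("abc123".to_string()),
    )]));
    Json::Object(BTreeMap::from([
        ("meta".to_string(), meta),
        ("items".to_string(), Json::Array(vec![Json::Str("first".to_string())])),
    ]))
}

fn main() {
    let body = sample_body();
    println!("{:?}", extract(&body, "meta.next_cursor"));
    println!("{:?}", extract(&body, "meta.missing"));
}
```

With the cursor strategy, a missing or unresolvable `cursor_field` is the natural signal that there are no further pages.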
153+
154+
### Output
155+
156+
`ServiceOutput.data` — pretty-printed JSON string of the extracted data.
157+
158+
`ServiceOutput.metadata`:
159+
160+
```json
161+
{
162+
"url": "https://...",
163+
"page_count": 3
164+
}
165+
```
166+
167+
---
168+
39169
## Browser Adapter
40170

41171
Delegates to `stygian-browser` for JavaScript-rendered pages. Requires the `browser`
@@ -260,3 +390,104 @@ let service = GraphQlService::new(GraphQlConfig::default(), Some(Arc::new(regist
260390

261391
See the [GraphQL Plugins](./graphql-plugins.md) page for the full builder reference,
262392
`AuthPort` implementation guide, proactive cost throttling, and custom plugin examples.
393+
394+
---
395+
396+
## Cloudflare Browser Rendering adapter
397+
398+
Submits a multi-page crawl job to the [Cloudflare Browser Rendering API](https://developers.cloudflare.com/browser-rendering/),
399+
polls until it completes, and returns the aggregated content. All page rendering is done
400+
inside Cloudflare's infrastructure — no local Chrome binary needed.
401+
402+
**Feature flag**: `cloudflare-crawl` (not included in `default` or `browser`; add it
403+
explicitly or use `full`).
404+
405+
### Quick start
406+
407+
```toml
408+
# Cargo.toml
409+
[dependencies]
410+
stygian-graph = { version = "0.1", features = ["cloudflare-crawl"] }
411+
```
412+
413+
```rust
414+
use stygian_graph::adapters::cloudflare_crawl::{
415+
CloudflareCrawlAdapter, CloudflareCrawlConfig,
416+
};
417+
use std::time::Duration;
418+
419+
let adapter = CloudflareCrawlAdapter::with_config(CloudflareCrawlConfig {
420+
poll_interval: Duration::from_secs(3),
421+
job_timeout: Duration::from_secs(120),
422+
..Default::default()
423+
});
424+
```
425+
426+
**Registered service name**: `"cloudflare-crawl"`
427+
428+
### `ServiceInput.params` contract
429+
430+
All per-request options are passed via `ServiceInput.params`. `account_id` and
431+
`api_token` are **required**; the rest are optional and forwarded verbatim to the
432+
Cloudflare API.
433+
434+
| Param key | Required | Default | Description |
435+
|---|---|---|---|
436+
| `account_id` | yes | none | Cloudflare account ID |
437+
| `api_token` | yes | none | Cloudflare API token with Browser Rendering permission |
438+
| `output_format` | no | `"markdown"` | `"markdown"`, `"html"`, or `"raw"` |
439+
| `max_depth` | no | API default | Maximum crawl depth from the seed URL |
440+
| `max_pages` | no | API default | Maximum pages to crawl |
441+
| `url_pattern` | no | API default | Regex or glob restricting which URLs are followed |
442+
| `modified_since` | no | API default | ISO-8601 timestamp; skip pages not modified since this time |
443+
| `max_age_seconds` | no | API default | Skip cached pages older than this many seconds |
444+
| `static_mode` | no | `false` | Set `"true"` to skip JS execution (faster, static HTML only) |
445+
446+
### Config fields
447+
448+
| Field | Default | Description |
449+
|---|---|---|
450+
| `poll_interval` | 2 s | How often to poll for job completion |
451+
| `job_timeout` | 5 min | Hard timeout per crawl job; returns `ServiceError::Timeout` if exceeded |
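
The interplay of `poll_interval` and `job_timeout` can be sketched as a poll-until-done loop. This is a blocking, standard-library-only illustration under stated assumptions, not the adapter's code (the adapter's `execute` call appears to be async). `JobState`, `PollError`, and `poll_until_done` are hypothetical names, and the status check is passed in as a closure instead of calling the Cloudflare API.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical job status reported by one poll.
#[derive(Debug, PartialEq)]
enum JobState {
    Pending,
    Done(String),
}

/// Hypothetical error type; stands in for a timeout error variant.
#[derive(Debug, PartialEq)]
enum PollError {
    Timeout,
}

/// Calls `check` every `poll_interval` until it reports `Done`, or gives up
/// once the next wait would pass the `job_timeout` deadline.
fn poll_until_done<F>(
    mut check: F,
    poll_interval: Duration,
    job_timeout: Duration,
) -> Result<String, PollError>
where
    F: FnMut() -> JobState,
{
    let deadline = Instant::now() + job_timeout;
    loop {
        if let JobState::Done(result) = check() {
            return Ok(result);
        }
        if Instant::now() + poll_interval > deadline {
            return Err(PollError::Timeout);
        }
        sleep(poll_interval);
    }
}

fn main() {
    // Simulated job that finishes on the third poll.
    let mut polls = 0;
    let outcome = poll_until_done(
        || {
            polls += 1;
            if polls >= 3 { JobState::Done("crawl finished".to_string()) } else { JobState::Pending }
        },
        Duration::from_millis(10),
        Duration::from_secs(1),
    );
    println!("{outcome:?}");
}
```

A shorter `poll_interval` returns results sooner at the cost of more API calls; `job_timeout` bounds the total wait either way.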
452+
453+
### Output
454+
455+
`ServiceOutput.data` contains the page content of all crawled pages joined by newlines.
456+
`ServiceOutput.metadata` is a JSON object:
457+
458+
```json
459+
{
460+
"job_id": "some-uuid",
461+
"pages": 12,
462+
"url_count": 12
463+
}
464+
```
465+
466+
### TOML pipeline usage
467+
468+
```toml
469+
[[nodes]]
470+
id = "crawl"
471+
type = "scrape"
472+
target = "https://docs.example.com"
473+
474+
[nodes.params]
475+
account_id = "${env:CF_ACCOUNT_ID}"
476+
api_token = "${env:CF_API_TOKEN}"
477+
output_format = "markdown"
478+
max_depth = "3"
479+
max_pages = "50"
480+
url_pattern = "https://docs.example.com/**"
481+
482+
[nodes.service]
483+
name = "cloudflare-crawl"
484+
```
485+
486+
### Error mapping
487+
488+
| Condition | `StygianError` variant |
489+
|---|---|
490+
| Missing `account_id` or `api_token` | `ServiceError::Unavailable` |
491+
| Cloudflare API non-2xx | `ServiceError::Unavailable` (with CF error code) |
492+
| Job still pending after `job_timeout` | `ServiceError::Timeout` |
493+
| Unexpected response shape | `ServiceError::InvalidResponse` |
