Commit f97d5c7

Add MkDocs documentation
1 parent d3429aa commit f97d5c7

16 files changed, +947 -0 lines changed

docs/api.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# Internal API reference (for contributors)

> This is not a public library API. It is provided to help contributors and maintainers.

## `wayparam.http`

### `HttpConfig`
Fields:
- `timeout_s: float` (default: 30.0)
- `retries: int` (default: 4)
- `backoff_base_s: float`
- `max_backoff_s: float`
- `user_agent: Optional[str]`
- `proxy: Optional[str]`

### `get_text(client, url, params, config) -> str`
Performs a GET request and returns response text, with retry/backoff behavior.

Raises `RuntimeError` after retries with a message like:
- `HTTP request failed after retries (status=503): ...`
- `HTTP request failed after retries (no-status): ...`
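
The retry/backoff behavior is not specified further on this page. The following is a minimal sketch, assuming exponential backoff capped at `max_backoff_s`. The formula, the defaults chosen for `backoff_base_s` and `max_backoff_s`, and the helper names `backoff_delay` and `call_with_retries` are illustrative assumptions, not the actual contents of `wayparam.http`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HttpConfig:
    timeout_s: float = 30.0
    retries: int = 4
    backoff_base_s: float = 0.5   # assumed default, not documented above
    max_backoff_s: float = 8.0    # assumed default, not documented above
    user_agent: Optional[str] = None
    proxy: Optional[str] = None


def backoff_delay(attempt: int, config: HttpConfig) -> float:
    """Delay before retry `attempt` (0-based): base * 2**attempt, capped."""
    return min(config.max_backoff_s, config.backoff_base_s * (2 ** attempt))


def call_with_retries(
    fetch: Callable[[], str],
    config: HttpConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it succeeds or the retry budget is exhausted."""
    last_error: Optional[Exception] = None
    for attempt in range(config.retries + 1):  # first try plus `retries` retries
        try:
            return fetch()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < config.retries:
                sleep(backoff_delay(attempt, config))
    raise RuntimeError(
        f"HTTP request failed after retries (no-status): {last_error}"
    )
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in an async client the equivalent would be `await asyncio.sleep(...)`.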

## `wayparam.wayback`

### `CdxOptions`
Fields:
- `include_subdomains: bool`
- `collapse: str | None` (default: `urlkey`)
- `from_ts: str | None`
- `to_ts: str | None`
- `limit: int`
- `filters: list[str] | None`

### `iter_original_urls(domain, client, http_config, rate_limiter, opt) -> AsyncIterator[str]`
Yields “original” URLs from the CDX API, handling paging/resumeKey.
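
To illustrate how the `CdxOptions` fields might map onto a CDX query, here is a hedged sketch. The parameter names (`url`, `matchType`, `fl`, `collapse`, `from`, `to`, `limit`, `filter`, `showResumeKey`, `resumeKey`) are those of the Wayback CDX server API; the helper `build_cdx_params`, the field defaults other than `collapse`, and the exact mapping are assumptions and may differ from what `wayparam.wayback` does.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CdxOptions:
    include_subdomains: bool = False       # assumed default
    collapse: Optional[str] = "urlkey"
    from_ts: Optional[str] = None
    to_ts: Optional[str] = None
    limit: int = 1000                      # assumed default
    filters: Optional[list[str]] = None


def build_cdx_params(
    domain: str, opt: CdxOptions, resume_key: Optional[str] = None
) -> list[tuple[str, str]]:
    """Build CDX query parameters as (name, value) pairs; `filter` may repeat."""
    params = [
        ("url", domain),
        # "domain" also matches subdomains, "host" matches only this host
        ("matchType", "domain" if opt.include_subdomains else "host"),
        ("fl", "original"),            # only the original URL field is needed
        ("limit", str(opt.limit)),
        ("showResumeKey", "true"),     # ask the server for a paging key
    ]
    if opt.collapse:
        params.append(("collapse", opt.collapse))
    if opt.from_ts:
        params.append(("from", opt.from_ts))
    if opt.to_ts:
        params.append(("to", opt.to_ts))
    for expr in opt.filters or []:
        params.append(("filter", expr))
    if resume_key:
        params.append(("resumeKey", resume_key))  # continue a previous page
    return params
```

A paging loop would call this once without `resume_key`, then repeat with the key returned by the server until no key comes back.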

## `wayparam.normalize`

### `NormalizeOptions`
Fields:
- `placeholder: str`
- `keep_values: bool`
- `only_params: bool`
- `drop_tracking: bool`
- `drop_empty: bool`
- `sort_params: bool`

### `canonicalize_url(url, opt) -> str | None`
Returns a canonicalized URL or `None` if filtered out or invalid.
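
As an illustration of what each `NormalizeOptions` flag could mean, the sketch below implements one plausible reading with the standard library. The defaults, the `TRACKING_PARAMS` set, and the interpretation of `only_params` (drop URLs with no query parameters) are assumptions for the example, not the documented behavior of `wayparam.normalize`.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters; the real module may use a different one.
TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "gclid", "fbclid",
}


@dataclass(frozen=True)
class NormalizeOptions:
    placeholder: str = "FUZZ"    # assumed default
    keep_values: bool = False
    only_params: bool = True
    drop_tracking: bool = True
    drop_empty: bool = False
    sort_params: bool = True


def canonicalize_url(url: str, opt: NormalizeOptions) -> str | None:
    """Return a normalized URL, or None if it is invalid or filtered out."""
    try:
        parts = urlsplit(url)
    except ValueError:  # e.g. malformed IPv6 host
        return None
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None

    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if opt.drop_tracking:
        pairs = [(k, v) for k, v in pairs if k.lower() not in TRACKING_PARAMS]
    if opt.drop_empty:
        pairs = [(k, v) for k, v in pairs if v != ""]
    if opt.only_params and not pairs:
        return None  # nothing parameterized left to report
    if not opt.keep_values:
        pairs = [(k, opt.placeholder) for k, _ in pairs]
    if opt.sort_params:
        pairs = sorted(pairs)  # stable output regardless of archive order

    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(pairs), "")
    )
```

With these assumed defaults, `https://Example.com/search?q=cats&utm_source=x&page=2` becomes `https://example.com/search?page=FUZZ&q=FUZZ`.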

## `wayparam.filters`

### `FilterOptions`
Fields:
- `ext_blacklist: set[str]`
- `ext_whitelist: set[str] | None`
- `path_exclude_regex: list[re.Pattern] | None`

### `is_boring(url, opt) -> bool`
Returns True if the URL should be filtered out as “boring”.
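
A rough sketch of how such a filter could combine the three `FilterOptions` fields follows. The default blacklist and the precedence rules (a whitelist, when given, replaces the blacklist; extension-less paths pass the extension check) are assumptions made for the example and may not match `wayparam.filters`.

```python
from __future__ import annotations

import posixpath
import re
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class FilterOptions:
    # Assumed default blacklist of static-asset extensions.
    ext_blacklist: set[str] = field(
        default_factory=lambda: {".png", ".jpg", ".gif", ".css", ".js", ".woff"}
    )
    ext_whitelist: set[str] | None = None
    path_exclude_regex: list[re.Pattern] | None = None


def is_boring(url: str, opt: FilterOptions) -> bool:
    """Return True if the URL should be dropped before normalization."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower()

    if opt.ext_whitelist is not None:
        # Assumed precedence: a whitelist replaces the blacklist entirely.
        if ext and ext not in opt.ext_whitelist:
            return True
    elif ext in opt.ext_blacklist:
        return True

    if opt.path_exclude_regex:
        return any(pattern.search(path) for pattern in opt.path_exclude_regex)
    return False
```

Because the check only looks at the URL path, it needs no network access and can be unit-tested as pure logic.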

## `wayparam.output`

### `UrlRecord`
Fields:
- `domain: str`
- `url: str`
- `source: str` (default: `wayback`)
- `fetched_at: str | None`

### `write_record(fh, rec, fmt)`
Writes one record to a file handle.

### `print_record_stdout(rec, fmt)`
Prints one record to stdout.

### `print_hint_stderr(message)`
Prints diagnostics to stderr.
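
The sketch below shows one way these three functions could fit together for the `txt` and `jsonl` formats mentioned elsewhere in the docs. The helper `format_record` is hypothetical, and the exact serialization (one URL per line for `txt`, one JSON object per line for `jsonl`) is an assumption consistent with the CLI examples rather than a confirmed detail of `wayparam.output`.

```python
from __future__ import annotations

import json
import sys
from dataclasses import asdict, dataclass
from typing import TextIO


@dataclass(frozen=True)
class UrlRecord:
    domain: str
    url: str
    source: str = "wayback"
    fetched_at: str | None = None


def format_record(rec: UrlRecord, fmt: str) -> str:
    """Hypothetical helper: render one record as a single line of text."""
    if fmt == "txt":
        return rec.url
    if fmt == "jsonl":
        return json.dumps(asdict(rec), ensure_ascii=False)
    raise ValueError(f"unsupported format: {fmt!r}")


def write_record(fh: TextIO, rec: UrlRecord, fmt: str) -> None:
    fh.write(format_record(rec, fmt) + "\n")


def print_record_stdout(rec: UrlRecord, fmt: str) -> None:
    write_record(sys.stdout, rec, fmt)


def print_hint_stderr(message: str) -> None:
    # Diagnostics go to stderr so that stdout stays safe to pipe.
    print(message, file=sys.stderr)
```

Keeping diagnostics on stderr is what lets commands such as `... --stdout | sort -u` work without stray log lines in the data.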

docs/architecture.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Architecture

wayparam is intentionally modular. Each module has a single responsibility, which makes the tool easier to test, audit, and package.

## High-level data flow

1. **cli.py**
   - Parses args
   - Builds option objects
   - Orchestrates concurrency
2. **wayback.py**
   - Builds CDX query parameters
   - Handles pagination/resumeKey
3. **http.py**
   - Makes resilient HTTP requests (retries, backoff)
4. **filters.py**
   - Drops “boring” URLs (static assets) early
5. **normalize.py**
   - Canonicalizes and normalizes URLs (stable output)
6. **output.py**
   - Writes records to files and/or stdout (txt/jsonl)
7. **ratelimit.py**
   - Global RPS limiter (optional)
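
As an illustration of the last item, a global requests-per-second limiter shared by concurrent tasks can be sketched with `asyncio` alone. The class name `RateLimiter` and its `acquire` method are hypothetical; the actual interface of `ratelimit.py` is not described here.

```python
import asyncio
import time


class RateLimiter:
    """Spaces out calls so that at most `rps` acquisitions start per second."""

    def __init__(self, rps: float) -> None:
        if rps <= 0:
            raise ValueError("rps must be positive")
        self._interval = 1.0 / rps
        self._lock = asyncio.Lock()
        self._next_at = 0.0  # earliest monotonic time for the next request

    async def acquire(self) -> None:
        # The lock serializes waiters, so the limit holds across all tasks.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_at - now
            if wait > 0:
                await asyncio.sleep(wait)
                now = time.monotonic()
            self._next_at = max(now, self._next_at) + self._interval


async def demo() -> float:
    limiter = RateLimiter(rps=50)  # create inside the running event loop
    start = time.monotonic()
    for _ in range(3):
        await limiter.acquire()    # a real caller would issue a request here
    return time.monotonic() - start
```

Each worker would call `await limiter.acquire()` immediately before a request, so the concurrency setting bounds parallelism while the limiter bounds the overall request rate.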

## Why this structure matters

- Unit tests focus on pure logic (`normalize.py`, `filters.py`, parsing)
- Integration tests mock HTTP at the transport layer (httpx MockTransport)
- CLI stays pipeline-friendly: stdout is clean and predictable

docs/cli/examples.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Practical CLI examples

This page contains practical, copy/paste-friendly examples.

---

## Basic collection

```bash
wayparam -d example.com
```

Writes results to:
- `results/example.com.txt` (default format `txt`)

---

## Multiple domains

```bash
wayparam -l domains.txt
```

---

## Streaming (pipelines)

### Deduplicate and save
```bash
wayparam -d example.com --stdout --no-files | sort -u > urls.txt
```

### Filter by keyword
```bash
wayparam -d example.com --stdout --no-files | grep -i "redirect"
```

### JSONL + jq
```bash
wayparam -d example.com --stdout --no-files --format jsonl | jq -r '.url' | sort -u
```

---

## Subdomains + time filters

```bash
wayparam -d example.com --include-subdomains --from 2018 --to 2021
```

---

## Focus on “dynamic” endpoints

Exclude typical static paths:
```bash
wayparam -d example.com --exclude-path-regex "^/static/" --exclude-path-regex "^/assets/"
```

---

## Being polite to Wayback (recommended)

```bash
wayparam -l domains.txt --rps 1 --concurrency 2
```

---

## Keep parameter values (careful)

```bash
wayparam -d example.com --keep-values
```

Use `--keep-values` only if you understand the privacy implications: archived query strings can contain session tokens, email addresses, and other personal data.
