# wayparam

**wayparam** is a modern, cross-platform CLI tool to **fetch historical URLs from the Internet Archive Wayback CDX API**, filter out “boring” URLs (static assets), and **normalize query parameters** so you can focus on endpoints that actually matter.

This project is **inspired by ParamSpider** (same overall goal, completely rewritten with a more robust architecture, modern async I/O, better filtering, and production-friendly output behavior).

> OSINT tool: **wayparam does not crawl targets**. It only queries the Wayback CDX API.

---

## Key features

- **Wayback CDX API** URL collection (single domain or list)
- **Async + concurrency** for speed on multiple domains
- **Rate limiting** (`--rps`) to be polite with Wayback/CDX
- **Retry + backoff** and clearer error messages
- **CDX pagination** (resumeKey) when available
- Filters “boring” URLs by:
  - extension blacklist/whitelist
  - optional path regex exclusion
- **Canonicalization & normalization**
  - drop fragments
  - normalize host/ports
  - sort parameters
  - mask parameter values (default placeholder: `FUZZ`)
  - optional tracking parameter removal (utm_*, gclid, fbclid, …)
- Output:
  - per-domain files (default)
  - **stdout streaming** for pipelines (`--stdout`)
  - `txt` or `jsonl` output (`--format`)

---

## Installation

### From source (recommended for now)

```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
python -m pip install -U pip
pip install -e .
```

### Development install (tests + lint)

```bash
pip install -e ".[dev]"
```

---

## Quick start

### 1) Single domain (writes to `results/`)

```bash
wayparam -d example.com
```

### 2) List of domains

```bash
wayparam -l domains.txt
```

### 3) Stream to stdout (for piping), no files

```bash
wayparam -d example.com --stdout --no-files
```

### 4) JSONL output (great for tooling)

```bash
wayparam -d example.com --stdout --no-files --format jsonl
```

### 5) Include subdomains + be polite to Wayback

```bash
wayparam -d example.com --include-subdomains --rps 1 --concurrency 2
```

### 6) Customize filtering (extensions + path regex)

```bash
wayparam -d example.com --ext-blacklist ".png,.jpg,.css,.js" --exclude-path-regex "^/static/"
```

---

## How it works (under the hood)

1. **Input parsing**

   * `-d/--domain` for a single host
   * `-l/--list` for multiple hosts (one per line; supports comments and basic normalization)

2. **Query the Wayback CDX API**

   * Requests are sent to the CDX endpoint (Wayback Machine)
   * Uses `matchType=host` by default, or `matchType=domain` when `--include-subdomains` is enabled
   * Uses pagination (resumeKey) when the API provides it

3. **Filter “boring” URLs**

   * Drops URLs that look like static assets (by extension), with an optional whitelist mode
   * Optional regex filters can exclude paths (e.g., `/static/`, `/assets/`, …)

4. **Canonicalize + normalize**

   * Removes fragments (`#...`)
   * Normalizes default ports (`:80`, `:443`)
   * Parses the query string and:

     * replaces values with a placeholder (default `FUZZ`)
     * optionally drops tracking parameters
     * sorts parameters for stable output
   * Deduplicates results

5. **Output**

   * By default, writes per-domain results into `results/`
   * `--stdout` streams machine-readable output
   * Diagnostics (hints, logs, stats) go to **stderr** (safe for pipelines)
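
The canonicalization step (4) can be sketched roughly like this, using only the standard library. This is a minimal, hypothetical illustration, not wayparam's actual implementation; the function name, the tracking-parameter list, and the defaults are assumptions:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking-parameter list; wayparam's real list may differ.
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize(url: str, placeholder: str = "FUZZ", drop_tracking: bool = True) -> str:
    parts = urlsplit(url)
    # Lowercase the host and strip default ports (:80 for http, :443 for https).
    netloc = parts.netloc.lower()
    for scheme, port in (("http", ":80"), ("https", ":443")):
        if parts.scheme == scheme and netloc.endswith(port):
            netloc = netloc[: -len(port)]
    # Mask parameter values, optionally drop tracking params, sort for stable output.
    params = []
    for key, _value in parse_qsl(parts.query, keep_blank_values=True):
        if drop_tracking and key.lower() in TRACKING:
            continue
        params.append((key, placeholder))
    query = urlencode(sorted(params))
    # Rebuild the URL without the fragment (#...).
    return urlunsplit((parts.scheme, netloc, parts.path, query, ""))
```

After this step, URLs that differ only in parameter values or ordering collapse to the same line, which is what makes the final deduplication effective.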
---

## Output behavior (important for pipelines)

* **stdout**: only results (URLs or JSONL), and only when `--stdout` is enabled
* **stderr**: logs, errors, hints (VPN/proxy), optional stats

This means you can safely do:

```bash
wayparam -d example.com --stdout --no-files | sort -u > urls.txt
```

---

## Common options

### Wayback/CDX

* `--include-subdomains`
* `--from 2019` / `--to 2021` (or full timestamps like `20190101000000`)
* `--filter statuscode:200` (repeatable)
* `--no-collapse` (more duplicates, more data)
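
For orientation, the CDX request these options map onto looks roughly like this. The helper below is a hand-rolled sketch against the public CDX endpoint; the query-parameter names come from the public CDX API documentation, not from wayparam's source:

```python
from urllib.parse import urlencode

CDX = "https://web.archive.org/cdx/search/cdx"

def cdx_url(domain: str, include_subdomains: bool = False, **extra: str) -> str:
    """Build a Wayback CDX query URL (illustrative helper, not wayparam's code)."""
    params = {
        "url": domain,
        "matchType": "domain" if include_subdomains else "host",
        "output": "text",
        "fl": "original",         # return only the original-URL column
        "collapse": "urlkey",     # collapse near-duplicate captures
        "showResumeKey": "true",  # enable pagination via resumeKey
    }
    params.update(extra)          # e.g. from/to timestamps, filter=statuscode:200
    return f"{CDX}?{urlencode(params)}"
```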

### Normalization

* `--placeholder X`
* `--keep-values` (not recommended if you share logs)
* `--drop-tracking` / `--no-drop-tracking`
* `--all-urls` (include URLs without query parameters)

### Filtering

* `--ext-blacklist ".png,.jpg,.css,.js"`
* `--ext-whitelist ".php,.asp,.aspx"`
* `--exclude-path-regex "regex"` (repeatable)
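
A rough sketch of how these filters compose (illustrative only; the function name and defaults are assumptions, not wayparam's code):

```python
import re
from urllib.parse import urlsplit

def make_filter(ext_blacklist=(), ext_whitelist=(), exclude_path_regexes=()):
    """Return a predicate that keeps 'interesting' URLs."""
    patterns = [re.compile(p) for p in exclude_path_regexes]

    def keep(url: str) -> bool:
        path = urlsplit(url).path.lower()
        if ext_whitelist:
            # Whitelist mode: keep only listed extensions (and extension-less paths).
            if "." in path.rsplit("/", 1)[-1] and not path.endswith(tuple(ext_whitelist)):
                return False
        elif path.endswith(tuple(ext_blacklist)):
            return False
        # Drop anything matching an excluded-path regex.
        return not any(p.search(path) for p in patterns)

    return keep
```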

### Performance / network

* `--concurrency 8`
* `--rps 1` (recommended when using VPNs / noisy networks)
* `--timeout 30`
* `--retries 4`
* `--proxy http://127.0.0.1:8080`
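
The `--rps` behavior corresponds to a classic async rate-limiting pattern. A minimal sketch of that pattern (hypothetical, not wayparam's actual limiter):

```python
import asyncio
import time

class RateLimiter:
    """Spaces calls at least 1/rps seconds apart, even across concurrent tasks."""

    def __init__(self, rps: float):
        self._interval = 1.0 / rps
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        # Reserve the next send slot under the lock, then sleep outside it.
        async with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)
```

Each worker awaits `wait()` before issuing a request, so total throughput stays near the configured rate regardless of `--concurrency`.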

---

## Troubleshooting: VPN / proxy issues (Wayback CDX)

If you see errors like “failed after retries” against the CDX endpoint, it often means:

* the VPN/proxy exit node is **blocked** or **rate-limited** by Wayback
* your VPN performs TLS filtering or enforces network policies that break automated requests

Try:

* disconnecting the VPN/proxy and rerunning
* switching to a different VPN server
* lowering `--concurrency` and setting `--rps 1`

wayparam will print a **human-readable hint in English** to stderr when it detects this pattern.

---

## Man page

A manual page is included:

```bash
man ./man/wayparam.1
```

---

## Testing

Install the dev dependencies and run:

```bash
pip install -e ".[dev]"
pytest -q
```

The test suite includes **httpx-level integration tests** using `httpx.MockTransport` (no network access required).

---

## License

wayparam is **free software** released under the **GNU General Public License v3 (GPLv3)**.
See the `LICENSE` file for details.

---

## Acknowledgements

* Inspired by **ParamSpider** (same objective: fetch Wayback URLs, filter noise, focus on parameterized endpoints).
* Thanks to the OSINT / security community for patterns and workflows around URL collection and parameter discovery.

---

## Disclaimer

Use this tool responsibly and lawfully. wayparam only queries the Internet Archive and does not actively scan targets, but your downstream use of the collected URLs may have legal and ethical implications depending on context.