Commit 4bf918d

Release
0 parents  commit 4bf918d

31 files changed (+1613, -0 lines)

.github/workflows/ci.yml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
name: CI

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install
        run: |
          python -m pip install -U pip
          pip install -e '.[dev]'
      - name: Lint
        run: ruff check .
      - name: Tests
        run: pytest -q

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
# ---------- Outputs / runtime artifacts ----------
results/

# ---------- Python bytecode / caches ----------
__pycache__/
.pytest_cache/

# ---------- Virtual environments ----------
.venv/

# ---------- Secrets ----------
.env

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Changelog

## 0.2.3
- Added httpx-level integration tests using MockTransport (no network)

## 0.2.2
- Fixed pyproject.toml table ordering (project.urls) for editable installs

## 0.2.1
- Added man page (man/wayparam.1)

## 0.2.0
- Added output formats (txt/jsonl) and safer diagnostics to stderr
- Added per-domain stats (optional)
- Improved HTTP error messages with status/no-status
- Added basic tests and CI config

## 0.1.0
- Initial release

CODE_OF_CONDUCT.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Code of Conduct

Be respectful. Assume good intent. No harassment.

CONTRIBUTING.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Contributing

## Dev setup
```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -U pip
pip install -e '.[dev]'
pytest
ruff check .
```

## Guidelines
- Keep stdout machine-readable (URLs / JSONL). Put diagnostics on stderr.
- Add unit tests for parsing/normalization changes.

README.md

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
# wayparam

**wayparam** is a modern, cross-platform CLI tool to **fetch historical URLs from the Internet Archive Wayback CDX API**, filter out “boring” URLs (static assets), and **normalize query parameters** so you can focus on endpoints that actually matter.

This project is **inspired by ParamSpider** (same overall goal, completely rewritten with a more robust architecture, modern async I/O, better filtering, and production-friendly output behavior).

> OSINT tool: **wayparam does not crawl targets**. It only queries the Wayback CDX API.

---

## Key features

- **Wayback CDX API** URL collection (single domain or list)
- **Async + concurrency** for speed on multiple domains
- **Rate limiting** (`--rps`) to be polite with Wayback/CDX
- **Retry + backoff** and clearer error messages
- **CDX pagination** (resumeKey) when available
- Filters “boring” URLs by:
  - extension blacklist/whitelist
  - optional path regex exclusion
- **Canonicalization & normalization**
  - drop fragments
  - normalize host/ports
  - sort parameters
  - mask parameter values (default placeholder: `FUZZ`)
  - optional tracking parameter removal (utm_*, gclid, fbclid, …)
- Output:
  - per-domain files (default)
  - **stdout streaming** for pipelines (`--stdout`)
  - `txt` or `jsonl` output (`--format`)

---

## Installation

### From source (recommended for now)

```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
python -m pip install -U pip
pip install -e .
```

### Development install (tests + lint)

```bash
pip install -e ".[dev]"
```

---

## Quick start

### 1) Single domain (writes to `results/`)

```bash
wayparam -d example.com
```

### 2) List of domains

```bash
wayparam -l domains.txt
```

### 3) Stream to stdout (for piping), no files

```bash
wayparam -d example.com --stdout --no-files
```

### 4) JSONL output (great for tooling)

```bash
wayparam -d example.com --stdout --no-files --format jsonl
```

### 5) Include subdomains + be polite to Wayback

```bash
wayparam -d example.com --include-subdomains --rps 1 --concurrency 2
```

### 6) Customize filtering (extensions + path regex)

```bash
wayparam -d example.com --ext-blacklist ".png,.jpg,.css,.js" --exclude-path-regex "^/static/"
```
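
As a rough illustration of what these two filters express, the Python sketch below drops URLs by path extension and by path regex. It is not wayparam's actual implementation: the function name `is_boring`, the case-insensitive extension match, and the sample URLs are assumptions made for this example.

```python
import re
from urllib.parse import urlsplit


def is_boring(url, ext_blacklist, exclude_path_patterns=()):
    """Return True if the URL path ends with a blacklisted extension
    or matches any of the exclusion regexes (hypothetical helper)."""
    path = urlsplit(url).path
    lowered = path.lower()
    # Extension check looks at the path only, so the query string is ignored.
    if any(lowered.endswith(ext) for ext in ext_blacklist):
        return True
    # Regexes are applied to the path, mirroring a pattern like "^/static/".
    return any(re.search(pattern, path) for pattern in exclude_path_patterns)


EXTS = {".png", ".jpg", ".css", ".js"}
PATTERNS = [r"^/static/"]

urls = [
    "https://example.com/static/app.php?v=1",   # excluded by path regex
    "https://example.com/img/logo.PNG?size=2",  # excluded by extension
    "https://example.com/search.php?q=test",    # kept
]
kept = [u for u in urls if not is_boring(u, EXTS, PATTERNS)]
print(kept)
```

Checking the extension against the parsed path rather than the raw URL keeps a query string such as `?size=2` from hiding a static asset.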
91+
92+
---
93+
94+
## How it works (under the hood)
95+
96+
1. **Input parsing**
97+
98+
* `-d/--domain` for a single host
99+
* `-l/--list` for multiple hosts (one per line, supports comments and basic normalization)
100+
101+
2. **Query the Wayback CDX API**
102+
103+
* Requests are sent to the CDX endpoint (Wayback Machine)
104+
* Uses `matchType=host` by default, or `matchType=domain` when `--include-subdomains` is enabled
105+
* Uses pagination (resumeKey) when the API provides it
106+
107+
3. **Filter “boring” URLs**
108+
109+
* Drops URLs that look like static assets (by extension), with optional whitelist mode
110+
* Optional regex filters can exclude paths (e.g., `/static/`, `/assets/`, …)
111+
112+
4. **Canonicalize + normalize**
113+
114+
* Removes fragments (`#...`)
115+
* Normalizes default ports (`:80`, `:443`)
116+
* Parses query string and:
117+
118+
* replaces values with a placeholder (default `FUZZ`)
119+
* optionally drops tracking parameters
120+
* sorts parameters for stable output
121+
* Deduplicates results
122+
123+
5. **Output**
124+
125+
* By default writes per-domain results into `results/`
126+
* `--stdout` streams machine-readable output
127+
* Diagnostics (hints, logs, stats) go to **stderr** (safe for pipelines)
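
The canonicalization in step 4 can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code wayparam ships: the function name `normalize`, the small tracking-parameter set, and the handling of repeated parameter names are all made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_EXACT = {"gclid", "fbclid"}  # illustrative subset, not wayparam's list


def normalize(url, placeholder="FUZZ", drop_tracking=True):
    """Drop the fragment, strip default ports, mask values, sort parameters."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    params = []
    for name, _value in parse_qsl(parts.query, keep_blank_values=True):
        if drop_tracking and (name.startswith("utm_") or name in TRACKING_EXACT):
            continue
        params.append((name, placeholder))  # mask the original value
    query = urlencode(sorted(params))  # sorted for stable output
    return urlunsplit((parts.scheme, host, parts.path, query, ""))  # "" drops the fragment


raw = [
    "https://Example.com:443/search?q=shoes&utm_source=x&page=2#top",
    "https://example.com/search?page=9&q=boots",
]
unique = sorted({normalize(u) for u in raw})  # deduplicate after normalizing
print(unique)  # ['https://example.com/search?page=FUZZ&q=FUZZ']
```

Deduplicating after masking is what collapses many captures of the same endpoint into a single line: the two sample URLs differ only in values, ordering, port, and fragment, so they normalize to the same string.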

---

## Output behavior (important for pipelines)

* **stdout**: only results (URLs or JSONL) when `--stdout` is enabled
* **stderr**: logs, errors, hints (VPN/proxy), optional stats

This means you can safely do:

```bash
wayparam -d example.com --stdout --no-files | sort -u > urls.txt
```
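
Because stdout carries only results, a downstream script can parse it without filtering out log lines. The sketch below assumes the `txt` format (one normalized URL per line) and groups parameter names by endpoint; the helper name `params_by_endpoint` and the sample lines are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def params_by_endpoint(lines):
    """Map 'host/path' to the sorted parameter names seen for it."""
    seen = defaultdict(set)
    for line in lines:
        url = line.strip()
        if not url:
            continue  # tolerate blank lines
        parts = urlsplit(url)
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            seen[parts.netloc + parts.path].add(name)
    return {endpoint: sorted(names) for endpoint, names in seen.items()}


# Stand-in for lines read from a pipe; any iterable of strings works.
sample = [
    "https://example.com/search?page=FUZZ&q=FUZZ\n",
    "https://example.com/search?lang=FUZZ\n",
    "\n",
    "https://example.com/item?id=FUZZ\n",
]
result = params_by_endpoint(sample)
for endpoint, names in sorted(result.items()):
    print(endpoint, ",".join(names))
```

To use it in a pipeline, pass `sys.stdin` in place of `sample`.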

---

## Common options

### Wayback/CDX

* `--include-subdomains`
* `--from 2019` / `--to 2021` (or full timestamps like `20190101000000`)
* `--filter statuscode:200` (repeatable)
* `--no-collapse` (more duplicates, more data)
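
For orientation, here is one way these options could map onto CDX server query parameters. The parameter names (`url`, `matchType`, `from`, `to`, `filter`, `collapse`, `fl`, `showResumeKey`, `resumeKey`) are documented CDX server parameters, but the mapping itself is an assumption: which field wayparam collapses on, which fields it requests via `fl`, and the function name `build_cdx_query` are not taken from the project.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, include_subdomains=False, from_ts=None, to_ts=None,
                    filters=(), collapse=True, resume_key=None):
    """Build a CDX query URL (hypothetical mapping of the CLI options)."""
    params = [
        ("url", domain),
        ("matchType", "domain" if include_subdomains else "host"),
        ("fl", "original"),         # assumption: only the original URL is needed
        ("showResumeKey", "true"),  # ask the server for a pagination key
    ]
    if from_ts:
        params.append(("from", from_ts))
    if to_ts:
        params.append(("to", to_ts))
    for expr in filters:            # repeatable, like --filter
        params.append(("filter", expr))
    if collapse:
        params.append(("collapse", "urlkey"))  # assumption: collapse on urlkey
    if resume_key:
        params.append(("resumeKey", resume_key))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


url = build_cdx_query("example.com", include_subdomains=True,
                      from_ts="2019", to_ts="2021",
                      filters=["statuscode:200"])
print(url)
```

Passing a list of pairs to `urlencode` rather than a dict keeps repeated `filter` entries intact.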
152+
153+
### Normalization
154+
155+
* `--placeholder X`
156+
* `--keep-values` (not recommended if you share logs)
157+
* `--drop-tracking` / `--no-drop-tracking`
158+
* `--all-urls` (include URLs without query parameters)
159+
160+
### Filtering
161+
162+
* `--ext-blacklist ".png,.jpg,.css,.js"`
163+
* `--ext-whitelist ".php,.asp,.aspx"`
164+
* `--exclude-path-regex "regex"` (repeatable)
165+
166+
### Performance / network
167+
168+
* `--concurrency 8`
169+
* `--rps 1` (recommended when using VPNs / noisy networks)
170+
* `--timeout 30`
171+
* `--retries 4`
172+
* `--proxy http://127.0.0.1:8080`
173+
174+
---
175+
176+
## Troubleshooting: VPN / Proxy issues (Wayback CDX)
177+
178+
If you see errors like “failed after retries” against the CDX endpoint, it often means:
179+
180+
* the VPN/proxy exit node is **blocked** or **rate-limited** by Wayback
181+
* your VPN does TLS filtering or networking policies that break automated requests
182+
183+
Try:
184+
185+
* disconnecting VPN/proxy and rerunning
186+
* switching to a different VPN server
187+
* lowering `--concurrency` and setting `--rps 1`
188+
189+
wayparam will print a **human-readable hint in English** to stderr when it detects this pattern.
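
To give a feel for how long a run can take before “failed after retries” appears, the sketch below computes a capped exponential backoff schedule. The base delay, the cap, and the function name `backoff_delays` are assumptions for illustration; wayparam's real retry timing may differ.

```python
import random


def backoff_delays(retries=4, base=1.0, cap=30.0, jitter=0.0):
    """Yield one wait time in seconds per retry: base * 2**attempt, capped."""
    for attempt in range(retries):
        delay = min(cap, base * 2 ** attempt)
        # Optional random jitter spreads out clients that retry in lockstep.
        yield delay + random.uniform(0, jitter)


schedule = list(backoff_delays())
print(schedule)       # [1.0, 2.0, 4.0, 8.0]
print(sum(schedule))  # 15.0 seconds of waiting before giving up
```

Under these assumed values, four retries add about 15 seconds of waiting per failing request, on top of the per-request timeout.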

---

## Man page

A manual page is included:

```bash
man ./man/wayparam.1
```

---

## Testing

Install dev dependencies and run:

```bash
pip install -e ".[dev]"
pytest -q
```

The test suite includes **httpx-level integration tests** using `httpx.MockTransport` (no network).

---

## License

wayparam is **free software** released under the **GNU General Public License v3 (GPLv3)**.
See the `LICENSE` file for details.

---

## Acknowledgements

* Inspired by **ParamSpider** (same objective: fetch Wayback URLs, filter noise, focus on parameterized endpoints).
* Thanks to the OSINT / security community for patterns and workflows around URL collection and parameter discovery.

---

## Disclaimer

Use responsibly and lawfully. This tool queries the Internet Archive and does not actively scan targets, but your downstream usage of collected URLs may have legal and ethical implications depending on context.

SECURITY.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Security Policy

Report security issues via the project issue tracker (private disclosure if possible).

This tool queries the Internet Archive (Wayback CDX). It does not crawl targets.
