Summary
Respect robots.txt and leverage sitemaps for clearnet crawls, with an explicit override for Tor research.
Motivation
- Follow industry norms by default
- Allow explicit override for onion space
Scope
- Robots parse + cache per host
- Merge with allow/deny host rules
- --ignore-robots flag (default: false)
- Optional sitemap discovery as seeds
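The scope items above can be sketched with Python's stdlib robots parser. This is a minimal illustration, not the implementation; the class and method names (RobotsCache, allowed, sitemaps) are hypothetical, and the error handling assumes an unreachable robots.txt is treated as allow-all:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Hypothetical per-host robots.txt cache with an override flag."""

    def __init__(self, ignore_robots=False):
        self.ignore_robots = ignore_robots  # mirrors the --ignore-robots flag
        self._parsers = {}  # host -> RobotFileParser (or None if fetch failed)

    def _parser_for(self, url):
        host = urlparse(url).netloc
        if host not in self._parsers:
            rp = RobotFileParser(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None  # unreachable robots.txt: treat as allow-all
            self._parsers[host] = rp
        return self._parsers[host]

    def allowed(self, url, user_agent="*"):
        # Explicit override for onion-space research: skip checks entirely.
        if self.ignore_robots:
            return True
        rp = self._parser_for(url)
        return True if rp is None else rp.can_fetch(user_agent, url)

    def sitemaps(self, url):
        # Sitemap URLs declared in robots.txt, usable as crawl seeds.
        rp = self._parser_for(url)
        return (rp.site_maps() or []) if rp else []
```

Merging with per-host allow/deny rules would sit in front of this cache: a deny rule short-circuits before the robots check, and robots.txt only applies to hosts that pass the allow list.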
Acceptance Criteria
- Disallow rules on known test sites are honored by default
- --ignore-robots bypasses robots.txt checks when set
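The acceptance criteria imply a boolean CLI flag that defaults to off. A possible wiring with argparse (the program name "crawler" is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(prog="crawler")
parser.add_argument(
    "--ignore-robots",
    action="store_true",  # default False: robots.txt is honored
    help="bypass robots.txt checks (intended for Tor/onion research)",
)

# Without the flag, checks stay on; with it, they are bypassed.
default_args = parser.parse_args([])
override_args = parser.parse_args(["--ignore-robots"])
```

An acceptance test would assert both parses and then verify the crawler fetches a disallowed path only in the override case.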
Tasks