Summary
Respect robots.txt and leverage sitemaps for clearnet crawls, with an explicit override for Tor research.
Motivation
- Follow industry norms by default
- Allow explicit override for onion space
Scope
- Robots parse + cache per host
- Merge with allow/deny host rules
- --ignore-robots flag (default: false)
- Optional sitemap discovery as seeds
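The scope items above can be sketched with Python's stdlib robots parser. This is a minimal illustration, not the implementation; the class and method names (RobotsCache, allowed, sitemaps) are hypothetical, and the error handling assumes an unreachable robots.txt is treated as allow-all:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Hypothetical per-host robots.txt cache with an override flag."""

    def __init__(self, ignore_robots=False):
        self.ignore_robots = ignore_robots  # mirrors the --ignore-robots flag
        self._parsers = {}  # host -> RobotFileParser (or None if fetch failed)

    def _parser_for(self, url):
        host = urlparse(url).netloc
        if host not in self._parsers:
            rp = RobotFileParser(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None  # unreachable robots.txt: treat as allow-all
            self._parsers[host] = rp
        return self._parsers[host]

    def allowed(self, url, user_agent="*"):
        # Explicit override for onion-space research: skip checks entirely.
        if self.ignore_robots:
            return True
        rp = self._parser_for(url)
        return True if rp is None else rp.can_fetch(user_agent, url)

    def sitemaps(self, url):
        # Sitemap URLs declared in robots.txt, usable as crawl seeds.
        rp = self._parser_for(url)
        return (rp.site_maps() or []) if rp else []
```

Merging with per-host allow/deny rules would sit in front of this cache: a deny rule short-circuits before the robots check, and robots.txt only applies to hosts that pass the allow list.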
Acceptance Criteria
- Disallow rules on known test sites are honored by default
- --ignore-robots bypasses robots.txt checks when set
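The acceptance criteria imply a boolean CLI flag that defaults to off. A possible wiring with argparse (the program name "crawler" is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(prog="crawler")
parser.add_argument(
    "--ignore-robots",
    action="store_true",  # default False: robots.txt is honored
    help="bypass robots.txt checks (intended for Tor/onion research)",
)

# Without the flag, checks stay on; with it, they are bypassed.
default_args = parser.parse_args([])
override_args = parser.parse_args(["--ignore-robots"])
```

An acceptance test would assert both parses and then verify the crawler fetches a disallowed path only in the override case.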
Tasks