Respect robots.txt #37

@jogli5er

Description

For every newly discovered host, check for a robots.txt.
Then, for every URL, we need to check whether we are allowed to access it according to that robots.txt.
This can be done either at insertion time or at dispatch time. For single runs, both options are essentially equivalent; for longer, multi-scrape runs, however, they can differ, as the robots.txt might change in between.
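A minimal sketch of that allow/deny check, assuming a Python-style stack and the standard `urllib.robotparser`; the crawler's actual parser and function names may differ:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched according to the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```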

  • Multiple subdomains can each have their own robots.txt. Since we store the subdomain as part of the path, the lookup to find out whether we already have a robots.txt path for a given subdomain is more costly.
  • Check whether we are allowed to access a given URL. The check needs to be done with the subdomain in mind, see the first point.
  • Download the robots.txt lazily: if no robots.txt path exists yet for the given subdomain, we should add it and eagerly fetch it before continuing the download (see the sketch after this list). Check that no timeouts appear, as we now do two downloads instead of one and therefore occupy a network slot up to twice as long.
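A hedged sketch of the lazy, subdomain-aware lookup described above. The cache is keyed by the full host, so each subdomain gets its own robots.txt entry; the names (`RobotsCache`, `allowed`) and the fetch-on-miss behaviour are illustrative assumptions, not the project's existing API:

```python
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Caches one parsed robots.txt per origin (scheme + host, including subdomain)."""

    def __init__(self, user_agent: str = "*", timeout: float = 10.0):
        self.user_agent = user_agent
        self.timeout = timeout
        self._parsers: dict[str, RobotFileParser] = {}  # keyed by "scheme://host"

    def _parser_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._parsers:
            # Lazy download: fetch robots.txt the first time we see this subdomain.
            parser = RobotFileParser()
            try:
                with urlopen(origin + "/robots.txt", timeout=self.timeout) as resp:
                    parser.parse(resp.read().decode("utf-8", "replace").splitlines())
            except OSError:
                # Policy choice (assumption): a missing or unreachable robots.txt
                # is treated as allow-all.
                parser.parse([])
            self._parsers[origin] = parser
        return self._parsers[origin]

    def allowed(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)
```

Note that the extra robots.txt fetch on a cache miss is exactly the second download mentioned above, so the per-request timeout budget should account for it.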

Labels

todo: This should be implemented, is planned and a necessity, therefore not an enhancement.
