Respect robots.txt #37

@jogli5er

Description

For every newly discovered host, check for a robots.txt.
Then, for every URL, we need to check whether we are allowed to access it according to that robots.txt.
This can be done either at insertion time or at dispatch time. For single runs, both options are essentially equivalent; for longer, multi-scrape runs, however, they can differ, as the robots.txt might change in between.
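A minimal sketch of that allow/deny check, assuming a Python-style stack and the standard `urllib.robotparser`; the crawler's actual parser and function names may differ:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched according to the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```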

  • Multiple subdomains can each have their own robots.txt. Since we store the subdomain as part of the path, the lookup to find out whether we already have a robots.txt path for a given subdomain is more costly.
  • Check whether we are allowed to access a given URL. The check needs to be done with the subdomain in mind, see the first point.
  • Download the robots.txt lazily: if no robots.txt path exists yet for the given subdomain, we should add it and eagerly fetch it before continuing the download (see the sketch after this list). Check that no timeouts appear, as we now do two downloads instead of one and therefore occupy a network slot up to twice as long.
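A hedged sketch of the lazy, subdomain-aware lookup described above. The cache is keyed by the full host, so each subdomain gets its own robots.txt entry; the names (`RobotsCache`, `allowed`) and the fetch-on-miss behaviour are illustrative assumptions, not the project's existing API:

```python
from urllib.parse import urlsplit
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Caches one parsed robots.txt per origin (scheme + host, including subdomain)."""

    def __init__(self, user_agent: str = "*", timeout: float = 10.0):
        self.user_agent = user_agent
        self.timeout = timeout
        self._parsers: dict[str, RobotFileParser] = {}  # keyed by "scheme://host"

    def _parser_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in self._parsers:
            # Lazy download: fetch robots.txt the first time we see this subdomain.
            parser = RobotFileParser()
            try:
                with urlopen(origin + "/robots.txt", timeout=self.timeout) as resp:
                    parser.parse(resp.read().decode("utf-8", "replace").splitlines())
            except OSError:
                # Policy choice (assumption): a missing or unreachable robots.txt
                # is treated as allow-all.
                parser.parse([])
            self._parsers[origin] = parser
        return self._parsers[origin]

    def allowed(self, url: str) -> bool:
        return self._parser_for(url).can_fetch(self.user_agent, url)
```

Note that the extra robots.txt fetch on a cache miss is exactly the second download mentioned above, so the per-request timeout budget should account for it.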

Labels

todo: This should be implemented, is planned and a necessity, therefore not an enhancement.
