Open
Labels
todo — This should be implemented, is planned and a necessity, therefore not an enhancement.
Description
For every newly discovered host, check for a robots.txt.
Then, for every URL, check whether access is allowed according to that robots.txt.
This can be done either at insertion time or at dispatch time. For single runs the two options are practically identical; for longer, multi-scrape runs, however, they can differ, as the robots.txt might change in the meantime.
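The per-URL allow check itself can be sketched with Python's stdlib robot parser; in the real pipeline the robots.txt body would come from our own download path, here it is fed to the parser directly:

```python
# Minimal sketch of the allow check, using Python's stdlib parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed path:
print(parser.can_fetch("*", "https://sub.example.com/private/page"))  # False
# Allowed path:
print(parser.can_fetch("*", "https://sub.example.com/public/page"))   # True
```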
- Multiple subdomains can each have their own robots.txt. Since we store the subdomain as part of the path, finding out whether we already have a robots.txt path for a given subdomain requires a more costly lookup.
- Check whether we are allowed to access a given URL. The check needs to take the subdomain into account, see the first point.
- Download the robots.txt lazily: if no path with a robots.txt exists yet for the given subdomain, add it and fetch it eagerly before continuing the download. Check that no timeouts occur, as we now perform two downloads instead of one and therefore occupy a network slot up to twice as long.
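The lazy, subdomain-aware flow described above could look roughly like this. All names here (`RobotsCache`, `fetch_robots_txt`) are illustrative, not from the actual codebase; the real implementation would hook into the existing download queue and its timeout handling:

```python
# Hypothetical sketch: a per-subdomain robots.txt cache that fetches lazily.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

class RobotsCache:
    def __init__(self, fetch_robots_txt):
        self._fetch = fetch_robots_txt  # callable: host -> robots.txt body
        self._parsers = {}              # host (incl. subdomain) -> parser

    def allowed(self, user_agent, url):
        host = urlsplit(url).netloc
        parser = self._parsers.get(host)
        if parser is None:
            # Lazy path: no robots.txt known for this subdomain yet, so
            # fetch it eagerly before the actual download continues.
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(user_agent, url)

# Usage with a stubbed fetcher standing in for the real download:
cache = RobotsCache(lambda host: "User-agent: *\nDisallow: /private/\n")
print(cache.allowed("*", "https://sub.example.com/private/x"))  # False
print(cache.allowed("*", "https://sub.example.com/ok"))         # True
```

The cache key is the full `netloc`, so `a.example.com` and `b.example.com` get separate parsers, matching the first point about per-subdomain robots.txt files.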