
add a crawl option to not recrawl already crawled pages #507

@boogheta

Plan of action:

  • add an "expand_crawl" option that takes a new crawl depth (open questions: should the new depth be added to the original one? should there be a max_expanded_depth setting?)
  • monkeypatch the spider's _request function so that it first checks whether the page already exists in MongoDB; if it does, skip the request and feed the spider directly with the lrulinks stored in MongoDB
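
A rough sketch of the idea in the second point, in plain Python rather than against the real spider: it does not patch `_request`, and a dict stands in for the MongoDB pages collection. The names `crawl`, `fetch_links`, `stored_pages` and `max_depth` are hypothetical, and `max_depth` is treated here as an absolute depth, which is only one possible answer to the open question above.

```python
from collections import deque

def crawl(start_url, fetch_links, stored_pages, max_depth):
    """Breadth-first crawl that reuses stored links instead of refetching.

    stored_pages: dict url -> list of links (stand-in for the stored lrulinks)
    fetch_links:  callable url -> list of links (stand-in for a real request)
    Returns (fetched, reused): urls actually requested, urls served from storage.
    """
    fetched, reused = [], []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in stored_pages:
            # Page already crawled: skip the request, reuse its stored links.
            links = stored_pages[url]
            reused.append(url)
        else:
            links = fetch_links(url)
            stored_pages[url] = links
            fetched.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the requested depth
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched, reused

# Toy link graph used in place of the live web (assumption for the demo).
web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
store = {}
first = crawl("a", lambda u: web.get(u, []), store, max_depth=1)
second = crawl("a", lambda u: web.get(u, []), store, max_depth=2)
```

In this toy run the first crawl requests a, b and c; the expanded crawl at depth 2 reuses those three from storage and only requests d.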

Linked with #158, which could be implemented at the same time now that this feature is coming in.

Extra: add a RickRoll/Recrawl easter egg!
