This repository was archived by the owner on Apr 20, 2019. It is now read-only.
temoto edited this page Sep 13, 2010 · 4 revisions

A Link (or URL metadata) is a set of information about a URL.

It consists of:

  • URL
  • Last visited timestamp
  • Response headers
  • Content pointer: where the response body is stored. Currently, since Amazon S3 is being considered for storing URL content, this pointer is an S3 object key.
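
As a rough illustration, the four fields above could be modelled like this. This is only a sketch: the class layout, the field names (`last_visited`, `headers`, `content_key`) and the example key are assumptions made for illustration, not names taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Link:
    """URL metadata. Field names are hypothetical, chosen for this sketch."""
    url: str
    last_visited: Optional[datetime] = None     # None until the URL is crawled
    headers: Optional[Dict[str, str]] = None    # response headers of the last visit
    content_key: Optional[str] = None           # assumed: S3 object key of the stored body


# A Link for a URL that has already been crawled (all values are made up).
link = Link(
    url="http://example.com/",
    last_visited=datetime(2010, 9, 13, 12, 0, 0),
    headers={"Content-Type": "text/html"},
    content_key="bodies/example.com/index",
)
```

Defaulting every field except the URL to `None` means a Link for a URL that was never crawled can be built from the URL alone.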

New URL

When the manager finds, by querying the database, that a URL has never been crawled before, it creates a new Link with

  • URL
  • Last visited timestamp — None
  • Response headers — None
  • Content pointer — None

and appends it to the crawling queue, so the next worker asking for a job gets it.
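
The new-URL case might look roughly like the sketch below. It assumes an in-memory `collections.deque` as the crawling queue; the real queue, the `Link` class layout and the helper names `enqueue_new_url` and `next_job` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Deque, Dict, Optional


@dataclass
class Link:
    """URL metadata. Field names are hypothetical, chosen for this sketch."""
    url: str
    last_visited: Optional[datetime] = None
    headers: Optional[Dict[str, str]] = None
    content_key: Optional[str] = None


def enqueue_new_url(url: str, queue: Deque[Link]) -> Link:
    """Create a Link with every field except the URL set to None and queue it."""
    link = Link(url=url)
    queue.append(link)
    return link


def next_job(queue: Deque[Link]) -> Optional[Link]:
    """Hand the oldest queued Link to a worker, or None if the queue is empty."""
    return queue.popleft() if queue else None


crawl_queue: Deque[Link] = deque()
enqueue_new_url("http://example.com/never-seen", crawl_queue)
job = next_job(crawl_queue)
```

Appending on the right and popping from the left keeps the queue first-in, first-out, so the next worker that asks receives the Link that has waited longest.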

Known URL

If the manager finds that the URL has been crawled before, it creates a Link with

  • URL
  • Last visited timestamp — from database
  • Response headers — from database
  • Content pointer — from database

and appends it to the crawling queue, so the next worker asking for a job gets it.
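
Both cases can be sketched as a single lookup. In the example below a plain dict stands in for the database, and the record layout, the `Link` fields and the `schedule` function are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Deque, Dict, Optional


@dataclass
class Link:
    """URL metadata. Field names are hypothetical, chosen for this sketch."""
    url: str
    last_visited: Optional[datetime] = None
    headers: Optional[Dict[str, str]] = None
    content_key: Optional[str] = None


def schedule(url: str, db: Dict[str, dict], queue: Deque[Link]) -> Link:
    """Build a Link for url (from stored metadata if any) and queue it."""
    record = db.get(url)  # stand-in for the manager's database request
    if record is None:
        # New URL: nothing is known yet, so every other field stays None.
        link = Link(url=url)
    else:
        # Known URL: copy the stored metadata into the Link.
        link = Link(
            url=url,
            last_visited=record["last_visited"],
            headers=record["headers"],
            content_key=record["content_key"],
        )
    queue.append(link)
    return link


# Made-up stored metadata for one previously crawled URL.
db = {
    "http://example.com/": {
        "last_visited": datetime(2010, 9, 13, 12, 0, 0),
        "headers": {"ETag": '"abc123"'},
        "content_key": "bodies/example.com/index",
    }
}
crawl_queue: Deque[Link] = deque()
known = schedule("http://example.com/", db, crawl_queue)
new = schedule("http://example.com/other", db, crawl_queue)
```

Passing the stored timestamp and headers along with the job would let a worker decide how to revisit the URL, for example with a conditional request, although the page itself does not say how workers use them.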
