This repository was archived by the owner on Apr 20, 2019. It is now read-only.

Link

temoto edited this page Sep 13, 2010 · 4 revisions

A Link (or URL metadata) is a set of information about a URL.

It consists of:

URL
Last visited timestamp
Response headers
Content pointer: where the response body is located. For now, since Amazon S3 is being considered for storing URL content, this pointer is the key of an object in an S3 bucket.
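
To make the four fields concrete, here is a minimal sketch of a Link as a Python dataclass. The class layout, the field names, and the choice of types are assumptions made for illustration; the page only names the fields and does not say how they are represented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Link:
    """URL metadata. Field names and types are illustrative assumptions."""
    url: str
    # None until the URL has been crawled at least once.
    last_visited: Optional[datetime] = None
    # Headers of the last response, if any.
    response_headers: Optional[Dict[str, str]] = None
    # Where the response body is stored, e.g. an S3 object key.
    content_pointer: Optional[str] = None


# A Link for a URL that has no stored metadata yet.
link = Link(url="http://example.com/")
```

Defaulting the three metadata fields to None means that a Link built from a URL alone already has the shape of a never-crawled URL.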

New URL

When the manager finds out (by querying the database) that a URL has never been crawled before, it creates a new Link with

URL
Last visited timestamp: None
Response headers: None
Content pointer: None

and appends it to the crawling queue, so the next worker asking for a job gets it.
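
The new-URL case can be sketched as follows. This is only an illustration: the in-memory `deque` standing in for the crawling queue, the dict representation of a Link, and the function names `enqueue_new_url` and `next_job` are all assumptions, not the project's actual API.

```python
from collections import deque


def enqueue_new_url(url, queue):
    """Create a Link for a never-crawled URL and append it to the queue."""
    link = {
        "url": url,
        "last_visited": None,       # never visited
        "response_headers": None,   # no response seen yet
        "content_pointer": None,    # no stored body yet
    }
    queue.append(link)
    return link


def next_job(queue):
    """Hand the oldest queued Link to a worker, or None if the queue is empty."""
    return queue.popleft() if queue else None


crawl_queue = deque()
enqueue_new_url("http://example.com/", crawl_queue)
job = next_job(crawl_queue)
```

Appending on one end and popping from the other keeps the queue first-in, first-out, which matches "the next worker asking for a job gets it".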

Known URL

If the manager finds out that the URL was crawled before, it creates a Link with

URL
Last visited timestamp: from database
Response headers: from database
Content pointer: from database

and appends it to the crawling queue, so the next worker asking for a job gets it.
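
Both cases differ only in where the metadata comes from, so they can be sketched as a single function. Everything here is hypothetical: a plain dict stands in for the database, a `deque` for the crawling queue, and `schedule_url`, the sample headers, and the sample content pointer are made up for the example.

```python
from collections import deque
from datetime import datetime


def schedule_url(url, database, queue):
    """Build a Link for `url` and append it to the crawling queue."""
    stored = database.get(url)  # stands in for the request to the database
    if stored is None:
        # New URL: nothing is known about it yet.
        link = {
            "url": url,
            "last_visited": None,
            "response_headers": None,
            "content_pointer": None,
        }
    else:
        # Known URL: copy the metadata recorded by the previous crawl.
        link = {
            "url": url,
            "last_visited": stored["last_visited"],
            "response_headers": stored["response_headers"],
            "content_pointer": stored["content_pointer"],
        }
    queue.append(link)
    return link


database = {
    "http://example.com/": {
        "last_visited": datetime(2010, 9, 13, 12, 0),
        "response_headers": {"Content-Type": "text/html"},
        "content_pointer": "example.com/index",  # hypothetical S3 key
    }
}
crawl_queue = deque()
known = schedule_url("http://example.com/", database, crawl_queue)
new = schedule_url("http://example.org/", database, crawl_queue)
```

A worker that receives a Link for a known URL has the previous timestamp and headers at hand; what it does with them is outside the scope of this page.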