This repository was archived by the owner on Apr 20, 2019. It is now read-only.
Link
temoto edited this page Sep 13, 2010
A Link consists of:
- URL
- Last visited timestamp
- Response headers
- Content pointer: where the response body is located. At the moment, Amazon S3 is being considered for storing URL content, so this pointer is the key of an object in an S3 bucket.
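As a minimal sketch, the record described above could be modelled as a Python dataclass. The class layout, the field names, and the example URL are assumptions made for illustration, not the project's actual code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Link:
    """Hypothetical model of a Link record; field names are illustrative."""
    url: str
    # None means the URL has not been visited yet.
    last_visited: Optional[datetime] = None
    # Headers of the last response, if any.
    response_headers: Optional[Dict[str, str]] = None
    # Where the response body is stored, e.g. an S3 object key.
    content_pointer: Optional[str] = None


# A Link for a URL that has never been crawled carries only the URL.
link = Link(url="http://example.com/")
```

Defaulting every field except the URL to `None` keeps the "never crawled" case as the simplest one to construct.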
When the manager finds that a URL has never been crawled before (by querying the database), it creates a new Link with
- URL
- Last visited timestamp: None
- Response headers: None
- Content pointer: None
and appends it to the crawling queue, so that the next worker asking for a job gets it.
If the manager finds that the URL was crawled before, it creates a Link with
- URL
- Last visited timestamp: from the database
- Response headers: from the database
- Content pointer: from the database
and appends it to the crawling queue, so that the next worker asking for a job gets it.
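The two cases above can be sketched as a single function. This is only an illustration under stated assumptions: the database is stood in for by a plain dict keyed by URL, the crawling queue by a `collections.deque`, and the names `Link`, `enqueue_url`, and the example S3 key are hypothetical, not taken from the project.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Deque, Dict, Optional


@dataclass
class Link:
    """Hypothetical model of a Link record; field names are illustrative."""
    url: str
    last_visited: Optional[datetime] = None
    response_headers: Optional[Dict[str, str]] = None
    content_pointer: Optional[str] = None


def enqueue_url(url: str, database: Dict[str, Link], queue: Deque[Link]) -> Link:
    """Build a Link for `url` and append it to the crawling queue."""
    stored = database.get(url)  # stand-in for the database request
    if stored is None:
        # Never crawled: everything except the URL stays None.
        link = Link(url=url)
    else:
        # Crawled before: copy the known metadata from the database.
        link = Link(
            url=url,
            last_visited=stored.last_visited,
            response_headers=stored.response_headers,
            content_pointer=stored.content_pointer,
        )
    queue.append(link)
    return link


# Assumed example data: one URL already known to the database.
database = {
    "http://example.com/seen": Link(
        url="http://example.com/seen",
        last_visited=datetime(2010, 9, 13),
        response_headers={"Content-Type": "text/html"},
        content_pointer="hypothetical-s3-key",
    )
}
queue: Deque[Link] = deque()

new_link = enqueue_url("http://example.com/new", database, queue)
old_link = enqueue_url("http://example.com/seen", database, queue)

# The next worker asking for a job takes the oldest queued Link.
next_job = queue.popleft()
```

Passing the stored metadata along with the job would let a worker reuse it, for example to decide whether the content needs to be fetched again, although the page does not say how workers use these fields.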