This repository was archived by the owner on Apr 20, 2019. It is now read-only.

Storage

temoto edited this page Sep 13, 2010 · 9 revisions

Temporary storage

Workers store downloaded pages in a directory tree sharded by hash, much as git stores its objects.

.heroshi/page/
index
17 / 20 / ad806beb07813b56ed02e03644d7d82ed5c5
17 / be / 8a410deb09af0982164e9e7098798b7022ad
33 / 47 / 60ed9022efd684068a4c8e21086fe834b69b
34 / db / c3803cb36b49ffb58e269475816b8c2bcbc1
a0 / 13 / 0f4ffee25ccabd31ca61583c41e58e8799f2

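Filenames are SHA-1 hashes of the crawled URLs; the first two pairs of hex digits become directory names and the remaining 36 digits become the file name. The index file is a JSON-encoded list of Link objects.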
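
A minimal sketch of how such a path could be derived from a URL, assuming the layout shown in the listing above (two 2-character directory levels, then the rest of the hash). The helper name `page_path`, the `root` default, and the choice to hash the UTF-8 bytes of the URL without any normalization are assumptions made for illustration, not taken from the Heroshi source.

```python
import hashlib
import os


def page_path(url, root=".heroshi/page"):
    """Return the sharded storage path for a crawled URL (hypothetical helper)."""
    # Assumption: the URL is hashed as UTF-8 bytes, with no normalization.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    # Two 2-character directory levels, then the remaining 36 characters.
    return os.path.join(root, digest[:2], digest[2:4], digest[4:])


if __name__ == "__main__":
    print(page_path("http://example.com/"))
```

Splitting on the leading hex digits keeps any single directory from accumulating too many entries, which is the same motivation behind git's object directory fan-out.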

Permanent storage

Workers send crawled pages, along with their metadata, through the queue manager to a dedicated storage backend.

At the time of writing, I’m considering Amazon SimpleDB for metadata (see Link) and Amazon S3 for the actual URL content.
