Storage

Temporary storage

Workers store downloaded pages in a hash-clustered directory tree, much like git stores its objects.

.heroshi/page/
    index
    17/20/ad806beb07813b56ed02e03644d7d82ed5c5
    17/be/8a410deb09af0982164e9e7098798b7022ad
    33/47/60ed9022efd684068a4c8e21086fe834b69b
    34/db/c3803cb36b49ffb58e269475816b8c2bcbc1
    a0/13/0f4ffee25ccabd31ca61583c41e58e8799f2

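File names are SHA-1 hashes of the crawled URLs: the first two pairs of hex characters name two directory levels, and the remaining 36 characters name the file. The index file is a JSON-encoded list of Link objects.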
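
The layout above can be sketched in Python roughly as follows. This is only an illustration, not Heroshi's actual code: the function names `page_path` and `save_page` are hypothetical, and it assumes the URL is hashed as UTF-8 bytes and that the page body is written to the file unchanged.

```python
import hashlib
import os


def page_path(url, root=os.path.join(".heroshi", "page")):
    """Map a URL to its location in the hash-clustered tree.

    Hypothetical helper: assumes the SHA-1 is taken over the UTF-8
    encoded URL and split 2 / 2 / 36 hex characters.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], digest[4:])


def save_page(url, content, root=os.path.join(".heroshi", "page")):
    """Write the page body (bytes) to its hashed path and return the path."""
    path = page_path(url, root)
    # Create the two intermediate directories if they do not exist yet.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(content)
    return path
```

Splitting the hash into two directory levels keeps any single directory from accumulating too many entries, which is the same reason git fans its objects out by hash prefix.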

Permanent storage

Workers send crawled pages, along with their metadata, through the queue manager to a dedicated permanent storage.

At the time of writing, I'm considering Amazon SimpleDB for metadata (Link) and Amazon S3 for the actual URL content.
