autogram-is/crawl-configs

Crawl Configs

This repository is a collection of Spidergram configurations for various crawling projects. Anyone at Autogram is welcome to tinker with it, add new examples, and so on.

Setting up Spidergram

  1. Node.js (brew install node; brew install nvm; nvm install node; nvm alias default node)
  2. Containerized ArangoDB (brew install docker docker-compose)
  3. Spidergram proper (npm install -g spidergram)
  4. Gin
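
Once everything is installed, a quick sanity check can confirm the tools are on your PATH. This is a minimal sketch, not part of Spidergram itself: `check_tool` and `check_tools` are hypothetical helper names, and the list of commands (including the assumption that the global npm install exposes a `spidergram` command) is inferred from the steps above.

```shell
#!/bin/sh
# Report whether a single command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check several commands; fail if any of them is missing.
check_tools() {
  status=0
  for tool in "$@"; do
    check_tool "$tool" || status=1
  done
  return "$status"
}

# Assumed tool names, taken from the setup list above.
check_tools node npm docker docker-compose spidergram \
  || echo "Install the missing tools before crawling."
```

Anything reported as missing maps back to one of the numbered steps.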

Setting up a crawl

This repository is set up to ignore the storage and output directories that Spidergram uses to store crawled data, downloaded files, and generated reports. It also ignores any files named arango.config.*, so you can stick database credentials there if you're not using a local docker container.

This setup also means you can run crawls in these directories without worrying that gigs of data will accidentally get checked into the repository.
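
If you want to see how ignore rules like these behave before trusting them with a big crawl, `git check-ignore` will tell you. The sketch below builds a throwaway repository rather than touching this one. The three patterns are an assumption based on the description above (the real .gitignore may be written differently), and the `ethan/` paths and `is_ignored` helper are hypothetical.

```shell
#!/bin/sh
# Work in a scratch repository so nothing here affects real data.
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .

# Assumed patterns, inferred from the README text.
printf '%s\n' 'storage/' 'output/' 'arango.config.*' > .gitignore

# Hypothetical project layout for one crawl.
mkdir -p ethan/storage ethan/output
touch ethan/storage/page.html ethan/output/report.xlsx \
      ethan/arango.config.json ethan/config.json

# is_ignored PATH: succeeds when git would ignore PATH.
is_ignored() {
  git check-ignore -q "$1"
}

for path in ethan/storage/page.html ethan/output/report.xlsx \
            ethan/arango.config.json ethan/config.json; do
  if is_ignored "$path"; then
    echo "ignored: $path"
  else
    echo "tracked: $path"
  fi
done
```

In a real checkout, running `git check-ignore -v <path>` from the repository root also shows which pattern matched.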

Accessing a shared crawl

  • Get the DB credentials
  • Get the storage archive

Running a local crawl

  • Starting Arango in Docker

The crawl list

  • ethan: Ethan Marcotte's home page. We know it and love it.
  • wiad: World IA Day. Around 3,000 pages total, with its blog hanging out on a separate Medium.com account.
  • schwab: A dozen or two subsites with ~7,000 crawled pages.
  • va: The Department of Veterans Affairs, a government agency with about 200 distinct subdomains, 100k+ crawled HTML pages, and 50k or so downloaded files.
  • uli: The Urban Land Institute, with piles of local and SIG subsites and a high-profile magazine site used as a hub. These settings are copied over from the verrrry old 0.5.0-era custom code we used to do the first crawl.

About

Spidergram configuration files for presales and internal projects