The build_db.py script implements efficient scraping with per-award caching. This means you only re-scrape the awards you need, and load everything else from cache.
- Individual Caches: Each award is saved to its own file in `data/cache/` (e.g., `hugo.csv`, `giller.csv`)
- Selective Scraping: Specify which awards to re-scrape
- Auto-Loading: Awards not being scraped are loaded from cache
- No Redundancy: Stop hitting Wikipedia/Wikidata for data you already have!
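The core idea behind these features can be sketched as a load-or-scrape decision per award. This is an illustrative sketch, not the actual `build_db.py` internals; the helper names (`load_or_scrape`, `load_cache`, `save_cache`) are assumptions:

```python
import csv
from pathlib import Path

CACHE_DIR = Path("data/cache")

def load_cache(path):
    """Read a per-award CSV cache into a list of row dicts."""
    with path.open(newline="") as f:
        return list(csv.DictReader(f))

def save_cache(path, rows):
    """Write scraped rows to the award's cache file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

def load_or_scrape(award, to_scrape, scraper):
    """Scrape only when requested (or never cached); otherwise read the per-award CSV."""
    cache_file = CACHE_DIR / f"{award}.csv"
    if award in to_scrape or not cache_file.exists():
        rows = scraper(award)          # hits Wikipedia/Wikidata
        save_cache(cache_file, rows)   # refresh this award's cache
        return rows
    return load_cache(cache_file)      # no network traffic at all
```

Because each award gets its own file, a failed or rate-limited scrape of one award never invalidates the others.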
```shell
python3 build_db.py --list-awards
```

This shows all available awards and whether they're cached:

```
Available awards:
hugo ✓ cached
nebula ✓ cached
giller ✗ not cached
...
```
```shell
# Scrape only Giller (load everything else from cache)
python3 build_db.py --scrape giller

# Scrape multiple awards
python3 build_db.py --scrape giller,booker,hugo

# Scrape everything (initial cache building)
python3 build_db.py --scrape-all

# Build database from cached data (no scraping)
python3 build_db.py
```

```shell
# Cache all awards initially
python3 build_db.py --scrape-all
```

This takes ~10 minutes but only needs to be done once.

```shell
# Re-scrape just Giller, keep everything else from cache
python3 build_db.py --scrape giller
```

This takes ~5 seconds instead of 10 minutes!
If you get rate-limited on Booker and International Booker:

```shell
# Wait 30 minutes, then:
python3 build_db.py --scrape booker,international_booker
```

This only re-scrapes those two awards and loads the other 9 from cache.
```shell
# Just rebuild the combined CSV from cached data
python3 build_db.py
```

Useful if you want to change normalization or deduplication logic.
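For example, a deduplication pass over the cached rows might key on (prize, year, title, author) and keep the first occurrence. This is an illustrative sketch of the kind of logic you could change, not the script's actual implementation:

```python
def dedupe(rows):
    """Keep the first row seen for each (prize, year, title, author) key."""
    seen = set()
    out = []
    for row in rows:
        # Normalize title/author so trivial whitespace or case
        # differences between sources don't create duplicates.
        key = (row["prize"], row["year"],
               row["title"].strip().lower(), row["author"].strip().lower())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

Because the caches are already on disk, iterating on a function like this takes seconds rather than a full re-scrape.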
The following awards can be scraped individually:
Wikidata-based (SPARQL queries):
- `hugo` - Hugo Award for Best Novel
- `nebula` - Nebula Award for Best Novel
- `clarke` - Arthur C. Clarke Award
- `booker` - Booker Prize
- `international_booker` - International Booker Prize
- `locus_scifi` - Locus Award for Science Fiction Novel
- `locus_fantasy` - Locus Award for Fantasy Novel
- `locus_horror` - Locus Award for Horror Novel
Wikipedia-based (HTML scraping):
- `agatha` - Agatha Award (all novel categories)
- `national_book_award` - National Book Award for Fiction
- `giller` - Giller Prize
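Internally, a registry mapping each award slug to its source type is one natural way to drive the two scraping paths. The structure below is a hypothetical sketch built from the list above, not the actual `build_db.py` code:

```python
# Hypothetical registry: slug -> (source type, human-readable name)
AWARDS = {
    "hugo":                 ("wikidata",  "Hugo Award for Best Novel"),
    "nebula":               ("wikidata",  "Nebula Award for Best Novel"),
    "clarke":               ("wikidata",  "Arthur C. Clarke Award"),
    "booker":               ("wikidata",  "Booker Prize"),
    "international_booker": ("wikidata",  "International Booker Prize"),
    "locus_scifi":          ("wikidata",  "Locus Award for Science Fiction Novel"),
    "locus_fantasy":        ("wikidata",  "Locus Award for Fantasy Novel"),
    "locus_horror":         ("wikidata",  "Locus Award for Horror Novel"),
    "agatha":               ("wikipedia", "Agatha Award"),
    "national_book_award":  ("wikipedia", "National Book Award for Fiction"),
    "giller":               ("wikipedia", "Giller Prize"),
}

def slugs_for(source):
    """Return all award slugs scraped from the given source type."""
    return [slug for slug, (src, _) in AWARDS.items() if src == source]
```

A registry like this also makes `--list-awards` and `--scrape` validation trivial: unknown slugs can be rejected with a single dictionary lookup.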
- Speed: Selective scraping is 100x faster
- Efficiency: Don't re-download data you already have
- Resilience: Rate limiting only affects one award
- Flexibility: Easy to update just what's changed
The legacy build script has been removed. Use `python3 build_db.py` (with optional `--scrape` or `--scrape-all`) for all scraping and cache refreshes.
- Location: `data/cache/*.csv`
- Format: Same as canonical database (prize, category, year, status, title, author)
- Clearing: Delete `data/cache/` to force a full rebuild
- Per-award: Delete specific files to re-scrape just those awards