Efficient Scraping with Caching

The build_db.py script implements efficient scraping with per-award caching. This means you only re-scrape the awards you need, and load everything else from cache.

How It Works

Individual Caches: Each award is saved to its own file in data/cache/ (e.g., hugo.csv, giller.csv)
Selective Scraping: Specify which awards to re-scrape
Auto-Loading: Awards not being scraped are loaded from cache
No Redundancy: Wikipedia and Wikidata are not queried again for data already in the cache
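
The four points above can be sketched roughly as follows. This is a minimal illustration, not the actual code in build_db.py: the function names (collect_rows, write_cache, read_cache) and the scrapers mapping are hypothetical, and the column list is taken from the cache format described under Cache Management. How the real script treats an award with no cache file is not stated here, so the sketch simply prints a warning and skips it.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Set

# Column layout taken from the "Cache Management" section.
FIELDS = ["prize", "category", "year", "status", "title", "author"]
Row = Dict[str, str]


def write_cache(path: Path, rows: Iterable[Row]) -> None:
    """Write one award's rows to its own CSV file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_cache(path: Path) -> List[Row]:
    """Load one award's rows back from its CSV file."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def collect_rows(
    scrapers: Dict[str, Callable[[], List[Row]]],  # hypothetical: award name -> scraper
    to_scrape: Set[str],
    cache_dir: Path,
) -> List[Row]:
    """Scrape only the requested awards; load every other award from cache."""
    rows: List[Row] = []
    for award, scrape in scrapers.items():
        cache_file = cache_dir / f"{award}.csv"
        if award in to_scrape:
            write_cache(cache_file, scrape())  # refresh this award's cache
        if cache_file.exists():
            rows.extend(read_cache(cache_file))
        else:
            # Assumption: the real script may handle this case differently.
            print(f"warning: no cache for {award}; skipping")
    return rows
```

Reading freshly scraped rows back from the cache file keeps the value types identical (all strings) whether an award was scraped or loaded.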

Usage

List Available Awards

python3 build_db.py --list-awards

This shows all available awards and whether they're cached:

Available awards:
  hugo                      ✓ cached
  nebula                    ✓ cached
  giller                    ✗ not cached
  ...
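
A listing like the one above could be produced by checking whether each award has a file in data/cache/. The helper below is a hypothetical sketch (cache_status_lines is not a name taken from build_db.py), and the column width is a guess at the layout shown.

```python
from pathlib import Path
from typing import Iterable, List


def cache_status_lines(awards: Iterable[str], cache_dir: Path) -> List[str]:
    """Return one status line per award, based on whether <award>.csv exists."""
    lines = ["Available awards:"]
    for award in awards:
        cached = (cache_dir / f"{award}.csv").exists()
        mark = "✓ cached" if cached else "✗ not cached"
        # Left-align the award name so the status marks line up.
        lines.append(f"  {award:<25} {mark}")
    return lines
```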

Scrape Specific Awards

# Scrape only Giller (load everything else from cache)
python3 build_db.py --scrape giller

# Scrape multiple awards
python3 build_db.py --scrape giller,booker,hugo

# Scrape everything (initial cache building)
python3 build_db.py --scrape-all

Load from Cache Only

# Build database from cached data (no scraping)
python3 build_db.py
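
For readers curious how the three modes might be wired together, here is a hedged argparse sketch. The flag names come from this document; everything else (the AWARDS list variable, the awards_to_scrape helper, letting --scrape-all take precedence over --scrape, rejecting unknown names) is an assumption and may differ from the real build_db.py.

```python
import argparse
from typing import List, Optional, Set

# Award keys as listed under "Available Awards".
AWARDS = [
    "hugo", "nebula", "clarke", "booker", "international_booker",
    "locus_scifi", "locus_fantasy", "locus_horror",
    "agatha", "national_book_award", "giller",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="build_db.py")
    parser.add_argument("--list-awards", action="store_true",
                        help="show available awards and their cache status")
    parser.add_argument("--scrape", metavar="AWARDS", default="",
                        help="comma-separated awards to re-scrape")
    parser.add_argument("--scrape-all", action="store_true",
                        help="re-scrape every award")
    return parser


def awards_to_scrape(argv: Optional[List[str]] = None) -> Set[str]:
    """Return the awards to re-scrape; an empty set means cache-only."""
    args = build_parser().parse_args(argv)
    if args.scrape_all:  # assumption: --scrape-all wins if both flags are given
        return set(AWARDS)
    requested = {name.strip() for name in args.scrape.split(",") if name.strip()}
    unknown = requested - set(AWARDS)
    if unknown:
        raise SystemExit(f"unknown award(s): {', '.join(sorted(unknown))}")
    return requested
```

With no flags the helper returns an empty set, which corresponds to the cache-only build described above.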

Examples

First Time Setup

# Cache all awards initially
python3 build_db.py --scrape-all

This takes ~10 minutes but only needs to be done once.

Update Just One Award

# Re-scrape just Giller, keep everything else from cache
python3 build_db.py --scrape giller

This takes ~5 seconds instead of 10 minutes!

Recover from Rate Limiting

If you were rate-limited while scraping Booker and International Booker:

# Wait 30 minutes, then:
python3 build_db.py --scrape booker,international_booker

This only re-scrapes those two awards and loads the other 9 from cache.
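
One way a scraping loop could support this kind of recovery is to keep going after a rate-limit error and report exactly which awards still need a retry. The sketch below is illustrative only: the RateLimited exception, scrape_with_recovery, and retry_command are hypothetical names, and the real script may surface rate limiting differently.

```python
from typing import Callable, Dict, List, Tuple


class RateLimited(Exception):
    """Hypothetical error a scraper raises when the server throttles it."""


def scrape_with_recovery(
    scrapers: Dict[str, Callable[[], list]],
) -> Tuple[Dict[str, list], List[str]]:
    """Run every scraper; collect results and the awards that were throttled."""
    results: Dict[str, list] = {}
    failed: List[str] = []
    for award, scrape in scrapers.items():
        try:
            results[award] = scrape()
        except RateLimited:
            failed.append(award)  # leave the old cache untouched, retry later
    return results, failed


def retry_command(failed: List[str]) -> str:
    """Build the command that re-scrapes only the throttled awards."""
    if not failed:
        return ""
    return f"python3 build_db.py --scrape {','.join(failed)}"
```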

Rebuild Entire Database from Cache

# Just rebuild the combined CSV from cached data
python3 build_db.py

Useful after changing the normalization or deduplication logic, since the rebuild reads only cached files and makes no network requests.
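
A cache-only rebuild might look something like the sketch below. It is not the script's actual logic: rebuild_from_cache and normalize are hypothetical names, and the normalization (trimming whitespace) and deduplication (case-insensitive match on all six columns) are stand-ins chosen for illustration.

```python
import csv
from pathlib import Path
from typing import Dict, List

FIELDS = ["prize", "category", "year", "status", "title", "author"]


def normalize(row: Dict[str, str]) -> Dict[str, str]:
    """Placeholder normalization: keep known columns and trim whitespace."""
    return {field: (row.get(field) or "").strip() for field in FIELDS}


def rebuild_from_cache(cache_dir: Path, out_path: Path) -> int:
    """Combine all cached CSVs into one file; return the number of rows written."""
    seen = set()
    combined: List[Dict[str, str]] = []
    for cache_file in sorted(cache_dir.glob("*.csv")):
        with cache_file.open(newline="", encoding="utf-8") as f:
            for raw in csv.DictReader(f):
                row = normalize(raw)
                # Placeholder dedup key: all columns, compared case-insensitively.
                key = tuple(row[field].casefold() for field in FIELDS)
                if key not in seen:
                    seen.add(key)
                    combined.append(row)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(combined)
    return len(combined)
```

Because only local files are read, the function can be re-run as often as needed while the normalization or deduplication rules are being tuned.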

Available Awards

The following awards can be scraped individually:

Wikidata-based (SPARQL queries):

hugo - Hugo Award for Best Novel
nebula - Nebula Award for Best Novel
clarke - Arthur C. Clarke Award
booker - Booker Prize
international_booker - International Booker Prize
locus_scifi - Locus Award for Best Science Fiction Novel
locus_fantasy - Locus Award for Best Fantasy Novel
locus_horror - Locus Award for Best Horror Novel

Wikipedia-based (HTML scraping):

agatha - Agatha Award (all novel categories)
national_book_award - National Book Award for Fiction
giller - Giller Prize

Benefits

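Speed: Re-scraping a single award takes ~5 seconds versus ~10 minutes for a full scrape (roughly 100x faster)
Efficiency: Don't re-download data you already have
Resilience: Rate limiting only affects the awards being scraped at the time; everything else stays cached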
Flexibility: Easy to update just what's changed

Single Source of Truth

The legacy build script has been removed. Use python3 build_db.py (with optional --scrape or --scrape-all) for all scraping and cache refreshes.

Cache Management

Location: data/cache/*.csv
Format: Same as canonical database (prize, category, year, status, title, author)
Clearing: Delete data/cache/ to force a full rebuild
Per-award: Delete specific files to re-scrape just those awards
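
Deleting cache files by hand works fine; for scripted cleanup, a small helper along these lines could do the same thing. The name clear_cache is hypothetical and not part of build_db.py.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def clear_cache(cache_dir: Path, awards: Optional[Iterable[str]] = None) -> List[str]:
    """Delete cached CSVs: all of them, or only the named awards.

    Returns the award names whose cache files were actually removed.
    """
    if awards is None:
        targets = sorted(cache_dir.glob("*.csv"))
    else:
        targets = [cache_dir / f"{award}.csv" for award in awards]
    removed: List[str] = []
    for path in targets:
        if path.exists():
            path.unlink()
            removed.append(path.stem)
    return removed
```

For example, clear_cache(Path("data/cache"), ["giller"]) would remove only giller.csv, leaving the other awards cached.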
