The scrape_voynich_nu.py script:
- Creates NO temporary files
- Saves all downloads directly to their final locations
- Requires no cleanup after scraping
- Is safe to interrupt (resume-friendly)
data/scraped/
├── q03/
│   ├── f017r_tr.txt   ← Raw EVA format (permanent)
│   ├── f017v_tr.txt
│   └── ...
├── q04/
└── scrape_manifest.json   ← Metadata (permanent)
During Scraping:
- data/scraped/{quire}/*.txt - Downloaded transcriptions
- data/scraped/scrape_manifest.json - Download metadata

For Translation (manual copy):
- data/folios/q03_f*.txt - Working copies in flat structure
- data/folios/metadata.json - Updated with new entries

After Translation:
- data/translations/q03_f*_translation.json - Translation results
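The manual copy step can be scripted. The following is a minimal sketch, assuming scraped files are named `{folio}_tr.txt` inside `data/scraped/{quire}/` and working copies are named `{quire}_{folio}.txt`, as in the paths listed above. The function names `flat_name` and `copy_quire` are hypothetical and not part of the project, and the sketch does not update `data/folios/metadata.json`.

```python
import shutil
from pathlib import Path


def flat_name(quire: str, scraped_file: Path) -> str:
    """Map e.g. ('q03', f017r_tr.txt) to 'q03_f017r.txt' (assumed naming)."""
    folio = scraped_file.stem.removesuffix("_tr")  # 'f017r_tr' -> 'f017r'
    return f"{quire}_{folio}.txt"


def copy_quire(quire: str, scraped_root: Path, folios_dir: Path) -> list[Path]:
    """Copy raw transcriptions for one quire into the flat folios directory."""
    folios_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted((scraped_root / quire).glob("*_tr.txt")):
        dest = folios_dir / flat_name(quire, src)
        shutil.copyfile(src, dest)  # raw copy; content is not parsed or cleaned
        copied.append(dest)
    return copied
```

A call such as `copy_quire("q03", Path("data/scraped"), Path("data/folios"))` would then produce the flat working copies for one quire. It requires Python 3.9 or later because of `str.removesuffix`.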
Do NOT create subdirectories in data/folios/
- Wrong: data/folios/q03/f017r.txt
- Right: data/folios/q03_f017r.txt

Do NOT use parsed/cleaned text files
- The translator needs raw EVA format with headers
- Skip parse_transcriptions.py for the translation workflow
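The flat-layout rule can be checked before running the translator. This is a small sketch using only the standard library; the helper name `find_nested_files` is an illustrative assumption, not project code.

```python
from pathlib import Path


def find_nested_files(folios_dir: Path) -> list[Path]:
    """Return .txt files that sit in subdirectories of folios_dir."""
    return sorted(
        p for p in folios_dir.rglob("*.txt") if p.parent != folios_dir
    )


if __name__ == "__main__":
    # Report every working file the translator would not find.
    for path in find_nested_files(Path("data/folios")):
        print(f"Misplaced: {path}")
```

An empty result means every `.txt` file sits directly in `data/folios/`.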
Keep:
- data/scraped/ - Source files for reference
- data/folios/q03_*.txt - Working files for translation
- data/translations/ - Translation results
- data/folios/metadata.json - Required for translation
Can Delete (if needed to save space):
- data/scraped/{quire}/ - After copying to the folios directory
- Keeping it is still recommended, so that corrupted working files can be re-copied without downloading again
The scraper:
- Skips already-downloaded files
- Can be interrupted and restarted
- Leaves no partial downloads (writes the complete file at once)
- Uses no file locks or temp states
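The resume behavior can be illustrated with a short sketch. This shows the general skip-if-exists pattern, not the actual code of `scrape_voynich_nu.py`; `save_if_missing` and the `fetch` callable are hypothetical names. The assumption is that the full response body is held in memory and written in a single call, so no temporary file is involved.

```python
from pathlib import Path
from typing import Callable


def save_if_missing(dest: Path, fetch: Callable[[], bytes]) -> bool:
    """Download to dest unless it already exists. Returns True if written."""
    if dest.exists():
        return False  # already downloaded on a previous run, so skip it
    data = fetch()  # whole body in memory; an interrupt here leaves no file
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)  # single write of the complete content
    return True
```

A single `write_bytes` call is not strictly atomic, but for transcription files of a few kilobytes the window in which an interrupt could truncate the file is very small.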
# Safe to interrupt and restart
python scrape_voynich_nu.py --quire q03
# Ctrl+C
python scrape_voynich_nu.py --quire q03  # Resumes, skips existing

Recommended to keep data/scraped/ as a cache:
- Faster re-copying if needed
- Reference for original transcriptions
- No need to re-download from voynich.nu
Always maintain flat structure in data/folios/:
# Correct
data/folios/q03_f017r.txt
data/folios/q03_f017v.txt
# Wrong (translator won't find these)
data/folios/q03/f017r.txt

After adding new folios:
# Always update metadata.json
# Use the script from SECTION_EXPANSION_GUIDE.md

Symptom: FileNotFoundError: Folio not found: q03/f017r
Cause: Files in data/folios/q03/ instead of data/folios/
Fix:
cd data/folios
rm -rf q03/ # Remove subdirectory
# Re-copy from scraped with correct naming

Symptom: Total words: 0 or translation errors
Cause: Using cleaned text instead of raw EVA format
Fix:
# Use raw files from data/scraped/, not parse_transcriptions.py output
cp data/scraped/q03/f017r_tr.txt data/folios/q03_f017r.txt

Per quire (approximate):
- Scraped files: ~100-200 KB
- Working files (folios): ~100-200 KB (duplicate)
- Translation results: ~500 KB - 2 MB (JSON with full word list)
Total for all ~16 quires: < 50 MB
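To compare actual usage against these estimates, directory sizes can be summed with a short script. This is a sketch using only the standard library; `dir_size_bytes` is a hypothetical helper name, and the three directory paths are the ones used throughout this document.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Total size in bytes of all regular files under root (0 if missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for name in ("data/scraped", "data/folios", "data/translations"):
        size_mb = dir_size_bytes(Path(name)) / 1_000_000
        print(f"{name}: {size_mb:.2f} MB")
```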
Created: November 27, 2025
Status: Production-ready, no cleanup needed
Maintenance: None required