
Scraper Implementation Notes

✅ File Management

No Temp Files

The scrape_voynich_nu.py script:

  • ✅ Creates NO temporary files
  • ✅ Saves all downloads directly to their final locations
  • ✅ Requires no cleanup after scraping
  • ✅ Is safe to interrupt (resume-friendly)

Output Structure

data/scraped/
├── q03/
│   ├── f017r_tr.txt  ← Raw EVA format (permanent)
│   ├── f017v_tr.txt
│   └── ...
├── q04/
└── scrape_manifest.json  ← Metadata (permanent)

📁 File Flow

What Gets Created

  1. During Scraping

    • data/scraped/{quire}/*.txt - Downloaded transcriptions
    • data/scraped/scrape_manifest.json - Download metadata
  2. For Translation (manual copy)

    • data/folios/q03_f*.txt - Working copies in flat structure
    • data/folios/metadata.json - Updated with new entries
  3. After Translation

    • data/translations/q03_f*_translation.json - Translation results
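
The manual copy in step 2 can be scripted. The following is a minimal sketch, assuming scraped files are named `<folio>_tr.txt` as in the output structure above. The function name `copy_quire_to_folios` is hypothetical and not part of the project's scripts, and the sketch does not update metadata.json.

```python
import shutil
from pathlib import Path


def copy_quire_to_folios(scraped_dir: Path, folios_dir: Path, quire: str) -> list[Path]:
    """Copy <scraped_dir>/<quire>/<folio>_tr.txt to <folios_dir>/<quire>_<folio>.txt.

    Hypothetical helper: the naming rule is taken from the examples in this
    document, not from the project's own code.
    """
    folios_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted((scraped_dir / quire).glob("*_tr.txt")):
        folio = src.name[: -len("_tr.txt")]  # "f017r_tr.txt" -> "f017r"
        dest = folios_dir / f"{quire}_{folio}.txt"  # flat name, no subdirectory
        shutil.copyfile(src, dest)  # raw EVA text is copied unchanged
        copied.append(dest)
    return copied
```

Calling `copy_quire_to_folios(Path("data/scraped"), Path("data/folios"), "q03")` from the project root would produce files such as data/folios/q03_f017r.txt.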

What NOT to Create

❌ Do NOT create subdirectories in data/folios/

  • Wrong: data/folios/q03/f017r.txt
  • Right: data/folios/q03_f017r.txt

❌ Do NOT use parsed/cleaned text files

  • The translator needs the raw EVA format, including headers
  • Skip parse_transcriptions.py for the translation workflow

🗑️ What Can Be Deleted

After Successful Translation

Keep:

  • ✅ data/scraped/ - Source files for reference
  • ✅ data/folios/q03_*.txt - Working files for translation
  • ✅ data/translations/ - Translation results
  • ✅ data/folios/metadata.json - Required for translation

Can Delete (only if needed to save space):

  • data/scraped/{quire}/ - Once its files have been copied to data/folios/
  • Keeping it is still recommended: corrupted working files can then be restored without re-downloading

🔄 Resume-Friendly Design

The scraper is designed so that it:

  • Skips already-downloaded files
  • Can be interrupted and restarted
  • Leaves no partial downloads (each file is written in one operation)
  • Uses no file locks or temporary state

# Safe to interrupt and restart
python scrape_voynich_nu.py --quire q03
# Ctrl+C
python scrape_voynich_nu.py --quire q03  # Resumes, skips existing
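
The skip-and-write-once behaviour described above can be sketched as follows. This illustrates the pattern only and is not the code of scrape_voynich_nu.py: `download_if_missing` and the `fetch` callable are hypothetical names, and a real scraper would pass a function that performs the HTTP request.

```python
from pathlib import Path
from typing import Callable


def download_if_missing(dest: Path, fetch: Callable[[], bytes]) -> bool:
    """Write fetch() to dest unless dest already exists.

    Returns True if the file was downloaded, False if it was skipped.
    Hypothetical helper illustrating the resume pattern.
    """
    if dest.exists():
        return False  # already downloaded on an earlier run
    data = fetch()  # the whole body is held in memory first
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(data)  # one write, straight to the final location
    return True
```

Because nothing is created on disk until `fetch()` has returned, an interruption during the request leaves no file behind, and the next run simply fetches that folio again.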

📝 Best Practices

1. Keep Scraped Files

Keep data/scraped/ as a local cache:

  • Faster re-copying if needed
  • Reference for original transcriptions
  • No need to re-download from voynich.nu

2. Flat Folios Structure

Always maintain flat structure in data/folios/:

# Correct
data/folios/q03_f017r.txt
data/folios/q03_f017v.txt

# Wrong (translator won't find these)
data/folios/q03/f017r.txt
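
A quick way to confirm that the folios directory is still flat is to look for .txt files below the top level. This is a minimal sketch; `find_nested_folios` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def find_nested_folios(folios_dir: Path) -> list[Path]:
    """Return .txt files that sit in a subdirectory of folios_dir."""
    return sorted(
        path
        for path in folios_dir.rglob("*.txt")
        if path.parent != folios_dir  # anything deeper than the top level
    )
```

An empty result from `find_nested_folios(Path("data/folios"))` means every folio file uses the flat naming.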

3. Metadata Sync

After adding new folios:

# Always update metadata.json
# Use the script from SECTION_EXPANSION_GUIDE.md

🐛 Troubleshooting

Issue: Files in subdirectories not found

Symptom: FileNotFoundError: Folio not found: q03/f017r

Cause: Files are in data/folios/q03/ instead of directly in data/folios/

Fix:

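cd data/folios
rm -rf q03/  # Remove the subdirectory (the originals remain in data/scraped/)
# Re-copy from scraped with the flat naming, one file per folio, e.g.:
cp ../scraped/q03/f017r_tr.txt q03_f017r.txt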

Issue: Parsed files not working

Symptom: Total words: 0 or translation errors

Cause: Using cleaned text instead of raw EVA format

Fix:

# Use raw files from data/scraped/, not parse_transcriptions.py output
cp data/scraped/q03/f017r_tr.txt data/folios/q03_f017r.txt
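
Before translating, a rough check can flag files that were cleaned by mistake. The sketch below rests on an assumption that is not confirmed by this document: that raw EVA files contain header or locus lines beginning with `<`, while cleaned text does not. `looks_like_raw_eva` is a hypothetical helper, and the check is only a heuristic.

```python
from pathlib import Path


def looks_like_raw_eva(path: Path) -> bool:
    """Heuristic: True if any line starts with '<'.

    ASSUMPTION: raw transcription files carry '<...>' header/locus tags
    and parsed/cleaned files have had them stripped.
    """
    with path.open(encoding="utf-8", errors="replace") as handle:
        return any(line.lstrip().startswith("<") for line in handle)
```

A file for which this returns False is a candidate for re-copying from data/scraped/.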

📊 Storage Requirements

Per quire (approximate):

  • Scraped files: ~100-200 KB
  • Working files (folios): ~100-200 KB (duplicates of the scraped files)
  • Translation results: ~500 KB - 2 MB (JSON with full word list)

Total for all ~16 quires: < 50 MB
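
The total can be checked against the per-quire figures. This short calculation uses the upper end of each range above and takes 1 MB as 1024 KB.

```python
KB_PER_MB = 1024

# Upper ends of the per-quire ranges listed above, in KB.
scraped_kb = 200
folios_kb = 200
translations_kb = 2 * KB_PER_MB  # 2 MB

quires = 16
total_mb = quires * (scraped_kb + folios_kb + translations_kb) / KB_PER_MB
print(f"Worst case: {total_mb:.2f} MB")  # prints "Worst case: 38.25 MB"
```

Even at the upper bounds the estimate stays below the stated 50 MB.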


Created: November 27, 2025
Status: ✅ Production-ready, no cleanup needed
Maintenance: None required