Skip to content

Latest commit

 

History

History
447 lines (324 loc) · 11.3 KB

File metadata and controls

447 lines (324 loc) · 11.3 KB

Section Expansion Guide

🚀 How to Add New Voynich Manuscript Sections

This guide explains how to expand the translation system to cover new sections of the Voynich Manuscript beyond the initial Herbal A and B sections.

Created: November 27, 2025
Last Updated: Post-Iteration 6
Status: ✅ Fully Automated Pipeline


📚 Overview

The Voynich Manuscript consists of multiple sections/quires:

  • q01-q07: Herbal A (folios f1-f56)
  • q08-q10: Herbal B / Pharmaceutical (folios f57-f84)
  • q11, q13: Pharmaceutical (folios f85-f90)
  • q14: Text-only "Recipes" (folios f103-f116)
  • q15: Astronomical/Astrological (folios f67-f70+89)
  • q20: Stars/Cosmological (circular diagrams)

Source: All transcriptions available at voynich.nu


🛠️ Automated Pipeline (3 Steps)

Step 1: Scrape Transcriptions

Use scrape_voynich_nu.py to download EVA transcriptions:

# Scrape a single quire
python scrape_voynich_nu.py --quire q03

# Scrape multiple quires
python scrape_voynich_nu.py --quire q04 q05 q06 q07

# Scrape all available quires
python scrape_voynich_nu.py --all

# List available quires first
python scrape_voynich_nu.py --list-quires

What it does:

  • Connects to voynich.nu
  • Finds all _tr.txt transliteration files for the quire
  • Downloads to data/scraped/{quire}/
  • Creates a manifest file
  • No temp files created - all outputs saved permanently

Output: data/scraped/q03/f017r_tr.txt, f017v_tr.txt, etc.


Step 2: Copy to Folios Directory

The scraped files (raw EVA format) need to be copied to the main folios directory with proper naming:

# Copy all q03 files with correct naming
cd data/scraped/q03
for file in *_tr.txt; do 
    folio=$(basename "$file" _tr.txt)
    cp "$file" "../../folios/q03_$folio.txt"
done

# Verify
ls -1 ../../folios/q03_*.txt | wc -l

Important:

  • ✅ Use raw EVA files (with headers and comments)
  • Do NOT use parsed/cleaned files from parse_transcriptions.py
  • The translator handles EVA format internally

Naming Convention:

  • Source: f017r_tr.txt
  • Destination: q03_f017r.txt (in main folios directory, not subdirectory)

Step 3: Update Metadata

Add entries to data/folios/metadata.json for each new folio:

import json
from pathlib import Path
from datetime import datetime

# Load existing metadata
with open('data/folios/metadata.json') as f:
    metadata = json.load(f)

# Add new quire folios
for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
    folio_id = folio_file.stem.replace('q03_f', '')
    key = f'q03_f{folio_id}'
    
    # Count words (from EVA transcription)
    lines = folio_file.read_text().splitlines()
    word_count = sum(len(line.split()) for line in lines 
                     if line.strip() and not line.startswith('#'))
    
    metadata[key] = {
        'folio_id': folio_id,
        'section': 'Herbal A',  # Adjust based on quire
        'word_count': word_count,
        'downloaded_at': datetime.now().isoformat(),
        'file': str(folio_file)
    }

# Save
with open('data/folios/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

Section Mapping:

  • q01-q07: "Herbal A"
  • q08-q10: "Herbal B"
  • q11, q13: "Pharmaceutical"
  • q14: "Recipes"
  • q15: "Astronomical"
  • q20: "Cosmological"

✅ Testing New Sections

Translate a Single Folio

python translate_folio.py --section q03 --folio 017r --context herbal

Batch Translate Entire Quire

python translate_folio.py --section q03 --start 17 --end 24 --context herbal

Expected Results

First translation on new sections:

  • Coverage: 40-50% (using existing 748-word dictionary)
  • Unknown words: High (expected - new vocabulary)
  • Confidence: 0.35-0.45 average

After vocabulary iteration:

  • Coverage improves to 55-65% range
  • Pattern similar to Herbal A progression

🎯 Practical Example: Adding Q03

Complete Workflow

# 1. Scrape q03
python scrape_voynich_nu.py --quire q03 --output-dir data/scraped

# 2. Copy raw EVA files to folios directory (NOT subdirectory!)
cd data/scraped/q03
for file in *_tr.txt; do 
    folio=$(basename "$file" _tr.txt)
    cp "$file" "../../folios/q03_$folio.txt"
done
cd ../../..

# Verify files are in the right place
ls -1 data/folios/q03_f*.txt | head -3
# Expected: data/folios/q03_f017r.txt (NOT data/folios/q03/f017r.txt)

# 3. Update metadata (use Python script from Step 3 above)
python -c "
import json
from pathlib import Path
from datetime import datetime

with open('data/folios/metadata.json') as f:
    metadata = json.load(f)

for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
    folio_id = folio_file.stem.replace('q03_f', '')
    key = f'q03_f{folio_id}'
    
    lines = folio_file.read_text().splitlines()
    word_count = sum(len(line.split()) for line in lines 
                     if line.strip() and not line.startswith('#'))
    
    metadata[key] = {
        'folio_id': folio_id,
        'section': 'Herbal A',
        'word_count': word_count,
        'downloaded_at': datetime.now().isoformat(),
        'file': str(folio_file)
    }

with open('data/folios/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print(f'✅ Added {len([k for k in metadata if k.startswith(\"q03\")])} q03 folios')
"

# 4. Test translation
python translate_folio.py --section q03 --folio 017r --context herbal

# 5. View results
python translate_folio.py --section q03 --show 017r

Result: ✅ Successfully translated q03_f017r with 46.7% coverage!

Cleanup:

  • ✅ No temp files created
  • ✅ Scraped files remain in data/scraped/ for reference
  • ✅ Working files in data/folios/q03_*.txt (flat structure)

📦 Available Tools

1. scrape_voynich_nu.py

  • Purpose: Download EVA transcriptions
  • Source: https://www.voynich.nu
  • Dependencies: beautifulsoup4, requests
  • Features:
    • Automatic quire discovery
    • Polite rate limiting (0.5s delay)
    • Skip already-downloaded files
    • Progress tracking
    • Manifest generation

2. parse_transcriptions.py (Optional - Not Needed for Translation)

  • Purpose: Parse EVA format into clean text
  • ⚠️ Skip this tool - Not used in translation workflow
  • Useful for: Text analysis, linguistic research, standalone corpus
  • Note: The translator (translator.py) handles EVA parsing internally

3. translate_folio.py

  • Purpose: Translate folios using dictionary
  • Handles: EVA format parsing automatically
  • Output: JSON with Latin and English translations

🗺️ Expansion Roadmap

Priority 1: Complete Herbal Sections (Recommended Next)

# Quires 3-7 (Herbal A continuation)
python scrape_voynich_nu.py --quire q03 q04 q05 q06 q07

# Quires 8-10 (Herbal B / Pharmaceutical)
python scrape_voynich_nu.py --quire q08 q09 q10

Why: Similar botanical content to existing dictionary

Expected Effort:

  • Scraping: 5 minutes per quire
  • Translation: 10 minutes per quire
  • Vocabulary iteration: 1-2 hours per major section group

Priority 2: Pharmaceutical Sections

python scrape_voynich_nu.py --quire q11 q13

Why: Related to herbal medicine

Challenge: May introduce new vocabulary domain

Priority 3: Astronomical/Cosmological

python scrape_voynich_nu.py --quire q15 q20

Why: Completely different subject matter

Challenge:

  • New vocabulary domain
  • Diagram-heavy (less pure text)
  • May require specialized dictionary

Priority 4: Recipes Section

python scrape_voynich_nu.py --quire q14

Why: Text-only section (no illustrations)

Challenge: Dense text, possibly different language style


🔄 Iterative Vocabulary Extension

When adding new sections with low coverage:

  1. Translate new section
  2. Analyze unknown words (use analyze_gaps.py)
  3. Identify high-frequency unknowns
  4. Run iteration workflow (see WORKFLOW_INSTRUCTIONS.md)
  5. Re-translate with expanded dictionary
  6. Repeat until coverage plateaus

Expected Coverage Growth:

  • Initial: 40-50% (with existing dictionary)
  • After 1-2 iterations: 55-65%
  • Plateau: 60-70% (systematic vocabulary complete)

🚨 Important Notes

File Format

  • Use raw EVA transcription files (with headers and metadata)
  • Do NOT use cleaned/parsed files for translation
  • The translator (translator.py) handles EVA parsing internally

Metadata Fields

Required in metadata.json:

{
  "q03_f017r": {
    "folio_id": "017r",
    "section": "Herbal A",
    "word_count": 274,
    "downloaded_at": "2025-11-27T21:45:00",
    "file": "data/folios/q03_f017r.txt"
  }
}

Section Context

For translation, specify appropriate context:

  • herbal: Botanical sections (q01-q10)
  • pharmaceutical: Medicine sections (q11, q13)
  • astronomical: Star charts (q15, q20)
  • recipes: Text-only section (q14)

Context affects polysemy resolution in the dictionary.


📊 Success Metrics

After Adding New Section

Immediate (before iteration):

  • ✅ Files downloaded and copied
  • ✅ Metadata updated
  • ✅ Translation runs without errors
  • ✅ Coverage: 40-50% typical

After 1-2 Iterations:

  • ✅ Coverage: 55-65%
  • ✅ High-frequency unknowns resolved
  • ✅ Morphological patterns discovered

Project-wide:

  • ✅ Dictionary grows by 20-40 words per section
  • ✅ Overall system coverage improves
  • ✅ Cross-section consistency maintained

💡 Pro Tips

1. Batch Processing

Process quires in logical groups:

  • All Herbal A (q01-q07) together
  • All Herbal B (q08-q10) together
  • This allows vocabulary learning across related sections

2. Validate Before Iteration

python scripts/validation_checker.py --check-type all

3. Track Unknown Words

python scripts/word_frequency.py --min-freq 3 --section q03

4. Compare Sections

Use the same folios across quires to track consistency:

  • f1r, f1v (first page pattern)
  • Last pages (conclusion patterns)

🔍 Troubleshooting

Issue: "Folio not found"

Solution: Ensure metadata.json has entry for the folio

Issue: "Total words: 0"

Solution: File might be in wrong format - use raw EVA files

Issue: "Very low coverage (< 30%)"

Solution: Expected for new sections - run vocabulary iteration

Issue: "Translation errors on special folios"

Solution: Some folios (like circular diagrams) may have non-standard format


📝 Next Steps After Expansion

  1. Run gap analysis on new sections
  2. Identify section-specific vocabulary
  3. Run iteration to extend dictionary
  4. Update section mapping in voynich.yaml polysemy
  5. Document new patterns discovered

🎯 Goal

Ultimate Objective: Cover all available Voynich Manuscript transcriptions with systematic translation

Current Progress:

  • ✅ Q01-Q02: Complete (22 folios, 60.4% avg coverage)
  • ✅ Q03: Validated (16 folios, testing complete)
  • ⏳ Q04-Q20: Available for expansion

Estimated Total: ~240 folios across all quires


Status:Pipeline Operational
Ready for: Large-scale section expansion
Automation Level: High (3-step process)


Guide created: November 27, 2025
Successfully validated with Q03 expansion
Tools: scrape_voynich_nu.py, translate_folio.py