Section Expansion Guide

🚀 How to Add New Voynich Manuscript Sections

This guide explains how to expand the translation system to cover new sections of the Voynich Manuscript beyond the initial Herbal A and B sections.

Created: November 27, 2025
Last Updated: Post-Iteration 6
Status: ✅ Fully Automated Pipeline

📚 Overview

The Voynich Manuscript consists of multiple sections/quires:

q01-q07: Herbal A (folios f1-f56)
q08-q10: Herbal B / Pharmaceutical (folios f57-f84)
q11, q13: Pharmaceutical (folios f85-f90)
q14: Text-only "Recipes" (folios f103-f116)
q15: Astronomical/Astrological (folios f67-f70+89)
q20: Stars/Cosmological (circular diagrams)

Source: All transcriptions available at voynich.nu

🛠️ Automated Pipeline (3 Steps)

Step 1: Scrape Transcriptions

Use scrape_voynich_nu.py to download EVA transcriptions:

# Scrape a single quire
python scrape_voynich_nu.py --quire q03

# Scrape multiple quires
python scrape_voynich_nu.py --quire q04 q05 q06 q07

# Scrape all available quires
python scrape_voynich_nu.py --all

# List available quires first
python scrape_voynich_nu.py --list-quires

What it does:

Connects to voynich.nu
Finds all _tr.txt transliteration files for the quire
Downloads to data/scraped/{quire}/
Creates a manifest file
✅ No temp files created - all outputs saved permanently

Output: data/scraped/q03/f017r_tr.txt, f017v_tr.txt, etc.

Step 2: Copy to Folios Directory

The scraped files (raw EVA format) need to be copied to the main folios directory with proper naming:

# Copy all q03 files with correct naming
cd data/scraped/q03
for file in *_tr.txt; do 
    folio=$(basename "$file" _tr.txt)
    cp "$file" "../../folios/q03_$folio.txt"
done

# Verify
ls -1 ../../folios/q03_*.txt | wc -l

Important:

✅ Use raw EVA files (with headers and comments)
❌ Do NOT use parsed/cleaned files from parse_transcriptions.py
The translator handles EVA format internally

Naming Convention:

Source: f017r_tr.txt
Destination: q03_f017r.txt (in main folios directory, not subdirectory)

Step 3: Update Metadata

Add entries to data/folios/metadata.json for each new folio:

import json
from pathlib import Path
from datetime import datetime

# Load existing metadata
with open('data/folios/metadata.json') as f:
    metadata = json.load(f)

# Add new quire folios
for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
    folio_id = folio_file.stem.replace('q03_f', '')
    key = f'q03_f{folio_id}'
    
    # Count words (from EVA transcription)
    lines = folio_file.read_text().splitlines()
    word_count = sum(len(line.split()) for line in lines 
                     if line.strip() and not line.startswith('#'))
    
    metadata[key] = {
        'folio_id': folio_id,
        'section': 'Herbal A',  # Adjust based on quire
        'word_count': word_count,
        'downloaded_at': datetime.now().isoformat(),
        'file': str(folio_file)
    }

# Save
with open('data/folios/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

Section Mapping:

q01-q07: "Herbal A"
q08-q10: "Herbal B"
q11, q13: "Pharmaceutical"
q14: "Recipes"
q15: "Astronomical"
q20: "Cosmological"

✅ Testing New Sections

Translate a Single Folio

python translate_folio.py --section q03 --folio 017r --context herbal

Batch Translate Entire Quire

python translate_folio.py --section q03 --start 17 --end 24 --context herbal

Expected Results

First translation on new sections:

Coverage: 40-50% (using existing 748-word dictionary)
Unknown words: High (expected - new vocabulary)
Confidence: 0.35-0.45 average

After vocabulary iteration:

Coverage improves to 55-65% range
Pattern similar to Herbal A progression

🎯 Practical Example: Adding Q03

Complete Workflow

# 1. Scrape q03
python scrape_voynich_nu.py --quire q03 --output-dir data/scraped

# 2. Copy raw EVA files to folios directory (NOT subdirectory!)
cd data/scraped/q03
for file in *_tr.txt; do 
    folio=$(basename "$file" _tr.txt)
    cp "$file" "../../folios/q03_$folio.txt"
done
cd ../../..

# Verify files are in the right place
ls -1 data/folios/q03_f*.txt | head -3
# Expected: data/folios/q03_f017r.txt (NOT data/folios/q03/f017r.txt)

# 3. Update metadata (use Python script from Step 3 above)
python -c "
import json
from pathlib import Path
from datetime import datetime

with open('data/folios/metadata.json') as f:
    metadata = json.load(f)

for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
    folio_id = folio_file.stem.replace('q03_f', '')
    key = f'q03_f{folio_id}'
    
    lines = folio_file.read_text().splitlines()
    word_count = sum(len(line.split()) for line in lines 
                     if line.strip() and not line.startswith('#'))
    
    metadata[key] = {
        'folio_id': folio_id,
        'section': 'Herbal A',
        'word_count': word_count,
        'downloaded_at': datetime.now().isoformat(),
        'file': str(folio_file)
    }

with open('data/folios/metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)
print(f'✅ Added {len([k for k in metadata if k.startswith(\"q03\")])} q03 folios')
"

# 4. Test translation
python translate_folio.py --section q03 --folio 017r --context herbal

# 5. View results
python translate_folio.py --section q03 --show 017r

Result: ✅ Successfully translated q03_f017r with 46.7% coverage!

Cleanup:

✅ No temp files created
✅ Scraped files remain in data/scraped/ for reference
✅ Working files in data/folios/q03_*.txt (flat structure)

📦 Available Tools

1. scrape_voynich_nu.py

Purpose: Download EVA transcriptions
Source: https://www.voynich.nu
Dependencies: beautifulsoup4, requests
Features:
- Automatic quire discovery
- Polite rate limiting (0.5s delay)
- Skip already-downloaded files
- Progress tracking
- Manifest generation

2. parse_transcriptions.py (Optional - Not Needed for Translation)

Purpose: Parse EVA format into clean text
⚠️ Skip this tool - Not used in translation workflow
Useful for: Text analysis, linguistic research, standalone corpus
Note: The translator (translator.py) handles EVA parsing internally

3. translate_folio.py

Purpose: Translate folios using dictionary
Handles: EVA format parsing automatically
Output: JSON with Latin and English translations

🗺️ Expansion Roadmap

Priority 1: Complete Herbal Sections (Recommended Next)

# Quires 3-7 (Herbal A continuation)
python scrape_voynich_nu.py --quire q03 q04 q05 q06 q07

# Quires 8-10 (Herbal B / Pharmaceutical)
python scrape_voynich_nu.py --quire q08 q09 q10

Why: Similar botanical content to existing dictionary

Expected Effort:

Scraping: 5 minutes per quire
Translation: 10 minutes per quire
Vocabulary iteration: 1-2 hours per major section group

Priority 2: Pharmaceutical Sections

python scrape_voynich_nu.py --quire q11 q13

Why: Related to herbal medicine

Challenge: May introduce new vocabulary domain

Priority 3: Astronomical/Cosmological

python scrape_voynich_nu.py --quire q15 q20

Why: Completely different subject matter

Challenge:

New vocabulary domain
Diagram-heavy (less pure text)
May require specialized dictionary

Priority 4: Recipes Section

python scrape_voynich_nu.py --quire q14

Why: Text-only section (no illustrations)

Challenge: Dense text, possibly different language style

🔄 Iterative Vocabulary Extension

When adding new sections with low coverage:

Translate new section
Analyze unknown words (use analyze_gaps.py)
Identify high-frequency unknowns
Run iteration workflow (see WORKFLOW_INSTRUCTIONS.md)
Re-translate with expanded dictionary
Repeat until coverage plateaus

Expected Coverage Growth:

Initial: 40-50% (with existing dictionary)
After 1-2 iterations: 55-65%
Plateau: 60-70% (systematic vocabulary complete)

🚨 Important Notes

File Format

Use raw EVA transcription files (with headers and metadata)
Do NOT use cleaned/parsed files for translation
The translator (translator.py) handles EVA parsing internally

Metadata Fields

Required in metadata.json:

{
  "q03_f017r": {
    "folio_id": "017r",
    "section": "Herbal A",
    "word_count": 274,
    "downloaded_at": "2025-11-27T21:45:00",
    "file": "data/folios/q03_f017r.txt"
  }
}

Section Context

For translation, specify appropriate context:

herbal: Botanical sections (q01-q10)
pharmaceutical: Medicine sections (q11, q13)
astronomical: Star charts (q15, q20)
recipes: Text-only section (q14)

Context affects polysemy resolution in the dictionary.

📊 Success Metrics

After Adding New Section

Immediate (before iteration):

✅ Files downloaded and copied
✅ Metadata updated
✅ Translation runs without errors
✅ Coverage: 40-50% typical

After 1-2 Iterations:

✅ Coverage: 55-65%
✅ High-frequency unknowns resolved
✅ Morphological patterns discovered

Project-wide:

✅ Dictionary grows by 20-40 words per section
✅ Overall system coverage improves
✅ Cross-section consistency maintained

💡 Pro Tips

1. Batch Processing

Process quires in logical groups:

All Herbal A (q01-q07) together
All Herbal B (q08-q10) together
This allows vocabulary learning across related sections

2. Validate Before Iteration

python scripts/validation_checker.py --check-type all

3. Track Unknown Words

python scripts/word_frequency.py --min-freq 3 --section q03

4. Compare Sections

Use the same folios across quires to track consistency:

f1r, f1v (first page pattern)
Last pages (conclusion patterns)

🔍 Troubleshooting

Issue: "Folio not found"

Solution: Ensure metadata.json has entry for the folio

Issue: "Total words: 0"

Solution: File might be in wrong format - use raw EVA files

Issue: "Very low coverage (< 30%)"

Solution: Expected for new sections - run vocabulary iteration

Issue: "Translation errors on special folios"

Solution: Some folios (like circular diagrams) may have non-standard format

📝 Next Steps After Expansion

Run gap analysis on new sections
Identify section-specific vocabulary
Run iteration to extend dictionary
Update section mapping in voynich.yaml polysemy
Document new patterns discovered

🎯 Goal

Ultimate Objective: Cover all available Voynich Manuscript transcriptions with systematic translation

Current Progress:

✅ Q01-Q02: Complete (22 folios, 60.4% avg coverage)
✅ Q03: Validated (16 folios, testing complete)
⏳ Q04-Q20: Available for expansion

Estimated Total: ~240 folios across all quires

Status: ✅ Pipeline Operational
Ready for: Large-scale section expansion
Automation Level: High (3-step process)

Guide created: November 27, 2025
Successfully validated with Q03 expansion
Tools: scrape_voynich_nu.py, translate_folio.py

FilesExpand file tree

SECTION_EXPANSION_GUIDE.md

Latest commit

History