This guide explains how to expand the translation system to cover new sections of the Voynich Manuscript beyond the initial Herbal A and B sections.
Created: November 27, 2025
Last Updated: Post-Iteration 6
Status: ✅ Fully Automated Pipeline
The Voynich Manuscript consists of multiple sections/quires:
- q01-q07: Herbal A (folios f1-f56)
- q08-q10: Herbal B / Pharmaceutical (folios f57-f84)
- q11, q13: Pharmaceutical (folios f85-f90)
- q14: Text-only "Recipes" (folios f103-f116)
- q15: Astronomical/Astrological (folios f67-f70+89)
- q20: Stars/Cosmological (circular diagrams)
Source: All transcriptions available at voynich.nu
Use scrape_voynich_nu.py to download EVA transcriptions:
# Scrape a single quire
python scrape_voynich_nu.py --quire q03
# Scrape multiple quires
python scrape_voynich_nu.py --quire q04 q05 q06 q07
# Scrape all available quires
python scrape_voynich_nu.py --all
# List available quires first
python scrape_voynich_nu.py --list-quiresWhat it does:
- Connects to voynich.nu
- Finds all
_tr.txttransliteration files for the quire - Downloads to
data/scraped/{quire}/ - Creates a manifest file
- ✅ No temp files created - all outputs saved permanently
Output: data/scraped/q03/f017r_tr.txt, f017v_tr.txt, etc.
The scraped files (raw EVA format) need to be copied to the main folios directory with proper naming:
# Copy all q03 files with correct naming
cd data/scraped/q03
for file in *_tr.txt; do
folio=$(basename "$file" _tr.txt)
cp "$file" "../../folios/q03_$folio.txt"
done
# Verify
ls -1 ../../folios/q03_*.txt | wc -lImportant:
- ✅ Use raw EVA files (with headers and comments)
- ❌ Do NOT use parsed/cleaned files from
parse_transcriptions.py - The translator handles EVA format internally
Naming Convention:
- Source:
f017r_tr.txt - Destination:
q03_f017r.txt(in main folios directory, not subdirectory)
Add entries to data/folios/metadata.json for each new folio:
import json
from pathlib import Path
from datetime import datetime
# Load existing metadata
with open('data/folios/metadata.json') as f:
metadata = json.load(f)
# Add new quire folios
for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
folio_id = folio_file.stem.replace('q03_f', '')
key = f'q03_f{folio_id}'
# Count words (from EVA transcription)
lines = folio_file.read_text().splitlines()
word_count = sum(len(line.split()) for line in lines
if line.strip() and not line.startswith('#'))
metadata[key] = {
'folio_id': folio_id,
'section': 'Herbal A', # Adjust based on quire
'word_count': word_count,
'downloaded_at': datetime.now().isoformat(),
'file': str(folio_file)
}
# Save
with open('data/folios/metadata.json', 'w') as f:
json.dump(metadata, f, indent=2)Section Mapping:
- q01-q07: "Herbal A"
- q08-q10: "Herbal B"
- q11, q13: "Pharmaceutical"
- q14: "Recipes"
- q15: "Astronomical"
- q20: "Cosmological"
python translate_folio.py --section q03 --folio 017r --context herbalpython translate_folio.py --section q03 --start 17 --end 24 --context herbalFirst translation on new sections:
- Coverage: 40-50% (using existing 748-word dictionary)
- Unknown words: High (expected - new vocabulary)
- Confidence: 0.35-0.45 average
After vocabulary iteration:
- Coverage improves to 55-65% range
- Pattern similar to Herbal A progression
# 1. Scrape q03
python scrape_voynich_nu.py --quire q03 --output-dir data/scraped
# 2. Copy raw EVA files to folios directory (NOT subdirectory!)
cd data/scraped/q03
for file in *_tr.txt; do
folio=$(basename "$file" _tr.txt)
cp "$file" "../../folios/q03_$folio.txt"
done
cd ../../..
# Verify files are in the right place
ls -1 data/folios/q03_f*.txt | head -3
# Expected: data/folios/q03_f017r.txt (NOT data/folios/q03/f017r.txt)
# 3. Update metadata (use Python script from Step 3 above)
python -c "
import json
from pathlib import Path
from datetime import datetime
with open('data/folios/metadata.json') as f:
metadata = json.load(f)
for folio_file in sorted(Path('data/folios').glob('q03_f*.txt')):
folio_id = folio_file.stem.replace('q03_f', '')
key = f'q03_f{folio_id}'
lines = folio_file.read_text().splitlines()
word_count = sum(len(line.split()) for line in lines
if line.strip() and not line.startswith('#'))
metadata[key] = {
'folio_id': folio_id,
'section': 'Herbal A',
'word_count': word_count,
'downloaded_at': datetime.now().isoformat(),
'file': str(folio_file)
}
with open('data/folios/metadata.json', 'w') as f:
json.dump(metadata, f, indent=2)
print(f'✅ Added {len([k for k in metadata if k.startswith(\"q03\")])} q03 folios')
"
# 4. Test translation
python translate_folio.py --section q03 --folio 017r --context herbal
# 5. View results
python translate_folio.py --section q03 --show 017rResult: ✅ Successfully translated q03_f017r with 46.7% coverage!
Cleanup:
- ✅ No temp files created
- ✅ Scraped files remain in
data/scraped/for reference - ✅ Working files in
data/folios/q03_*.txt(flat structure)
- Purpose: Download EVA transcriptions
- Source: https://www.voynich.nu
- Dependencies:
beautifulsoup4,requests - Features:
- Automatic quire discovery
- Polite rate limiting (0.5s delay)
- Skip already-downloaded files
- Progress tracking
- Manifest generation
- Purpose: Parse EVA format into clean text
⚠️ Skip this tool - Not used in translation workflow- Useful for: Text analysis, linguistic research, standalone corpus
- Note: The translator (
translator.py) handles EVA parsing internally
- Purpose: Translate folios using dictionary
- Handles: EVA format parsing automatically
- Output: JSON with Latin and English translations
# Quires 3-7 (Herbal A continuation)
python scrape_voynich_nu.py --quire q03 q04 q05 q06 q07
# Quires 8-10 (Herbal B / Pharmaceutical)
python scrape_voynich_nu.py --quire q08 q09 q10Why: Similar botanical content to existing dictionary
Expected Effort:
- Scraping: 5 minutes per quire
- Translation: 10 minutes per quire
- Vocabulary iteration: 1-2 hours per major section group
python scrape_voynich_nu.py --quire q11 q13Why: Related to herbal medicine
Challenge: May introduce new vocabulary domain
python scrape_voynich_nu.py --quire q15 q20Why: Completely different subject matter
Challenge:
- New vocabulary domain
- Diagram-heavy (less pure text)
- May require specialized dictionary
python scrape_voynich_nu.py --quire q14Why: Text-only section (no illustrations)
Challenge: Dense text, possibly different language style
When adding new sections with low coverage:
- Translate new section
- Analyze unknown words (use
analyze_gaps.py) - Identify high-frequency unknowns
- Run iteration workflow (see WORKFLOW_INSTRUCTIONS.md)
- Re-translate with expanded dictionary
- Repeat until coverage plateaus
Expected Coverage Growth:
- Initial: 40-50% (with existing dictionary)
- After 1-2 iterations: 55-65%
- Plateau: 60-70% (systematic vocabulary complete)
- Use raw EVA transcription files (with headers and metadata)
- Do NOT use cleaned/parsed files for translation
- The translator (
translator.py) handles EVA parsing internally
Required in metadata.json:
{
"q03_f017r": {
"folio_id": "017r",
"section": "Herbal A",
"word_count": 274,
"downloaded_at": "2025-11-27T21:45:00",
"file": "data/folios/q03_f017r.txt"
}
}For translation, specify appropriate context:
- herbal: Botanical sections (q01-q10)
- pharmaceutical: Medicine sections (q11, q13)
- astronomical: Star charts (q15, q20)
- recipes: Text-only section (q14)
Context affects polysemy resolution in the dictionary.
Immediate (before iteration):
- ✅ Files downloaded and copied
- ✅ Metadata updated
- ✅ Translation runs without errors
- ✅ Coverage: 40-50% typical
After 1-2 Iterations:
- ✅ Coverage: 55-65%
- ✅ High-frequency unknowns resolved
- ✅ Morphological patterns discovered
Project-wide:
- ✅ Dictionary grows by 20-40 words per section
- ✅ Overall system coverage improves
- ✅ Cross-section consistency maintained
Process quires in logical groups:
- All Herbal A (q01-q07) together
- All Herbal B (q08-q10) together
- This allows vocabulary learning across related sections
python scripts/validation_checker.py --check-type allpython scripts/word_frequency.py --min-freq 3 --section q03Use the same folios across quires to track consistency:
- f1r, f1v (first page pattern)
- Last pages (conclusion patterns)
Solution: Ensure metadata.json has entry for the folio
Solution: File might be in wrong format - use raw EVA files
Solution: Expected for new sections - run vocabulary iteration
Solution: Some folios (like circular diagrams) may have non-standard format
- Run gap analysis on new sections
- Identify section-specific vocabulary
- Run iteration to extend dictionary
- Update section mapping in
voynich.yamlpolysemy - Document new patterns discovered
Ultimate Objective: Cover all available Voynich Manuscript transcriptions with systematic translation
Current Progress:
- ✅ Q01-Q02: Complete (22 folios, 60.4% avg coverage)
- ✅ Q03: Validated (16 folios, testing complete)
- ⏳ Q04-Q20: Available for expansion
Estimated Total: ~240 folios across all quires
Status: ✅ Pipeline Operational
Ready for: Large-scale section expansion
Automation Level: High (3-step process)
Guide created: November 27, 2025
Successfully validated with Q03 expansion
Tools: scrape_voynich_nu.py, translate_folio.py