Voynich Translation System Architecture

Technical Overview and System Design

This document provides a comprehensive technical overview of the Voynich manuscript translation system, including its architecture, components, and underlying methodology.

🎯 System Overview

The Voynich Translation System is a deterministic, automated pipeline for translating Voynichese (the unknown script of the Voynich Manuscript) into Latin and English using a rule-based approach with polysemy support and morphological analysis.

System Flow

voynich.nu → Download → Parse → Translate → Analyze → Update → Iterate
     ↓          ↓         ↓         ↓          ↓         ↓        ↓
   Source   Folio     Cleaned   Latin +    Gap      Dictionary  Loop
            Text      Words    English  Analysis    Updates

Key Capabilities

✅ Automated folio downloading from voynich.nu
✅ Deterministic translation using rule-based dictionary
✅ Context-aware polysemy (same word, different meanings by section)
✅ Morphological processing (prefixes, suffixes, compounds)
✅ Dual-language output (Latin + English)
✅ Gap analysis and vocabulary extension
✅ Coverage metrics and confidence scoring

🏗️ System Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Voynich Translation System               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐       ┌──────────────┐                  │
│  │   Download   │──────▶│    Parser    │                  │
│  │  Folios      │       │   (EVA)      │                  │
│  └──────────────┘       └──────────────┘                  │
│                                │                            │
│                                ▼                            │
│                       ┌──────────────┐                     │
│                       │  Translator  │◀─── voynich.yaml   │
│                       │   Engine     │                     │
│                       └──────────────┘                     │
│                                │                            │
│                    ┌───────────┴───────────┐              │
│                    ▼                       ▼              │
│           ┌──────────────┐       ┌──────────────┐        │
│           │    Latin     │       │   English    │        │
│           │  Translation │       │  Translation │        │
│           └──────────────┘       └──────────────┘        │
│                    │                       │              │
│                    └───────────┬───────────┘              │
│                                ▼                          │
│                       ┌──────────────┐                    │
│                       │  Gap         │                    │
│                       │  Analysis    │                    │
│                       └──────────────┘                    │
│                                │                          │
│                                ▼                          │
│                       ┌──────────────┐                    │
│                       │  Dictionary  │                    │
│                       │  Updates     │                    │
│                       └──────────────┘                    │
│                                                           │
└───────────────────────────────────────────────────────────┘

Core Components

1. Folio Downloader (`download_folios.py`)

Downloads transcriptions from voynich.nu
Caches locally to avoid redundant downloads
Parses EVA (European Voynich Alphabet) format
Cleans transcription artifacts
Tracks metadata (section, word count)

2. Translation Engine (`translator.py`)

Core deterministic translation logic
Loads dictionary from voynich.yaml
Processes morphological rules
Resolves polysemy based on context
Handles repetition patterns
Tracks confidence scores
Maintains unknown word lists

3. CLI Translator (`translate_folio.py`)

Command-line interface for translations
Single folio or batch mode
Auto-detects context from section
Saves results as JSON
Displays coverage statistics
Force re-translation option

4. Gap Analyzer (`analyze_gaps.py`)

Analyzes unknown words across translations
Ranks by frequency and priority
Identifies structural patterns (prefixes, suffixes)
Suggests dictionary entries with reasoning
Exports suggestions as JSON

5. Helper Scripts (in `scripts/` directory)

word_frequency.py: Frequency analysis tool
morphology_analyzer.py: Morphological decomposition
pattern_detector.py: Pattern and formula detection
compound_decomposer.py: Compound word analysis
batch_dictionary_updater.py: Dictionary management
validation_checker.py: Dictionary integrity checks
iteration_orchestrator.py: Full workflow automation

📊 Data Flow

Download Phase

1. User requests folio (e.g., q02_f014v)
2. download_folios.py checks cache
3. If not cached:
   - Fetch from voynich.nu
   - Parse EVA transcription
   - Clean artifacts (!, *, {}, etc.)
   - Save to data/folios/
4. Update metadata.json

Translation Phase

1. Load folio from cache
2. Extract word sequence
3. For each word:
   a. Preprocess (remove nulls)
   b. Check dictionary
   c. Handle morphology (prefixes/suffixes)
   d. Resolve polysemy (context-based)
   e. Assign confidence score
4. Generate Latin text
5. Translate Latin → English
6. Calculate statistics
7. Save to data/translations/

Analysis Phase

1. Load all translation files
2. Collect unknown words
3. Count frequencies
4. Analyze word structure:
   - Identify potential prefixes
   - Identify potential suffixes
   - Find embedded known roots
5. Calculate priority scores
6. Generate suggestions
7. Export to dictionary_suggestions.json

🗂️ File Structure

voynich/
├── Core Scripts
│   ├── download_folios.py       # Folio downloader
│   ├── translator.py            # Translation engine
│   ├── translate_folio.py       # CLI interface
│   ├── analyze_gaps.py          # Gap analyzer
│   └── review_and_update.py     # Dictionary updater
│
├── Helper Scripts
│   └── scripts/
│       ├── word_frequency.py    # Frequency analysis
│       ├── morphology_analyzer.py  # Morphological decomposition
│       ├── pattern_detector.py  # Pattern detection
│       ├── compound_decomposer.py  # Compound analysis
│       ├── batch_dictionary_updater.py  # Dictionary management
│       ├── validation_checker.py  # Integrity checks
│       └── iteration_orchestrator.py  # Workflow automation
│
├── Configuration
│   ├── voynich.yaml             # Master dictionary (708 words)
│   ├── agent_config.yaml        # AI agent configuration
│   ├── research_workflow.yaml   # Workflow definition
│   └── vocabulary_rules.yaml    # Morphological rules
│
├── Data
│   ├── data/
│   │   ├── folios/              # Downloaded folios
│   │   │   ├── q01_f001r.txt
│   │   │   ├── q02_f014v.txt
│   │   │   └── metadata.json
│   │   ├── translations/        # Translation outputs
│   │   │   ├── q01_f001r_translation.json
│   │   │   └── q02_f014v_translation.json
│   │   └── dictionary_suggestions.json  # Gap analysis
│   └── reports/                 # Iteration reports
│
└── Documentation
    ├── AI_RESEARCH_GUIDE.md     # AI agent instructions
    ├── WORKFLOW_INSTRUCTIONS.md # Step-by-step workflow
    ├── VOCABULARY_EXTENSION_GUIDE.md  # Linguistic methodology
    ├── SYSTEM_ARCHITECTURE.md   # This file
    ├── RESEARCH_RESULTS.md      # Performance and results
    ├── DEVELOPMENT_GUIDE.md     # Usage guide
    ├── MASTER_INDEX.md          # Navigation hub
    └── README.md                # Project overview

🔧 Technical Specifications

Dictionary Format (`voynich.yaml`)

voynich_decipherment_rules:
  rules:
    # Preprocessing rules
  
  glyph_mapping:
    # EVA glyphs to phonemes
  
  vocab:
    - word: fachys
      latin: folium
      description: "leaf; appears near plant leaves"
    
    - word: chol
      latin: caulis
      description: "stem/stalk; botanical term"
  
  polysemy:
    - word: qokedy
      meanings:
        - latin: crescit
          context: "herbal section, near plants"
        - latin: lucet
          context: "astronomical section, near stars"
        - latin: fluit
          context: "biological section, near water"
      base: crescit

Translation Output Format

{
  "folio_id": "014v",
  "section": "Herbal B",
  "context": "herbal",
  "voynich_text": "fachys ykal ar shy daiin...",
  "latin_text": "folium altum et hic ad...",
  "english_text": "leaf tall and here to...",
  "word_translations": [
    {
      "original": "fachys",
      "latin": "folium",
      "english": "leaf",
      "confidence": 0.9,
      "method": "dictionary",
      "notes": "near plants"
    }
  ],
  "statistics": {
    "total_words": 267,
    "known_words": 122,
    "unknown_words": 145,
    "coverage": 0.457,
    "avg_confidence": 0.82
  },
  "unknown_words": ["word1", "word2", ...]
}

Gap Analysis Format

{
  "word": "kokaiin",
  "frequency": 20,
  "priority_score": 285.0,
  "sections": ["q01", "q02"],
  "contexts": ["herbal"],
  "length": 7,
  "analysis": {
    "structure": {
      "prefixes": [],
      "suffixes": ["aiin"],
      "potential_roots": ["kok"]
    },
    "patterns": ["contains_aiin_suffix"]
  },
  "suggested_latin": "maturat",
  "reasoning": "Appears near fruits/seeds; compound kok + aiin (makes + is/was)"
}

🧠 Translation Algorithm

Core Translation Logic

def translate_word(word: str, context: str) -> TranslationResult:
    """
    Core translation algorithm
    """
    # 1. Preprocess
    word = preprocess_word(word)  # Remove nulls, clean
    
    # 2. Direct dictionary lookup
    if word in dictionary:
        return dictionary[word]
    
    # 3. Check polysemy
    if word in polysemy_dict:
        return resolve_polysemy(word, context)
    
    # 4. Handle prefixes (qo-, ot-, sh-, etc.)
    if has_prefix(word):
        prefix, root = extract_prefix(word)
        if root in dictionary:
            return apply_prefix_meaning(prefix, dictionary[root])
    
    # 5. Handle suffixes (-aiin, -edy, -ar, etc.)
    if has_suffix(word):
        root, suffix = extract_suffix(word)
        if root in dictionary:
            return apply_suffix_meaning(dictionary[root], suffix)
    
    # 6. Handle repetition (valde)
    if is_repeated(word):
        return apply_intensifier(word)
    
    # 7. Unknown
    return TranslationResult(
        original=word,
        latin="[unknown]",
        confidence=0.0,
        method="unknown"
    )

Polysemy Resolution

def resolve_polysemy(word: str, context: str) -> str:
    """
    Resolve word meaning based on context
    """
    polysemy_entry = polysemy_dict[word]
    
    # Try to match context
    for meaning in polysemy_entry['meanings']:
        if context in meaning['context'].lower():
            return meaning['latin']
    
    # Fall back to base meaning
    return polysemy_entry['base']

Morphological Processing

def handle_qo_prefix(word: str, context: str) -> str:
    """
    Handle qo- intensifier prefix
    """
    if word.startswith('qo'):
        root = word[2:]  # Remove 'qo'
        if root in dictionary:
            base_latin = dictionary[root]['latin']
            # Add intensifier
            return f"valde {base_latin}"
    return None

📈 Performance Characteristics

Computational Complexity

Dictionary lookup: O(1) average (hash table)
Morphological analysis: O(k) where k = number of prefix/suffix patterns
Polysemy resolution: O(m) where m = number of meanings per word
Full folio translation: O(n) where n = number of words

Scalability

Current: 22 folios, 708-word dictionary
Tested: Up to 6,655 words per batch
Memory: < 50 MB for full system
Speed: ~100-200 words/second

Coverage Metrics

Herbal B: 65.2% average coverage
Herbal A: 52.0% average coverage
Combined: 55.6% average coverage
Best folio: 73.1% (q02_f014r)

🔬 Linguistic Foundation

Hypothesis

The Voynich Manuscript is written in an encoded form of Medieval Latin using:

Substitution cipher: Voynich glyphs → Latin phonemes
Null glyphs: 'o' as filler to obscure patterns
Abbreviations: Medieval shorthand (e.g., aiin = erat)
Morphological consistency: Systematic prefix/suffix patterns

Glyph Mapping

Based on EVA (European Voynich Alphabet):

EVA	Phoneme	Latin	Notes
f	/f/	f	Common in herbal (folium)
p	/p/	p	Plant prefix (planta)
ch	/k/	c	Hard consonant (caulis)
o	/o/ or null	o/!	Null or exclamation
l	/l/	l	Liquid consonant
y	/j/	i/y	Semivowel
k	/k/	c/k	Hard consonant
t	/t/	t	Stop consonant
e	/e/	e	Vowel
d	/d/	d	Stop, verb marker
q	/kw/	qu	Intensifier prefix
ai	/ai/	ae/e	Diphthong
sh	/ʃ/	sh	Location marker

Morphological Patterns

Prefixes:

qo-: Intensifier (valde) - confidence 0.9
ot-: Source (ex) - confidence 0.8
sh-: Location (hic) - confidence 0.8
ch-: Botanical - confidence 0.7

Suffixes:

-aiin: State marker (est/erat) - confidence 0.9
-edy: Action verb (movet) - confidence 0.8
-ar: Conjunction (et) - confidence 0.7
-ol: Location (locus) - confidence 0.6

🛡️ Quality Control

Validation Mechanisms

Dictionary Validation
- YAML syntax checking
- Duplicate detection
- Required field verification
- Format validation
Translation Validation
- Coverage metrics
- Confidence scoring
- Unknown word tracking
- Statistical analysis
Morphological Validation
- Pattern consistency checks
- Prefix/suffix validation
- Root existence verification

Error Handling

# Graceful degradation
if not dictionary_loaded:
    return "Error: Dictionary not loaded"

if word_not_found:
    track_unknown(word)
    return "[unknown]"

if confidence < threshold:
    flag_for_review(word)

🔐 Configuration Management

Agent Configuration (`agent_config.yaml`)

Defines AI agent behavior:

Workflow parameters (frequency thresholds, batch sizes)
Confidence thresholds
Validation gates
Tool configurations
Context awareness rules
Decision framework

Workflow Definition (`research_workflow.yaml`)

Defines research phases:

Analyze (validate, coverage, identify unknowns)
Propose (morphology, compounds, ranking)
Validate (consistency, visual, polysemy)
Implement (backup, update, validate)
Test (re-translate, calculate, quality check)
Report (generate, metrics, next steps)

Vocabulary Rules (`vocabulary_rules.yaml`)

Defines linguistic rules:

Morphological decomposition rules
Compound formation patterns
Polysemy detection criteria
Word family generation rules
Validation rules for entries

🎯 Design Principles

1. Deterministic

Same input → same output
No randomness or ML uncertainty
Reproducible results

2. Modular

Each component has single responsibility
Easy to test and maintain
Can be used independently

3. Extensible

Dictionary can grow incrementally
New rules can be added
Supports future enhancements

4. Data-Driven

Configuration via YAML files
Human-readable formats
Easy to modify and version

5. Traceable

Every decision documented
Confidence scores tracked
Unknown words logged

🔄 System Evolution

Version History

V1.0 (Initial)

Basic glyph mappings
~50 word dictionary
Single context translation

V2.0 (Polysemy)

Context-aware translation
Polysemy system
~300 word dictionary

V3.0 (Morphology)

Prefix/suffix handling
Morphological analysis
~450 word dictionary

V4.0 (Systematic)

Word family generation
Duplicate cleanup
English translation
~708 word dictionary

V5.0 (AI Agent)

Complete helper scripts
Workflow automation
Configuration system
Documentation suite

📊 Technical Metrics

Current System Stats

Lines of Code: ~3,500 (Python)
Configuration: ~2,000 (YAML)
Documentation: ~15,000 words
Dictionary Entries: 708 words
Polysemy Entries: 10 words
Supported Contexts: 5 (herbal, astronomical, biological, pharmaceutical, cosmological)
Helper Scripts: 7 tools
Folios Processed: 22 pages
Total Words Analyzed: 6,655
Unique Words Identified: 1,060

🚀 Future Architecture Enhancements

Planned Improvements

Machine Learning Layer
- Auto-suggest morphological decompositions
- Predict word meanings from context
- Pattern recognition for compounds
Visual Integration
- OCR for direct image processing
- Image-to-text correlation
- Botanical species identification
Database Backend
- SQL database for dictionary
- Query optimization
- Version control for entries
Web Interface
- Browser-based translation tool
- Interactive vocabulary editor
- Visualization dashboard
API Layer
- RESTful API for translations
- Integration with other tools
- Batch processing endpoints

🔗 Dependencies

Python Requirements

python >= 3.8
httpx >= 0.24.0
pyyaml >= 6.0
pathlib (standard library)
json (standard library)
argparse (standard library)

External Resources

voynich.nu: Source of EVA transcriptions
YAML: Configuration format
EVA Alphabet: Standard transcription system

📚 References

Technical Standards

EVA (European Voynich Alphabet): Standard transcription system
YAML 1.2: Configuration file format
JSON: Data interchange format
UTF-8: Character encoding

Research Foundation

Currier's A/B Dialects: Glyph variation patterns
Stolfi's Entropy Analysis: Information content studies
Tiltman's Word Length: Statistical observations
Takahashi's EVA: Digital transcription standard

System Status: OPERATIONAL ✅
Architecture Version: 5.0
Last Updated: November 27, 2025

For usage instructions, see DEVELOPMENT_GUIDE.md.
For research results, see RESEARCH_RESULTS.md.
For AI agent instructions, see AI_RESEARCH_GUIDE.md.

FilesExpand file tree

SYSTEM_ARCHITECTURE.md

Latest commit

History