
Voynich Translation System Architecture

Technical Overview and System Design

This document provides a comprehensive technical overview of the Voynich manuscript translation system, including its architecture, components, and underlying methodology.


🎯 System Overview

The Voynich Translation System is a deterministic, automated pipeline for translating Voynichese (the unknown script of the Voynich Manuscript) into Latin and English using a rule-based approach with polysemy support and morphological analysis.

System Flow

voynich.nu → Download → Parse → Translate → Analyze → Update → Iterate
     ↓          ↓         ↓         ↓          ↓         ↓        ↓
   Source   Folio     Cleaned   Latin +    Gap      Dictionary  Loop
            Text      Words    English  Analysis    Updates

Key Capabilities

  • ✅ Automated folio downloading from voynich.nu
  • ✅ Deterministic translation using rule-based dictionary
  • ✅ Context-aware polysemy (same word, different meanings by section)
  • ✅ Morphological processing (prefixes, suffixes, compounds)
  • ✅ Dual-language output (Latin + English)
  • ✅ Gap analysis and vocabulary extension
  • ✅ Coverage metrics and confidence scoring

🏗️ System Architecture

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Voynich Translation System               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐       ┌──────────────┐                    │
│  │   Download   │──────▶│    Parser    │                    │
│  │  Folios      │       │   (EVA)      │                    │
│  └──────────────┘       └──────────────┘                    │
│                                │                            │
│                                ▼                            │
│                       ┌──────────────┐                      │
│                       │  Translator  │◀─── voynich.yaml     │
│                       │   Engine     │                      │
│                       └──────────────┘                      │
│                                │                            │
│                    ┌───────────┴───────────┐                │
│                    ▼                       ▼                │
│           ┌──────────────┐       ┌──────────────┐           │
│           │    Latin     │       │   English    │           │
│           │  Translation │       │  Translation │           │
│           └──────────────┘       └──────────────┘           │
│                    │                       │                │
│                    └───────────┬───────────┘                │
│                                ▼                            │
│                       ┌──────────────┐                      │
│                       │  Gap         │                      │
│                       │  Analysis    │                      │
│                       └──────────────┘                      │
│                                │                            │
│                                ▼                            │
│                       ┌──────────────┐                      │
│                       │  Dictionary  │                      │
│                       │  Updates     │                      │
│                       └──────────────┘                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Core Components

1. Folio Downloader (download_folios.py)

  • Downloads transcriptions from voynich.nu
  • Caches locally to avoid redundant downloads
  • Parses EVA (European Voynich Alphabet) format
  • Cleans transcription artifacts
  • Tracks metadata (section, word count)

2. Translation Engine (translator.py)

  • Core deterministic translation logic
  • Loads dictionary from voynich.yaml
  • Processes morphological rules
  • Resolves polysemy based on context
  • Handles repetition patterns
  • Tracks confidence scores
  • Maintains unknown word lists

3. CLI Translator (translate_folio.py)

  • Command-line interface for translations
  • Single folio or batch mode
  • Auto-detects context from section
  • Saves results as JSON
  • Displays coverage statistics
  • Force re-translation option

4. Gap Analyzer (analyze_gaps.py)

  • Analyzes unknown words across translations
  • Ranks by frequency and priority
  • Identifies structural patterns (prefixes, suffixes)
  • Suggests dictionary entries with reasoning
  • Exports suggestions as JSON

5. Helper Scripts (in scripts/ directory)

  • word_frequency.py: Frequency analysis tool
  • morphology_analyzer.py: Morphological decomposition
  • pattern_detector.py: Pattern and formula detection
  • compound_decomposer.py: Compound word analysis
  • batch_dictionary_updater.py: Dictionary management
  • validation_checker.py: Dictionary integrity checks
  • iteration_orchestrator.py: Full workflow automation

📊 Data Flow

Download Phase

1. User requests folio (e.g., q02_f014v)
2. download_folios.py checks cache
3. If not cached:
   - Fetch from voynich.nu
   - Parse EVA transcription
   - Clean artifacts (!, *, {}, etc.)
   - Save to data/folios/
4. Update metadata.json
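
The cache check and cleaning steps above can be sketched as follows. This is a minimal illustration, not the code in download_folios.py: the function names, the exact artifact patterns, the assumption that words are delimited by periods or whitespace, and the injected fetch callable (which stands in for the HTTP request to voynich.nu) are all hypothetical.

```python
import re
from pathlib import Path
from typing import Callable, List

# Assumed artifact patterns: inline {...} groups plus the '!' and '*' markers.
_BRACED = re.compile(r"\{[^}]*\}")
_MARKERS = re.compile(r"[!*]")


def clean_transcription(raw: str) -> List[str]:
    """Strip transcription artifacts and return the remaining words."""
    text = _BRACED.sub("", raw)
    text = _MARKERS.sub("", text)
    # Assumption: words are delimited by '.' or whitespace.
    return [w for w in re.split(r"[.\s]+", text) if w]


def load_folio(folio_id: str, cache_dir: Path, fetch: Callable[[str], str]) -> List[str]:
    """Return cleaned words for a folio, calling fetch() only on a cache miss."""
    cache_file = cache_dir / f"{folio_id}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8").split()
    words = clean_transcription(fetch(folio_id))
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(" ".join(words), encoding="utf-8")
    return words
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; the metadata.json update is omitted because its schema is not described here.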

Translation Phase

1. Load folio from cache
2. Extract word sequence
3. For each word:
   a. Preprocess (remove nulls)
   b. Check dictionary
   c. Handle morphology (prefixes/suffixes)
   d. Resolve polysemy (context-based)
   e. Assign confidence score
4. Generate Latin text
5. Translate Latin → English
6. Calculate statistics
7. Save to data/translations/

Analysis Phase

1. Load all translation files
2. Collect unknown words
3. Count frequencies
4. Analyze word structure:
   - Identify potential prefixes
   - Identify potential suffixes
   - Find embedded known roots
5. Calculate priority scores
6. Generate suggestions
7. Export to dictionary_suggestions.json
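
Steps 2 to 4 of the analysis phase could look roughly like the sketch below. It is an illustration under stated assumptions, not the logic of analyze_gaps.py: the affix inventories are copied from the Morphological Patterns section, the function names are hypothetical, and the priority score is left out because its formula is not specified in this document.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Affix inventories taken from the "Morphological Patterns" section.
PREFIXES = ("qo", "ot", "sh", "ch")
SUFFIXES = ("aiin", "edy", "ar", "ol")


def analyze_structure(word: str) -> Dict[str, List[str]]:
    """Split a word into candidate prefixes, suffixes, and a remaining root."""
    prefixes = [p for p in PREFIXES if word.startswith(p) and len(word) > len(p)]
    suffixes = [s for s in SUFFIXES if word.endswith(s) and len(word) > len(s)]
    root = word
    if prefixes:
        root = root[len(prefixes[0]):]
    if suffixes and root.endswith(suffixes[0]) and len(root) > len(suffixes[0]):
        root = root[: -len(suffixes[0])]
    roots = [root] if root != word else []
    return {"prefixes": prefixes, "suffixes": suffixes, "potential_roots": roots}


def rank_unknown_words(unknown_words: Iterable[str]) -> List[dict]:
    """Count unknown words and list them from most to least frequent."""
    counts = Counter(unknown_words)
    return [
        {"word": w, "frequency": n, "length": len(w), "structure": analyze_structure(w)}
        for w, n in counts.most_common()
    ]
```

For the word kokaiin this yields an empty prefix list, the suffix aiin, and the candidate root kok, which matches the structure shown in the gap analysis format.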

🗂️ File Structure

voynich/
├── Core Scripts
│   ├── download_folios.py       # Folio downloader
│   ├── translator.py            # Translation engine
│   ├── translate_folio.py       # CLI interface
│   ├── analyze_gaps.py          # Gap analyzer
│   └── review_and_update.py     # Dictionary updater
│
├── Helper Scripts
│   └── scripts/
│       ├── word_frequency.py            # Frequency analysis
│       ├── morphology_analyzer.py       # Morphological decomposition
│       ├── pattern_detector.py          # Pattern detection
│       ├── compound_decomposer.py       # Compound analysis
│       ├── batch_dictionary_updater.py  # Dictionary management
│       ├── validation_checker.py        # Integrity checks
│       └── iteration_orchestrator.py    # Workflow automation
│
├── Configuration
│   ├── voynich.yaml             # Master dictionary (708 words)
│   ├── agent_config.yaml        # AI agent configuration
│   ├── research_workflow.yaml   # Workflow definition
│   └── vocabulary_rules.yaml    # Morphological rules
│
├── Data
│   ├── data/
│   │   ├── folios/              # Downloaded folios
│   │   │   ├── q01_f001r.txt
│   │   │   ├── q02_f014v.txt
│   │   │   └── metadata.json
│   │   ├── translations/        # Translation outputs
│   │   │   ├── q01_f001r_translation.json
│   │   │   └── q02_f014v_translation.json
│   │   └── dictionary_suggestions.json  # Gap analysis
│   └── reports/                 # Iteration reports
│
└── Documentation
    ├── AI_RESEARCH_GUIDE.md           # AI agent instructions
    ├── WORKFLOW_INSTRUCTIONS.md       # Step-by-step workflow
    ├── VOCABULARY_EXTENSION_GUIDE.md  # Linguistic methodology
    ├── SYSTEM_ARCHITECTURE.md         # This file
    ├── RESEARCH_RESULTS.md            # Performance and results
    ├── DEVELOPMENT_GUIDE.md           # Usage guide
    ├── MASTER_INDEX.md                # Navigation hub
    └── README.md                      # Project overview

🔧 Technical Specifications

Dictionary Format (voynich.yaml)

voynich_decipherment_rules:
  rules:
    # Preprocessing rules
  
  glyph_mapping:
    # EVA glyphs to phonemes
  
  vocab:
    - word: fachys
      latin: folium
      description: "leaf; appears near plant leaves"
    
    - word: chol
      latin: caulis
      description: "stem/stalk; botanical term"
  
  polysemy:
    - word: qokedy
      meanings:
        - latin: crescit
          context: "herbal section, near plants"
        - latin: lucet
          context: "astronomical section, near stars"
        - latin: fluit
          context: "biological section, near water"
      base: crescit
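
Once the YAML above has been parsed (for example with yaml.safe_load from PyYAML, which is listed under Dependencies), the vocab and polysemy lists could be turned into word-keyed lookup tables. The sketch below uses a hand-written dict in place of the parsed file so that it runs without PyYAML; build_indexes is a hypothetical helper, not a function confirmed to exist in translator.py.

```python
from typing import Dict, Tuple

# Stand-in for the result of parsing voynich.yaml.
parsed = {
    "voynich_decipherment_rules": {
        "rules": None,
        "glyph_mapping": None,
        "vocab": [
            {"word": "fachys", "latin": "folium",
             "description": "leaf; appears near plant leaves"},
            {"word": "chol", "latin": "caulis",
             "description": "stem/stalk; botanical term"},
        ],
        "polysemy": [
            {
                "word": "qokedy",
                "meanings": [
                    {"latin": "crescit", "context": "herbal section, near plants"},
                    {"latin": "lucet", "context": "astronomical section, near stars"},
                    {"latin": "fluit", "context": "biological section, near water"},
                ],
                "base": "crescit",
            }
        ],
    }
}


def build_indexes(doc: dict) -> Tuple[Dict[str, dict], Dict[str, dict]]:
    """Return (dictionary, polysemy_dict), each keyed by the Voynich word."""
    rules = doc["voynich_decipherment_rules"]
    # 'or []' covers keys that are present in the YAML but empty (parsed as None).
    dictionary = {entry["word"]: entry for entry in rules.get("vocab") or []}
    polysemy_dict = {entry["word"]: entry for entry in rules.get("polysemy") or []}
    return dictionary, polysemy_dict


dictionary, polysemy_dict = build_indexes(parsed)
```

Keying both tables by word gives the average O(1) lookup that the Performance Characteristics section relies on.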

Translation Output Format

{
  "folio_id": "014v",
  "section": "Herbal B",
  "context": "herbal",
  "voynich_text": "fachys ykal ar shy daiin...",
  "latin_text": "folium altum et hic ad...",
  "english_text": "leaf tall and here to...",
  "word_translations": [
    {
      "original": "fachys",
      "latin": "folium",
      "english": "leaf",
      "confidence": 0.9,
      "method": "dictionary",
      "notes": "near plants"
    }
  ],
  "statistics": {
    "total_words": 267,
    "known_words": 122,
    "unknown_words": 145,
    "coverage": 0.457,
    "avg_confidence": 0.82
  },
  "unknown_words": ["word1", "word2", ...]
}
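
The statistics block can be derived from word_translations alone. The following sketch shows one way to do it; compute_statistics is a hypothetical name, and averaging confidence over known words only is an assumption, since the document does not say whether unknown words (confidence 0.0) are included in avg_confidence.

```python
from typing import Dict, List


def compute_statistics(word_translations: List[dict]) -> Dict[str, float]:
    """Summarize coverage and confidence for one folio."""
    total = len(word_translations)
    known = [w for w in word_translations if w.get("method") != "unknown"]
    # Assumption: average confidence is taken over known words only.
    avg_conf = sum(w["confidence"] for w in known) / len(known) if known else 0.0
    return {
        "total_words": total,
        "known_words": len(known),
        "unknown_words": total - len(known),
        "coverage": round(len(known) / total, 3) if total else 0.0,
        "avg_confidence": round(avg_conf, 2),
    }
```

With the sample figures above, 122 known words out of 267 gives 122 / 267 ≈ 0.457, which is the coverage value shown in the output.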

Gap Analysis Format

{
  "word": "kokaiin",
  "frequency": 20,
  "priority_score": 285.0,
  "sections": ["q01", "q02"],
  "contexts": ["herbal"],
  "length": 7,
  "analysis": {
    "structure": {
      "prefixes": [],
      "suffixes": ["aiin"],
      "potential_roots": ["kok"]
    },
    "patterns": ["contains_aiin_suffix"]
  },
  "suggested_latin": "maturat",
  "reasoning": "Appears near fruits/seeds; compound kok + aiin (makes + is/was)"
}

🧠 Translation Algorithm

Core Translation Logic

def translate_word(word: str, context: str) -> TranslationResult:
    """
    Core translation algorithm
    """
    # 1. Preprocess
    word = preprocess_word(word)  # Remove nulls, clean
    
    # 2. Direct dictionary lookup
    if word in dictionary:
        return dictionary[word]
    
    # 3. Check polysemy
    if word in polysemy_dict:
        return resolve_polysemy(word, context)
    
    # 4. Handle prefixes (qo-, ot-, sh-, etc.)
    if has_prefix(word):
        prefix, root = extract_prefix(word)
        if root in dictionary:
            return apply_prefix_meaning(prefix, dictionary[root])
    
    # 5. Handle suffixes (-aiin, -edy, -ar, etc.)
    if has_suffix(word):
        root, suffix = extract_suffix(word)
        if root in dictionary:
            return apply_suffix_meaning(dictionary[root], suffix)
    
    # 6. Handle repetition (valde)
    if is_repeated(word):
        return apply_intensifier(word)
    
    # 7. Unknown
    return TranslationResult(
        original=word,
        latin="[unknown]",
        confidence=0.0,
        method="unknown"
    )
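
The listing above depends on helpers that are not shown. For readers who want something executable, here is a reduced, self-contained sketch of steps 2, 4, 5, and 7 only (preprocessing, polysemy, and repetition are left out). The three-entry tables, the method labels "prefix" and "suffix", the 0.9 confidence for direct hits, and the way glosses are joined to the root are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TranslationResult:
    original: str
    latin: str
    confidence: float
    method: str


# Tiny illustrative dictionary; real entries live in voynich.yaml.
DICTIONARY: Dict[str, str] = {"fachys": "folium", "chol": "caulis"}
# Affix glosses and confidences as listed under "Morphological Patterns".
PREFIXES: Dict[str, Tuple[str, float]] = {"qo": ("valde", 0.9), "ot": ("ex", 0.8), "sh": ("hic", 0.8)}
SUFFIXES: Dict[str, Tuple[str, float]] = {"aiin": ("est", 0.9), "edy": ("movet", 0.8), "ar": ("et", 0.7)}


def translate_word(word: str) -> TranslationResult:
    """Try direct lookup, then prefix rules, then suffix rules."""
    if word in DICTIONARY:
        # Assumed confidence for a direct hit; the sample output uses 0.9.
        return TranslationResult(word, DICTIONARY[word], 0.9, "dictionary")
    for prefix, (gloss, conf) in PREFIXES.items():
        root = word[len(prefix):]
        if word.startswith(prefix) and root in DICTIONARY:
            return TranslationResult(word, f"{gloss} {DICTIONARY[root]}", conf, "prefix")
    for suffix, (gloss, conf) in SUFFIXES.items():
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in DICTIONARY:
            return TranslationResult(word, f"{DICTIONARY[root]} {gloss}", conf, "suffix")
    return TranslationResult(word, "[unknown]", 0.0, "unknown")
```

Every branch returns the same TranslationResult type, so callers can read latin, confidence, and method without checking which rule fired.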

Polysemy Resolution

def resolve_polysemy(word: str, context: str) -> str:
    """
    Resolve word meaning based on context
    """
    polysemy_entry = polysemy_dict[word]
    context = context.lower()  # Match case-insensitively
    
    # Try to match context (substring match against the entry's context note)
    for meaning in polysemy_entry['meanings']:
        if context in meaning['context'].lower():
            return meaning['latin']
    
    # Fall back to base meaning
    return polysemy_entry['base']
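
As a worked example, the qokedy entry from the dictionary format section can be resolved for several contexts. The sketch is self-contained and restates the function with one extra guard that is an assumption, not documented behavior: an empty context falls back to the base meaning instead of matching the first entry (an empty string is a substring of every context note).

```python
polysemy_dict = {
    "qokedy": {
        "meanings": [
            {"latin": "crescit", "context": "herbal section, near plants"},
            {"latin": "lucet", "context": "astronomical section, near stars"},
            {"latin": "fluit", "context": "biological section, near water"},
        ],
        "base": "crescit",
    }
}


def resolve_polysemy(word: str, context: str) -> str:
    """Pick the meaning whose context note mentions the given context."""
    entry = polysemy_dict[word]
    needle = context.strip().lower()
    for meaning in entry["meanings"]:
        # Skip matching entirely when no context was supplied.
        if needle and needle in meaning["context"].lower():
            return meaning["latin"]
    return entry["base"]


astronomical = resolve_polysemy("qokedy", "astronomical")    # matches second meaning
biological = resolve_polysemy("qokedy", "Biological")        # case-insensitive match
fallback = resolve_polysemy("qokedy", "pharmaceutical")      # no match, uses base
```

Because matching is a substring test, the context labels must appear literally in each meaning's context note; a label such as "pharmaceutical" that no note mentions always resolves to the base meaning.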

Morphological Processing

from typing import Optional

def handle_qo_prefix(word: str, context: str) -> Optional[str]:
    """
    Handle qo- intensifier prefix.
    Returns None when the word has no qo- prefix or its root is unknown.
    """
    if word.startswith('qo'):
        root = word[2:]  # Remove 'qo'
        if root in dictionary:
            base_latin = dictionary[root]['latin']
            # Add intensifier
            return f"valde {base_latin}"
    return None

📈 Performance Characteristics

Computational Complexity

  • Dictionary lookup: O(1) average (hash table)
  • Morphological analysis: O(k) where k = number of prefix/suffix patterns
  • Polysemy resolution: O(m) where m = number of meanings per word
  • Full folio translation: O(n · (k + m)) for n words, effectively O(n) because k and m are small constants

Scalability

  • Current: 22 folios, 708-word dictionary
  • Tested: Up to 6,655 words per batch
  • Memory: < 50 MB for full system
  • Speed: ~100-200 words/second

Coverage Metrics

  • Herbal B: 65.2% average coverage
  • Herbal A: 52.0% average coverage
  • Combined: 55.6% average coverage
  • Best folio: 73.1% (q02_f014r)

🔬 Linguistic Foundation

Hypothesis

The Voynich Manuscript is written in an encoded form of Medieval Latin using:

  • Substitution cipher: Voynich glyphs → Latin phonemes
  • Null glyphs: 'o' as filler to obscure patterns
  • Abbreviations: Medieval shorthand (e.g., aiin = erat)
  • Morphological consistency: Systematic prefix/suffix patterns

Glyph Mapping

Based on EVA (European Voynich Alphabet):

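| EVA | Phoneme     | Latin | Notes                     |
|-----|-------------|-------|---------------------------|
| f   | /f/         | f     | Common in herbal (folium) |
| p   | /p/         | p     | Plant prefix (planta)     |
| ch  | /k/         | c     | Hard consonant (caulis)   |
| o   | /o/ or null | o/!   | Null or exclamation       |
| l   | /l/         | l     | Liquid consonant          |
| y   | /j/         | i/y   | Semivowel                 |
| k   | /k/         | c/k   | Hard consonant            |
| t   | /t/         | t     | Stop consonant            |
| e   | /e/         | e     | Vowel                     |
| d   | /d/         | d     | Stop, verb marker         |
| q   | /kw/        | qu    | Intensifier prefix        |
| ai  | /ai/        | ae/e  | Diphthong                 |
| sh  | /ʃ/         | sh    | Location marker           |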
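
Because the table contains two-character units (ch, sh, ai) alongside single letters, a word has to be split into glyph units before any mapping is applied. One simple approach is a greedy scan that prefers digraphs, sketched below. The function name and the greedy strategy are assumptions; the document does not say how translator.py segments words.

```python
from typing import List

# Multi-character EVA units from the table; checked before single letters.
DIGRAPHS = ("ch", "sh", "ai")


def tokenize_eva(word: str) -> List[str]:
    """Split an EVA word into glyph units, preferring digraphs over single letters."""
    tokens = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens
```

Under this scheme chol splits into ch, o, l and daiin splits into d, ai, i, n.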

Morphological Patterns

Prefixes:

  • qo-: Intensifier (valde) - confidence 0.9
  • ot-: Source (ex) - confidence 0.8
  • sh-: Location (hic) - confidence 0.8
  • ch-: Botanical - confidence 0.7

Suffixes:

  • -aiin: State marker (est/erat) - confidence 0.9
  • -edy: Action verb (movet) - confidence 0.8
  • -ar: Conjunction (et) - confidence 0.7
  • -ol: Location (locus) - confidence 0.6

🛡️ Quality Control

Validation Mechanisms

  1. Dictionary Validation

    • YAML syntax checking
    • Duplicate detection
    • Required field verification
    • Format validation
  2. Translation Validation

    • Coverage metrics
    • Confidence scoring
    • Unknown word tracking
    • Statistical analysis
  3. Morphological Validation

    • Pattern consistency checks
    • Prefix/suffix validation
    • Root existence verification
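
The duplicate detection and required field checks listed under Dictionary Validation might be implemented along these lines. This is a sketch, not the contents of validation_checker.py: validate_vocab is a hypothetical name, and treating word and latin as the only required fields is an assumption.

```python
from collections import Counter
from typing import List

# Assumed minimum; 'description' is treated as optional here.
REQUIRED_FIELDS = ("word", "latin")


def validate_vocab(vocab: List[dict]) -> List[str]:
    """Return human-readable problems found in a list of vocab entries."""
    problems = []
    for index, entry in enumerate(vocab):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"entry {index}: missing '{field}'")
    # Count only entries that actually have a word, then report repeats.
    counts = Counter(e["word"] for e in vocab if e.get("word"))
    for word, n in counts.items():
        if n > 1:
            problems.append(f"duplicate word '{word}' ({n} entries)")
    return problems
```

Returning a list of messages instead of raising on the first error lets a single run report every problem in the dictionary.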

Error Handling

# Graceful degradation
if not dictionary_loaded:
    return "Error: Dictionary not loaded"

if word_not_found:
    track_unknown(word)
    return "[unknown]"

if confidence < threshold:
    flag_for_review(word)

🔐 Configuration Management

Agent Configuration (agent_config.yaml)

Defines AI agent behavior:

  • Workflow parameters (frequency thresholds, batch sizes)
  • Confidence thresholds
  • Validation gates
  • Tool configurations
  • Context awareness rules
  • Decision framework

Workflow Definition (research_workflow.yaml)

Defines research phases:

  1. Analyze (validate, coverage, identify unknowns)
  2. Propose (morphology, compounds, ranking)
  3. Validate (consistency, visual, polysemy)
  4. Implement (backup, update, validate)
  5. Test (re-translate, calculate, quality check)
  6. Report (generate, metrics, next steps)

Vocabulary Rules (vocabulary_rules.yaml)

Defines linguistic rules:

  • Morphological decomposition rules
  • Compound formation patterns
  • Polysemy detection criteria
  • Word family generation rules
  • Validation rules for entries

🎯 Design Principles

1. Deterministic

  • Same input → same output
  • No randomness or ML uncertainty
  • Reproducible results

2. Modular

  • Each component has single responsibility
  • Easy to test and maintain
  • Can be used independently

3. Extensible

  • Dictionary can grow incrementally
  • New rules can be added
  • Supports future enhancements

4. Data-Driven

  • Configuration via YAML files
  • Human-readable formats
  • Easy to modify and version

5. Traceable

  • Every decision documented
  • Confidence scores tracked
  • Unknown words logged

🔄 System Evolution

Version History

V1.0 (Initial)

  • Basic glyph mappings
  • ~50 word dictionary
  • Single context translation

V2.0 (Polysemy)

  • Context-aware translation
  • Polysemy system
  • ~300 word dictionary

V3.0 (Morphology)

  • Prefix/suffix handling
  • Morphological analysis
  • ~450 word dictionary

V4.0 (Systematic)

  • Word family generation
  • Duplicate cleanup
  • English translation
  • 708-word dictionary

V5.0 (AI Agent)

  • Complete helper scripts
  • Workflow automation
  • Configuration system
  • Documentation suite

📊 Technical Metrics

Current System Stats

  • Lines of Code: ~3,500 (Python)
  • Configuration: ~2,000 lines (YAML)
  • Documentation: ~15,000 words
  • Dictionary Entries: 708 words
  • Polysemy Entries: 10 words
  • Supported Contexts: 5 (herbal, astronomical, biological, pharmaceutical, cosmological)
  • Helper Scripts: 7 tools
  • Folios Processed: 22 pages
  • Total Words Analyzed: 6,655
  • Unique Words Identified: 1,060

🚀 Future Architecture Enhancements

Planned Improvements

  1. Machine Learning Layer

    • Auto-suggest morphological decompositions
    • Predict word meanings from context
    • Pattern recognition for compounds
  2. Visual Integration

    • OCR for direct image processing
    • Image-to-text correlation
    • Botanical species identification
  3. Database Backend

    • SQL database for dictionary
    • Query optimization
    • Version control for entries
  4. Web Interface

    • Browser-based translation tool
    • Interactive vocabulary editor
    • Visualization dashboard
  5. API Layer

    • RESTful API for translations
    • Integration with other tools
    • Batch processing endpoints

🔗 Dependencies

Python Requirements

python >= 3.8
httpx >= 0.24.0
pyyaml >= 6.0
pathlib (standard library)
json (standard library)
argparse (standard library)

External Resources

  • voynich.nu: Source of EVA transcriptions
  • YAML: Configuration format
  • EVA Alphabet: Standard transcription system

📚 References

Technical Standards

  • EVA (European Voynich Alphabet): Standard transcription system
  • YAML 1.2: Configuration file format
  • JSON: Data interchange format
  • UTF-8: Character encoding

Research Foundation

  • Currier's A/B Dialects: Glyph variation patterns
  • Stolfi's Entropy Analysis: Information content studies
  • Tiltman's Word Length: Statistical observations
  • Takahashi's EVA Transcription: Complete digital transcription in EVA

System Status: OPERATIONAL ✅
Architecture Version: 5.0
Last Updated: November 27, 2025

For usage instructions, see DEVELOPMENT_GUIDE.md.
For research results, see RESEARCH_RESULTS.md.
For AI agent instructions, see AI_RESEARCH_GUIDE.md.