| 1 | +# Migration Status |
| 2 | + |
| 3 | +**Last Updated:** 2026-01-11 |
| 4 | + |
| 5 | +This document tracks the progress of migrating the Klingon lexicon database from XML to YAML format, as specified in SPEC.md. |
| 6 | + |
| 7 | +## Current Status: Phase 5 In Progress |
| 8 | + |
| 9 | +The YAML pipeline is now functional and produces **byte-identical output** to the XML pipeline for both SQLite (`qawHaq.db`) and JSON (`qawHaq.json`) formats. E-K dictionary generation is now implemented. |
| 10 | + |
| 11 | +## Completed Work |
| 12 | + |
| 13 | +### Phase 1: Source Registry ✓ |
| 14 | +- Created `sources.yaml` with source metadata |
| 15 | +- Source parsing integrated into migration script |
| 16 | +- Sources are stored in structured format within entry YAML files |
| 17 | + |
| 18 | +### Phase 2: Entry File Structure ✓ |
| 19 | +- **6,142 YAML files** containing **6,407 entries** migrated from XML |
| 20 | +- Directory structure: `entries/{pos_type}/{first_letter}/{entry}.yaml` |
| 21 | +- Suffixes stored separately: `entries/suffixes/{verb|noun}/` |
| 22 | +- Migration script: `build/migrate_xml.py` |
| 23 | +- YAML parsers: `build/yaml2sql.py`, `build/yaml2json.py` |
| 24 | +- Build script: `generate_db_yaml.sh` |
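The directory layout above can be read back from a file path. The following is a minimal sketch, not the project's own code: `EntryLocation`, `parse_entry_path`, and the file name `bach.yaml` are hypothetical names used for illustration, and the path is assumed to be relative to `entries/`.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class EntryLocation(NamedTuple):
    pos_type: str  # e.g. "verbs", "nouns", or "suffixes"
    group: str     # first-letter directory, or "verb"/"noun" under suffixes/
    stem: str      # file name without the ".yaml" extension


def parse_entry_path(path: str) -> Optional[EntryLocation]:
    """Split a path relative to entries/ into its three layout components.

    Returns None when the path does not have exactly three components
    or does not end in ".yaml".
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 3 or not parts[2].endswith(".yaml"):
        return None
    pos_type, group, filename = parts
    return EntryLocation(pos_type, group, filename[: -len(".yaml")])
```

For example, `parse_entry_path("verbs/b/bach.yaml")` yields `pos_type="verbs"`, `group="b"`, `stem="bach"`. Suffix files sit at the same depth, so they parse the same way with `group` set to `verb` or `noun`.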
| 25 | + |
| 26 | +### Shared Notes |
| 27 | +- **18 shared note files** extracted to `notes/` directory |
| 28 | +- Notes that appear in 3+ entries are stored as reusable references |
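The "3+ entries" rule can be sketched as a simple frequency count. This is an illustration under stated assumptions, not the migration script's logic: it assumes each entry is a dict with an optional `notes` string and that two notes count as the same only when their text matches exactly after stripping surrounding whitespace. `find_shared_notes` is a hypothetical name.

```python
from collections import Counter

SHARED_NOTE_THRESHOLD = 3  # taken from the "3+ entries" rule above


def find_shared_notes(entries, threshold=SHARED_NOTE_THRESHOLD):
    """Return {note_text: count} for notes used by at least `threshold` entries."""
    counts = Counter()
    for entry in entries:
        # Missing or None notes are treated as empty (assumption).
        note = (entry.get("notes") or "").strip()
        if note:
            counts[note] += 1
    return {note: n for note, n in counts.items() if n >= threshold}
```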
| 29 | + |
| 30 | +## File Structure |
| 31 | + |
| 32 | +``` |
| 33 | +data/ |
| 34 | +├── entries/ # 6,142 YAML entry files |
| 35 | +│ ├── verbs/ |
| 36 | +│ ├── nouns/ |
| 37 | +│ ├── adverbials/ |
| 38 | +│ ├── conjunctions/ |
| 39 | +│ ├── questions/ |
| 40 | +│ ├── sentences/ |
| 41 | +│ ├── exclamations/ |
| 42 | +│ └── suffixes/ |
| 43 | +│ ├── verb/ |
| 44 | +│ └── noun/ |
| 45 | +├── notes/ # 18 shared note files |
| 46 | +├── sources.yaml # Source registry |
| 47 | +├── build/ |
| 48 | +│ ├── migrate_xml.py # XML to YAML migration |
| 49 | +│ ├── yaml2sql.py # YAML to SQL generation |
| 50 | +│ ├── yaml2json.py # YAML to JSON generation |
| 51 | +│ ├── definition_parser.py # Parse definitions for E-K |
| 52 | +│ ├── source_parser.py |
| 53 | +│ ├── ek_generator.py # Generate E-K dictionary (Markdown/JSON) |
| 54 | +│ ├── latex_generator.py # Generate LaTeX dictionary |
| 55 | +│ ├── ek_dictionary.md # E-K output (Markdown) |
| 56 | +│ ├── ek_index.json # E-K output (JSON) |
| 57 | +│ └── dictionary.tex # LaTeX output (K-E + E-K) |
| 58 | +├── generate_db_yaml.sh # New YAML-based build script |
| 59 | +├── generate_db.sh # Original XML-based build script (still works) |
| 60 | +└── mem-*.xml # Original XML source files (retained for now) |
| 61 | +``` |
| 62 | + |
| 63 | +## Build Commands |
| 64 | + |
| 65 | +### Using YAML Pipeline (New) |
| 66 | +```bash |
| 67 | +# Generate database from YAML sources |
| 68 | +./generate_db_yaml.sh |
| 69 | + |
| 70 | +# Non-interactive mode (for CI/CD) |
| 71 | +./generate_db_yaml.sh --noninteractive |
| 72 | +``` |
| 73 | + |
| 74 | +### Using XML Pipeline (Original) |
| 75 | +```bash |
| 76 | +# Generate database from XML sources |
| 77 | +./generate_db.sh |
| 78 | + |
| 79 | +# Non-interactive mode |
| 80 | +./generate_db.sh --noninteractive |
| 81 | +``` |
| 82 | + |
| 83 | +### Generating E-K Dictionary |
| 84 | +```bash |
| 85 | +# Generate E-K dictionary from YAML entries |
| 86 | +python3 build/ek_generator.py |
| 87 | + |
| 88 | +# Outputs: |
| 89 | +# build/ek_dictionary.md - Markdown for print |
| 90 | +# build/ek_index.json - JSON for apps |
| 91 | +``` |
| 92 | + |
| 93 | +### Generating LaTeX Dictionary |
| 94 | +```bash |
| 95 | +# Generate complete LaTeX dictionary (K-E + E-K) |
| 96 | +python3 build/latex_generator.py > build/dictionary.tex |
| 97 | + |
| 98 | +# Sections: base, ficnames, loanwords, places |
| 99 | +``` |
| 100 | + |
| 101 | +## Verification |
| 102 | + |
| 103 | +The YAML pipeline has been verified to produce output identical to the XML pipeline's: |
| 104 | + |
| 105 | +| Output | YAML Pipeline | XML Pipeline | Match | |
| 106 | +|--------|---------------|--------------|-------| |
| 107 | +| qawHaq.db entries | 6,407 | 6,407 | ✓ | |
| 108 | +| qawHaq.json size | 6,496,002 bytes | 6,496,002 bytes | ✓ | |
| 109 | +| SQL dump diff | 0 lines | - | ✓ | |
| 110 | +| JSON diff | 0 lines | - | ✓ | |
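A byte-level comparison like the one in the table could be scripted as follows. This is a sketch only: it assumes the two pipelines write their outputs to separate paths, and `sha256_of` and `outputs_match` are hypothetical helpers, not part of the build scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(yaml_output, xml_output):
    """Return True when both files have the same size and the same SHA-256."""
    a, b = Path(yaml_output), Path(xml_output)
    if a.stat().st_size != b.stat().st_size:
        return False  # different sizes cannot be byte-identical
    return sha256_of(a) == sha256_of(b)
```

Checking the size first is a cheap early exit. The hash comparison then confirms the contents agree, not just the lengths.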
| 111 | + |
| 112 | +## Remaining Work |
| 113 | + |
| 114 | +### Phase 3: Definition Structure ✓ |
| 115 | +- ✓ Parse definitions into structured parts for E-K dictionary |
| 116 | +- ✓ Extract global parentheticals |
| 117 | +- ✓ Add sort keyword support |
| 118 | +- ✓ Guard cases (`no_permute` flag) for definitions that should not be split |
| 119 | +- ✓ Deduplication (`dedup` flag) to prevent nearby duplicate E-K entries |
| 120 | +- Pending: Human review of ambiguous parses |
| 121 | + |
| 122 | +### Phase 4: Notes and Examples (Future) |
| 123 | +- Further extraction of shared notes |
| 124 | +- Structure examples with stanzas |
| 125 | +- Update entry references |
| 126 | + |
| 127 | +### Phase 5: E-K Generation ✓ |
| 128 | +- ✓ Implement E-K permutation generation (`build/ek_generator.py`) |
| 129 | +- ✓ Be-verb format: "X, be X, be Y, be Z" |
| 130 | +- ✓ Print dictionary output (`build/ek_dictionary.md`) |
| 131 | +- ✓ JSON index output (`build/ek_index.json`) |
| 132 | +- ✓ TKD convention formatting: _English_ — **Klingon** |
| 133 | +- ✓ LaTeX dictionary generator (`build/latex_generator.py`) |
| 134 | + - Generates K-E and E-K sections |
| 135 | + - Handles all sections: base, ficnames, loanwords, places |
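The be-verb format listed above can be illustrated with a small sketch. It is not the logic of `build/ek_generator.py`: the function name `ek_permutations` is hypothetical, and the ordering of the non-leading parts (kept in their original order) is an assumption, since only the first permutation, "X, be X, be Y, be Z", is documented here.

```python
def ek_permutations(parts):
    """Build one E-K lookup line per definition part.

    Each part takes a turn as the lead. A lead of the form "be X" is
    preceded by the bare keyword "X" so the line sorts under that word.
    The remaining parts follow in their original order (assumption).
    """
    lines = []
    for i, lead in enumerate(parts):
        rest = parts[:i] + parts[i + 1:]
        if lead.startswith("be "):
            head = [lead[len("be "):], lead]
        else:
            head = [lead]
        lines.append(", ".join(head + rest))
    return lines
```

With `["be X", "be Y", "be Z"]` as input, the first line produced is `X, be X, be Y, be Z`, matching the format above. A single-part definition yields a single line.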
| 136 | + |
| 137 | +**E-K Statistics:** |
| 138 | +- 6,437 entries loaded |
| 139 | +- 8,102 E-K lookup entries generated |
| 140 | +- 1,389 entries with multiple parts (permutations) |
| 141 | +- 29 guard cases (no_permute) |
| 142 | +- 91 dedup cases |
| 143 | + |
| 144 | +**LaTeX Output Statistics:** |
| 145 | +- 5,300 base entries |
| 146 | +- 111 fictional names |
| 147 | +- 416 loanwords |
| 148 | +- 249 places |
| 149 | + |
| 150 | +### Phase 6: Validation and Testing (Future) |
| 151 | +- Implement all validation rules from SPEC.md |
| 152 | +- Performance testing |
| 153 | + |
| 154 | +### Phase 7: Migration Completion (Future) |
| 155 | +- Remove XML source files |
| 156 | +- Update documentation |
| 157 | +- Update contributor guide |
| 158 | + |
| 159 | +## Entry YAML Format |
| 160 | + |
| 161 | +Each entry file contains either a single entry or multiple homophones. A single-entry file looks like this: |
| 162 | + |
| 163 | +```yaml |
| 164 | +# Single entry |
| 165 | +entry: |
| 166 | + entry_name: "bach" |
| 167 | + slug: "bach_v" |
| 168 | + part_of_speech: "v:t_c,klcp1,weap" |
| 169 | + pos: "v" |
| 170 | + pos_subtype: "t_c" |
| 171 | + status: "active" |
| 172 | + definition: "shoot" |
| 173 | + notes: "..." |
| 174 | + sources: |
| 175 | + raw: "[1] {TKD:src}" |
| 176 | + citations: |
| 177 | + - source: tkd |
| 178 | + translations: |
| 179 | + de: |
| 180 | + definition: "schießen" |
| 181 | + # ... other languages |
| 182 | + _original_id: 10002 |
| 183 | +``` |
| 184 | + |
| 185 | +## Notes |
| 186 | + |
| 187 | +1. The `_original_id` field preserves the entry's ID from the original XML for backward compatibility. |
| 188 | + |
| 189 | +2. The `part_of_speech` field retains the original combined format for backward compatibility, while `pos`, `pos_subtype`, `categories`, and `metadata_tags` provide the parsed components. |
| 190 | + |
| 191 | +3. The `sources` field contains both the `raw` original text and parsed `citations` for future use. |
| 192 | + |
| 193 | +4. XML source files are retained during the transition period but are no longer the source of truth for the YAML pipeline. |
| 194 | + |
| 195 | +5. The `definition` field can be a simple string or a structured object with: |
| 196 | + - `text`: The full definition text |
| 197 | + - `parts`: Array of definition parts for E-K permutation |
| 198 | + - `global_parenthetical`: Parenthetical that applies to all parts |
| 199 | + - `no_permute`: Guard flag to prevent E-K permutation (for birds, exclamations, etc.) |
| 200 | + - `dedup`: Flag to prevent nearby duplicate E-K entries (for actor/actress, etc.) |
| 201 | + - `etc_suffix`: Flag indicating definition ends with ", etc." |
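Because `definition` may be either a string or a structured object, consumers may find it convenient to normalize both shapes first. The sketch below is one way to do that. The default values (empty `parts`, flags set to `False`, no parenthetical) are assumptions, and `normalize_definition` is a hypothetical helper, not a function from the build scripts.

```python
def normalize_definition(definition):
    """Return a dict with every documented definition field present."""
    defaults = {
        "text": "",
        "parts": [],
        "global_parenthetical": None,
        "no_permute": False,
        "dedup": False,
        "etc_suffix": False,
    }
    if isinstance(definition, str):
        # A plain string is treated as a single-part definition (assumption).
        return {**defaults, "text": definition, "parts": [definition]}
    if isinstance(definition, dict):
        merged = {**defaults, **definition}
        if not merged["parts"] and merged["text"]:
            merged["parts"] = [merged["text"]]
        return merged
    raise TypeError(f"unsupported definition type: {type(definition).__name__}")
```

For instance, `normalize_definition("shoot")` returns a dict whose `text` is `"shoot"`, whose `parts` is `["shoot"]`, and whose flags are all `False`.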