Commit 4f9b733

Merge pull request #997 from dlyongemallo/main
begin schema migration
2 parents 86c2f8f + ef0a84d commit 4f9b733

6,210 files changed: 353,824 additions and 0 deletions.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 qawHaq.db
+qawHaq.json
 mem.sql
 mem.xml
 mem_processed.xml

CLAUDE.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Klingon language dictionary data files for **boQwI'** and associated apps. This is a git submodule containing XML source files that are compiled into the SQLite database used by the Android app.

## File Organization

XML source files organized by Klingon letter:
- `mem-01-b.xml` through `mem-25-u.xml` - Main lexicon entries by letter
- `mem-26-suffixes.xml` - Grammatical suffixes
- `mem-27-extra.xml` - Non-canon/uncertain entries, transliterations, movie sentences
- `mem-28-examples.xml` - Pedagogical examples and complex word searches
- `mem-00-header.xml` / `mem-29-footer.xml` - XML structure wrappers
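
The numeric prefixes suggest that the fragments are meant to be joined in filename order, with the header first and the footer last. The following is a minimal sketch of that idea, assuming plain concatenation is enough. `combine_xml` is a hypothetical helper, not part of this repository, and `generate_db.sh` may well assemble the combined file differently.

```python
from pathlib import Path


def combine_xml(source_dir, output_path):
    """Concatenate mem-*.xml fragments in filename order into one file.

    Assumption: plain concatenation in sorted order yields a well-formed
    document. This is an illustration, not the logic of generate_db.sh.
    """
    source_dir = Path(source_dir)
    # The two-digit prefixes sort lexicographically, so mem-00-header.xml
    # comes first and mem-29-footer.xml comes last.
    fragments = sorted(source_dir.glob("mem-*.xml"))
    with Path(output_path).open("w", encoding="utf-8") as out:
        for fragment in fragments:
            out.write(fragment.read_text(encoding="utf-8"))
    return [fragment.name for fragment in fragments]
```

The glob pattern requires a hyphen after `mem`, so an output file named `mem.xml` in the same directory is not picked up as an input.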

## Build Commands

### XML Pipeline (Original)
```bash
# Generate database (interactive mode with diff review)
./generate_db.sh

# Generate database non-interactively (used by Android build)
./generate_db.sh --noninteractive

# Generate only the combined XML file (for debugging)
./generate_db.sh --xmlonly

# Review changes for a specific language before PR
./review_changes.sh <lang_code> [commit]
```

### YAML Pipeline (New)
```bash
# Generate database from YAML sources (interactive mode)
./generate_db_yaml.sh

# Generate database non-interactively (for CI/CD)
./generate_db_yaml.sh --noninteractive
```

The YAML pipeline reads from `entries/*.yaml` and produces identical output to the XML pipeline. See `MIGRATION_STATUS.md` for details on the migration.

The `generate_db.sh` script validates entries and checks for:
- Missing German/Portuguese/Finnish definitions
- Broken entry references
- Misplaced spaces/commas
- Missing translations (fields containing "TRANSLATE")
- New `{ngh}`/`{ngH}` entries or two-letter verbs (require parser updates)
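
Two of the checks listed above lend themselves to a short illustration. This is a rough sketch only: `check_field` is a hypothetical helper, and the exact rules for what counts as a "misplaced" space or comma are assumptions, not the rules `generate_db.sh` applies.

```python
import re


def check_field(name, value):
    """Return a list of warning strings for one text field.

    Assumed rules (not taken from generate_db.sh): the marker "TRANSLATE"
    means a translation is missing; leading/trailing whitespace, doubled
    spaces, and a space before a comma count as misplaced.
    """
    warnings = []
    if "TRANSLATE" in value:
        warnings.append(f"{name}: missing translation")
    if value != value.strip():
        warnings.append(f"{name}: leading or trailing whitespace")
    if re.search(r" {2,}", value):
        warnings.append(f"{name}: repeated spaces")
    if re.search(r"\s,", value):
        warnings.append(f"{name}: space before comma")
    return warnings
```

For example, `check_field("definition", "shoot , fire")` reports a space before the comma, while a clean value returns an empty list.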

## Entry Guidelines

- Use `blank.xml` template when adding new entries
- `entry_name` must exactly match the original source (important for KWOTD matching)
- `notes` fields are for "in-universe" information; `hidden_notes` for meta/out-of-universe info
- Full sentences should have final punctuation
- Link to other entries only once per entry; use `nolink` tag for subsequent references
- Translations can take liberties to convey meaning; words in brackets/quotes may need `search_tags`

## Parser Caveats

The Android parser has hardcoded lists that must be updated when adding certain entries:

1. **{ngh}/{ngH} entries**: The sequence "ngh" is ambiguous in xifan hol mode (could be **n**+**gh** or **ng**+**H**). Update the hardcoded list in `KlingonContentDatabase.java`.

2. **Two-letter verbs**: Short queries (≤4 letters) have special handling. Update the hardcoded list when adding 2-letter verbs.

The `generate_db.sh` script outputs warnings when entries matching these criteria are added or changed.
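
The "ngh" ambiguity can be made concrete by enumerating every way a case-insensitive query splits into Klingon letters. The sketch below is illustrative only: `segmentations` is a hypothetical function, it treats the query simply as lowercased standard spelling, and it is not how `KlingonContentDatabase.java` resolves such input.

```python
# The standard Klingon letter inventory; digraphs and the trigraph "tlh"
# each count as a single letter.
KLINGON_LETTERS = [
    "a", "b", "ch", "D", "e", "gh", "H", "I", "j", "l", "m", "n", "ng",
    "o", "p", "q", "Q", "r", "S", "t", "tlh", "u", "v", "w", "y", "'",
]


def segmentations(text):
    """Return every split of a lowercase string into Klingon letters.

    Case is ignored when matching, which is the assumption that makes
    "ngh" ambiguous here.
    """
    if not text:
        return [[]]
    results = []
    for letter in KLINGON_LETTERS:
        if text.startswith(letter.lower()):
            for rest in segmentations(text[len(letter):]):
                results.append([letter] + rest)
    return results


print(segmentations("ngh"))  # prints [['n', 'gh'], ['ng', 'H']]
```

Both readings are valid letter sequences, which is why a lookup cannot decide between them without an explicit list of affected entries.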

## Translation Workflow

- Run `call_google_translate.py` to auto-translate fields containing "TRANSLATE"
- Commits with manual translations should change only one language
- Use "Squash and merge" for large translation PRs
- Run `review_changes.sh <lang>` before submitting PRs

## Dictionary Generation

The `build/` directory contains tools for generating dictionary outputs:

```bash
# Generate E-K dictionary (Markdown/JSON)
python3 build/ek_generator.py

# Generate LaTeX dictionary (K-E and E-K sections)
python3 build/latex_generator.py > build/dictionary.tex
```

**E-K Generator** produces:
- `build/ek_dictionary.md` - Markdown format for print dictionary
- `build/ek_index.json` - JSON format for apps

**LaTeX Generator** produces:
- `build/dictionary.tex` - Complete K-E and E-K dictionary in LaTeX
- Sections: base, ficnames (fictional names), loanwords, places

Key files:
- `build/definition_parser.py` - Parses definitions into structured parts
- `build/ek_generator.py` - Generates E-K permutations (Markdown/JSON)
- `build/latex_generator.py` - Generates LaTeX output from YAML

Definition flags in YAML entries:
- `no_permute: true` - Guard cases that should NOT be split (birds, exclamations)
- `dedup: true` - Prevent nearby duplicate E-K entries (actor/actress)
- `etc_suffix: true` - Definition ends with ", etc."

MIGRATION_STATUS.md

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
# Migration Status

**Last Updated:** 2026-01-11

This document tracks the progress of migrating the Klingon lexicon database from XML to YAML format, as specified in SPEC.md.

## Current Status: Phase 5 In Progress

The YAML pipeline is now functional and produces **byte-identical output** to the XML pipeline for both SQLite (qawHaq.db) and JSON (qawHaq.json) formats. E-K dictionary generation is now implemented.

## Completed Work

### Phase 1: Source Registry ✓
- Created `sources.yaml` with source metadata
- Source parsing integrated into migration script
- Sources are stored in structured format within entry YAML files

### Phase 2: Entry File Structure ✓
- **6,142 YAML files** containing **6,407 entries** migrated from XML
- Directory structure: `entries/{pos_type}/{first_letter}/{entry}.yaml`
- Suffixes stored separately: `entries/suffixes/{verb|noun}/`
- Migration script: `build/migrate_xml.py`
- YAML parsers: `build/yaml2sql.py`, `build/yaml2json.py`
- Build script: `generate_db_yaml.sh`
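
A file count like the one quoted above can be re-checked with a short directory walk. This is a minimal sketch assuming the `entries/` layout described in this list. `count_entry_files` is a hypothetical helper, and it counts files only, since counting entries would require parsing each YAML file.

```python
from collections import Counter
from pathlib import Path


def count_entry_files(entries_dir):
    """Count *.yaml files under entries_dir, grouped by top-level directory."""
    entries_dir = Path(entries_dir)
    counts = Counter()
    for path in entries_dir.rglob("*.yaml"):
        # The first path component below entries/ is the part-of-speech
        # directory, e.g. "verbs", "nouns", or "suffixes".
        counts[path.relative_to(entries_dir).parts[0]] += 1
    return counts
```

Calling `sum(count_entry_files("entries").values())` from the repository root gives the total number of entry files to compare against the documented figure.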

### Shared Notes
- **18 shared note files** extracted to `notes/` directory
- Notes that appear in 3+ entries are stored as reusable references

## File Structure

```
data/
├── entries/                    # 6,142 YAML entry files
│   ├── verbs/
│   ├── nouns/
│   ├── adverbials/
│   ├── conjunctions/
│   ├── questions/
│   ├── sentences/
│   ├── exclamations/
│   └── suffixes/
│       ├── verb/
│       └── noun/
├── notes/                      # 18 shared note files
├── sources.yaml                # Source registry
├── build/
│   ├── migrate_xml.py          # XML to YAML migration
│   ├── yaml2sql.py             # YAML to SQL generation
│   ├── yaml2json.py            # YAML to JSON generation
│   ├── definition_parser.py    # Parse definitions for E-K
│   ├── source_parser.py
│   ├── ek_generator.py         # Generate E-K dictionary (Markdown/JSON)
│   ├── latex_generator.py      # Generate LaTeX dictionary
│   ├── ek_dictionary.md        # E-K output (Markdown)
│   ├── ek_index.json           # E-K output (JSON)
│   └── dictionary.tex          # LaTeX output (K-E + E-K)
├── generate_db_yaml.sh         # New YAML-based build script
├── generate_db.sh              # Original XML-based build script (still works)
└── mem-*.xml                   # Original XML source files (retained for now)
```

## Build Commands

### Using YAML Pipeline (New)
```bash
# Generate database from YAML sources
./generate_db_yaml.sh

# Non-interactive mode (for CI/CD)
./generate_db_yaml.sh --noninteractive
```

### Using XML Pipeline (Original)
```bash
# Generate database from XML sources
./generate_db.sh

# Non-interactive mode
./generate_db.sh --noninteractive
```

### Generating E-K Dictionary
```bash
# Generate E-K dictionary from YAML entries
python3 build/ek_generator.py

# Outputs:
#   build/ek_dictionary.md  - Markdown for print
#   build/ek_index.json     - JSON for apps
```

### Generating LaTeX Dictionary
```bash
# Generate complete LaTeX dictionary (K-E + E-K)
python3 build/latex_generator.py > build/dictionary.tex

# Sections: base, ficnames, loanwords, places
```

## Verification

The YAML pipeline has been verified to produce identical output:

| Output | YAML Pipeline | XML Pipeline | Match |
|--------|---------------|--------------|-------|
| qawHaq.db entries | 6,407 | 6,407 | ✓ |
| qawHaq.json size | 6,496,002 bytes | 6,496,002 bytes | ✓ |
| SQL dump diff | 0 lines | - | ✓ |
| JSON diff | 0 lines | - | ✓ |
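
A check like the "SQL dump diff" row can be reproduced with the standard library alone. The sketch below is one possible approach, not the verification procedure actually used for this migration. `sql_dump` and `sql_dump_diff` are hypothetical helpers built on `sqlite3.Connection.iterdump()` and `difflib.unified_diff()`.

```python
import difflib
import sqlite3


def sql_dump(db_path):
    """Return the SQL text dump of a SQLite database as a list of statements."""
    conn = sqlite3.connect(db_path)
    try:
        return list(conn.iterdump())
    finally:
        conn.close()


def sql_dump_diff(db_a, db_b):
    """Return unified-diff lines between two database dumps.

    An empty list means the two databases dump to identical SQL.
    """
    return list(
        difflib.unified_diff(
            sql_dump(db_a),
            sql_dump(db_b),
            fromfile=str(db_a),
            tofile=str(db_b),
            lineterm="",
        )
    )
```

Comparing dumps rather than raw bytes keeps the result readable when the databases differ. For a strict byte-for-byte check of the `.db` or `.json` files, `filecmp.cmp(a, b, shallow=False)` is the simpler tool.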

## Remaining Work

### Phase 3: Definition Structure ✓
- ✓ Parse definitions into structured parts for E-K dictionary
- ✓ Extract global parentheticals
- ✓ Add sort keyword support
- ✓ Guard cases (`no_permute` flag) for definitions that should not be split
- ✓ Deduplication (`dedup` flag) to prevent nearby duplicate E-K entries
- Pending: Human review of ambiguous parses

### Phase 4: Notes and Examples (Future)
- Further extraction of shared notes
- Structure examples with stanzas
- Update entry references

### Phase 5: E-K Generation ✓
- ✓ Implement E-K permutation generation (`build/ek_generator.py`)
- ✓ Be-verb format: "X, be X, be Y, be Z"
- ✓ Print dictionary output (`build/ek_dictionary.md`)
- ✓ JSON index output (`build/ek_index.json`)
- ✓ TKD convention formatting: _English_**Klingon**
- ✓ LaTeX dictionary generator (`build/latex_generator.py`)
  - Generates K-E and E-K sections
  - Handles all sections: base, ficnames, loanwords, places

**E-K Statistics:**
- 6,437 entries loaded
- 8,102 E-K lookup entries generated
- 1,389 entries with multiple parts (permutations)
- 29 guard cases (no_permute)
- 91 dedup cases

**LaTeX Output Statistics:**
- 5,300 base entries
- 111 fictional names
- 416 loanwords
- 249 places

### Phase 6: Validation and Testing (Future)
- Implement all validation rules from SPEC.md
- Performance testing

### Phase 7: Migration Completion (Future)
- Remove XML source files
- Update documentation
- Update contributor guide
159+
## Entry YAML Format
160+
161+
Each entry file contains either a single entry or multiple homophones:
162+
163+
```yaml
164+
# Single entry
165+
entry:
166+
entry_name: "bach"
167+
slug: "bach_v"
168+
part_of_speech: "v:t_c,klcp1,weap"
169+
pos: "v"
170+
pos_subtype: "t_c"
171+
status: "active"
172+
definition: "shoot"
173+
notes: "..."
174+
sources:
175+
raw: "[1] {TKD:src}"
176+
citations:
177+
- source: tkd
178+
translations:
179+
de:
180+
definition: "schießen"
181+
# ... other languages
182+
_original_id: 10002
183+
```
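
In the example above, the combined `part_of_speech` value `"v:t_c,klcp1,weap"` sits alongside the parsed `pos` and `pos_subtype` fields. The sketch below shows one way such a split could work. `split_part_of_speech` is a hypothetical function, and the rule that the first attribute after the colon is the subtype is an assumption drawn from this single example. The real rules live in the migration script and may differ, including how the remaining tags are divided between `categories` and `metadata_tags`.

```python
def split_part_of_speech(part_of_speech):
    """Split a combined value such as "v:t_c,klcp1,weap" into components.

    Assumption: the text before ":" is the base part of speech, the first
    comma-separated attribute is the subtype, and the rest are tags.
    """
    pos, _, rest = part_of_speech.partition(":")
    attributes = [attr for attr in rest.split(",") if attr]
    subtype = attributes[0] if attributes else None
    return {"pos": pos, "pos_subtype": subtype, "tags": attributes[1:]}


print(split_part_of_speech("v:t_c,klcp1,weap"))
# prints {'pos': 'v', 'pos_subtype': 't_c', 'tags': ['klcp1', 'weap']}
```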

## Notes

1. The `_original_id` field preserves the entry's ID from the original XML for backward compatibility.

2. The `part_of_speech` field retains the original combined format for backward compatibility, while `pos`, `pos_subtype`, `categories`, and `metadata_tags` provide the parsed components.

3. The `sources` field contains both the `raw` original text and parsed `citations` for future use.

4. XML source files are retained during the transition period but are no longer the source of truth for the YAML pipeline.

5. The `definition` field can be a simple string or a structured object with:
   - `text`: The full definition text
   - `parts`: Array of definition parts for E-K permutation
   - `global_parenthetical`: Parenthetical that applies to all parts
   - `no_permute`: Guard flag to prevent E-K permutation (for birds, exclamations, etc.)
   - `dedup`: Flag to prevent nearby duplicate E-K entries (for actor/actress, etc.)
   - `etc_suffix`: Flag indicating definition ends with ", etc."
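
Because `definition` may be either a string or an object, code that consumes it usually needs to normalize the two shapes first. The following sketch shows one way to do that under stated assumptions. `normalize_definition` and `ek_headwords` are hypothetical names, the default values are guesses, and the real logic in `build/definition_parser.py` and `build/ek_generator.py` is likely more involved.

```python
def normalize_definition(definition):
    """Return the structured form of a definition, whatever shape it came in.

    Assumed defaults: all flags are False, there is no global parenthetical,
    and a definition without explicit parts has a single part equal to its
    full text.
    """
    if isinstance(definition, str):
        definition = {"text": definition}
    normalized = {
        "global_parenthetical": None,
        "no_permute": False,
        "dedup": False,
        "etc_suffix": False,
    }
    normalized.update(definition)
    normalized.setdefault("parts", [normalized["text"]])
    return normalized


def ek_headwords(definition):
    """Return the English headwords a definition would be listed under."""
    normalized = normalize_definition(definition)
    if normalized["no_permute"]:
        # Guard case: keep the definition whole instead of splitting it.
        return [normalized["text"]]
    return list(normalized["parts"])
```

With this shape, `ek_headwords("shoot")` yields a single headword, while an object with several `parts` yields one headword per part unless `no_permute` is set.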

README.md

Lines changed: 5 additions & 0 deletions
@@ -1,6 +1,11 @@
 klingon-assistant-data
 ======================
 
+**Note:** The database is undergoing a schema migration. The following applies to
+the current schema. There will be a period during which data will be in both
+formats, with automatic conversion between them. After the migration is
+complete, this README will be updated to reflect the new schema.
+
 Klingon language data files for **boQwI'** and associated apps.
 
 The `notes` fields are for typical users of the lexicon. An attempt should be
