Commit ec5d732

feat(translation): complete translation parity review batches (#144)
* chore(i18n): update locale code maps and parity scripts
* feat(notion-fetch): enforce deterministic locale grouping and category generation
* feat(notion-fetch): improve locale section fallback and callout transforms
* feat(notion-fetch): preserve callout children and harden title extraction
* feat(notion-translate): preserve rich text and add conversion diagnostics
* test(translation): add locale parity harness and tracking docs
* fix(i18n): normalize es/pt translation key parity
* Codex-generated pull request (#146)
* fix(parity): harden tokenizer and icon stripping behavior
* docs(translation): audit resolved and remaining parity issues
* fix(callout): support locale punctuation separators
* fix(translation): close remaining parity harness and typing gaps
* fix(translation): resolve remaining merge-blocking type issues
* fix(test): align notion translation env bootstrap with runtime validation
* fix(parity): avoid nested-list and setext tokenization false positives
* fix(translation): correct callout delimiter and less language mapping
* fix(parity): preserve accented callouts and nested-list parent order
* fix(i18n): remove cross-language entries swapped into es and pt locale files

  Portuguese strings were appended to i18n/es/code.json and Spanish strings to i18n/pt/code.json. All the misplaced entries are duplicates of keys already present in the correct locale file, so they can be removed outright.
* fix(i18n): remove self-referential translated keys from es and pt locale files

  Six entries in es/code.json used Spanish text as both key and value, and six entries in pt/code.json used Portuguese text as both key and value. Three of each were duplicates of correctly English-keyed entries; the remaining three were placeholder Notion labels (New Page, New Section Title, New Toggle, Planejamento e Preparação para um Projeto) that had no English counterpart. Removing all 12 reduces the cross-locale key diff from 12 to 0, unblocking the verify-locale-output.test.ts parity check (max allowed: 3).
* chore(i18n): stop tracking locale files on feature branch

  i18n/ is gitignored and belongs only on the content branch. Previous commits on this branch incorrectly force-added and patched i18n/es/code.json and i18n/pt/code.json directly. The content branch does not have those bad entries, so no fix is needed there. Remove the files from git tracking so the gitignore rule is respected.
* ci(test): fetch i18n content from content branch before running tests

  The i18n/ directory is gitignored and lives only on the content branch. Tests in verify-locale-output.test.ts require locale files to be present, so mirror the pattern used by the deploy-pr-preview workflow:

      git checkout origin/content -- i18n/

  Falls back gracefully (|| echo) if the content branch is unavailable.
* docs: update _category_.json creation to include all locales

  Correction reflects code change in sectionProcessors.ts that removed the English-only guard. 🤖 Generated with [Qoder][https://qoder.com]
* refactor: remove unused createAnnotations function

  Dead code identified in PR review - function was defined but never called. 🤖 Generated with [Qoder][https://qoder.com]
* fix(translate): add table support and rate limit guard for block deletion

  - Import remark-gfm so GFM tables are parsed (without it, table syntax falls through to plain-text paragraphs)
  - Add TableNode/TableRowNode/TableCellNode types and a 'table' case in the markdown-to-Notion switch, producing a Notion table block with inline table_row children and has_column_header based on row count
  - Add 100ms delay between individual block delete calls to avoid hitting Notion rate limits when updating pages with many blocks
  - Add 3 tests covering full table conversion, single-row tables, and the no-paragraph-fallback invariant
* fix(ci,scripts): harden translation root/filter and frontmatter handling
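For the table support described in the `fix(translate)` entry above, the conversion might look roughly like the following. This is a minimal sketch, not the code in this commit: the node interfaces are simplified stand-ins for the mdast table nodes, `tableToNotionBlock` is a hypothetical name, and the rule that `has_column_header` is true only when the table has more than one row is an assumption about what "based on row count" means. The output follows the shape the Notion API expects for a `table` block with inline `table_row` children.

```typescript
// Simplified stand-ins for mdast table nodes (real nodes carry more fields).
interface TextNode { type: "text"; value: string }
interface TableCellNode { type: "tableCell"; children: TextNode[] }
interface TableRowNode { type: "tableRow"; children: TableCellNode[] }
interface TableNode { type: "table"; children: TableRowNode[] }

// Subset of the Notion block shapes needed for a table.
interface NotionRichText { type: "text"; text: { content: string } }
interface NotionTableRow {
  object: "block";
  type: "table_row";
  table_row: { cells: NotionRichText[][] };
}
interface NotionTable {
  object: "block";
  type: "table";
  table: {
    table_width: number;
    has_column_header: boolean;
    has_row_header: boolean;
    children: NotionTableRow[];
  };
}

// Hypothetical helper: converts one parsed GFM table into a Notion table block.
function tableToNotionBlock(node: TableNode): NotionTable {
  // Use the widest row so every row ends up with the same number of cells.
  const width = Math.max(0, ...node.children.map((row) => row.children.length));

  const rows = node.children.map((row): NotionTableRow => {
    const cells = row.children.map((cell): NotionRichText[] => {
      const content = cell.children.map((t) => t.value).join("");
      return content === "" ? [] : [{ type: "text", text: { content } }];
    });
    // Pad short rows with empty cells.
    while (cells.length < width) cells.push([]);
    return { object: "block", type: "table_row", table_row: { cells } };
  });

  return {
    object: "block",
    type: "table",
    table: {
      table_width: width,
      // Assumption: a single-row table has no separate header row.
      has_column_header: rows.length > 1,
      has_row_header: false,
      children: rows,
    },
  };
}
```

Emitting the rows inline as `children` of the table block keeps the table in one request instead of appending rows afterwards.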
1 parent 023dad4 commit ec5d732

25 files changed

+4235
-530
lines changed

.github/workflows/test.yml

Lines changed: 22 additions & 1 deletion
@@ -23,11 +23,32 @@ jobs:
         with:
           bun-version: latest

+      - name: Checkout i18n content from content branch
+        id: i18n_content
+        shell: bash
+        run: |
+          if git checkout origin/content -- i18n/; then
+            echo "has_i18n=true" >> "$GITHUB_OUTPUT"
+            echo "✅ Loaded i18n content from origin/content."
+          else
+            echo "has_i18n=false" >> "$GITHUB_OUTPUT"
+            echo "⚠️ origin/content checkout failed. Locale-dependent tests will be skipped."
+          fi
+
       - name: Install dependencies
         run: bun install

       - name: Rebuild sharp for CI environment
         run: npm rebuild sharp

       - name: Run tests
-        run: bun run test
+        shell: bash
+        run: |
+          if [ "${{ steps.i18n_content.outputs.has_i18n }}" = "true" ]; then
+            echo "✅ Running full test suite (including locale-dependent tests)."
+            bun run test
+          else
+            echo "⚠️ Skipping locale-dependent tests because origin/content checkout failed."
+            echo "Skipped: scripts/locale-parity.test.ts, scripts/verify-locale-output.test.ts"
+            bunx vitest run --pool=threads --exclude scripts/locale-parity.test.ts --exclude scripts/verify-locale-output.test.ts
+          fi

README.md

Lines changed: 13 additions & 1 deletion
@@ -18,6 +18,7 @@ This repository uses a **two-branch architecture** to separate code from generat
 - **`content` branch**: Generated documentation from Notion (docs/, i18n/, static/images/ ~29MB)

 **Why separate branches?**
+
 - Keeps main branch clean for code review and development
 - Reduces repository clone time for contributors
 - Separates content syncs from code changes
@@ -28,14 +29,17 @@ This repository uses a **two-branch architecture** to separate code from generat
 Before local development, you need content files. Choose one of these methods:

 **Option 1: Fetch from content branch** (Recommended - Fast)
+
 ```bash
 git fetch origin content
 git checkout origin/content -- docs/ i18n/ static/images/
 ```

 **Option 2: Generate from Notion** (Requires API access)
+
 1. Copy `.env.example` to `.env` and add your Notion API key and Database ID
 2. Fetch content:
+
 ```bash
 bun notion:fetch
 ```
@@ -48,7 +52,7 @@ The `bun notion:fetch` script pulls structured content from Notion and rewrites
 - **Sub-page grouping**: Parents must link their language variants through the `Sub-item` relation. Each linked child should set `Language` to `English`, `Spanish`, or `Portuguese`. Any other language values are ignored.
 - **Element Type field drives layout**:
   - `Element Type = Page` exports markdown, regenerates frontmatter, rewrites remote images under `static/images/`, and tracks compression savings for the summary.
-  - `Element Type = Toggle` creates a folder (plus `_category_.json` for English) and increments the “section folders” counter.
+  - `Element Type = Toggle` creates a folder (plus `_category_.json` for all locales) and increments the “section folders” counter.
   - `Element Type = Heading` stores the heading for the next `Page` entry’s sidebar metadata and increments the “title sections applied” counter.
 - **Summary counters**: The totals printed at the end reflect the actions above. Zeros mean no matching work occurred (for example, no toggles, no headings, or no images to optimize).
 - **Translations**: When a non-English child page is processed, its title is written to `i18n/<locale>/code.json` using the parent’s English title as the key. Ensure those files exist before running the script.
@@ -68,6 +72,7 @@ bun dev
 This command opens your browser automatically and reflects changes immediately.

 **Full local setup from scratch:**
+
 ```bash
 # Clone repository
 git clone https://github.com/digidem/comapeo-docs.git
@@ -99,6 +104,7 @@ The resulting files are placed in the `build` directory for deployment via any s
 #### How Deployment Works

 Deployments use a **checkout strategy**:
+
 1. Checkout `main` branch (code and scripts)
 2. Overlay content files from `content` branch (docs, i18n, images)
 3. Build the site with merged content
@@ -221,24 +227,28 @@ The repository includes several automated workflows for content management:
 #### Content Workflows (Push to `content` branch)

 **Sync Notion Docs** (`sync-docs.yml`)
+
 - **Trigger**: Manual dispatch or repository dispatch
 - **Purpose**: Fetches content from Notion and commits to `content` branch
 - **Target Branch**: `content`
 - **Environment**: Requires `NOTION_API_KEY` and `DATABASE_ID` secrets

 **Translate Docs** (`translate-docs.yml`)
+
 - **Trigger**: Manual dispatch or repository dispatch
 - **Purpose**: Generates translations and commits to `content` branch
 - **Target Branch**: `content`
 - **Environment**: Requires `NOTION_API_KEY`, `DATABASE_ID`, `OPENAI_API_KEY`

 **Fetch All Content for Testing** (`notion-fetch-test.yml`)
+
 - **Trigger**: Manual dispatch with optional force mode
 - **Purpose**: Tests complete content fetch from Notion
 - **Target Branch**: `content`
 - **Features**: Retry logic, detailed statistics, content validation

 **Clean All Generated Content** (`clean-content.yml`)
+
 - **Trigger**: Manual dispatch with confirmation
 - **Purpose**: Removes all generated content from `content` branch
 - **Target Branch**: `content`
@@ -247,11 +257,13 @@ The repository includes several automated workflows for content management:
 #### Deployment Workflows (Read from both branches)

 **Deploy to Staging** (`deploy-staging.yml`)
+
 - **Trigger**: Push to `main`, manual dispatch, or after content sync
 - **Process**: Checkout `main` + overlay `content` → build → deploy to GitHub Pages
 - **URL**: https://digidem.github.io/comapeo-docs

 **Deploy to Production** (`deploy-production.yml`)
+
 - **Trigger**: Push to `main` or manual dispatch
 - **Process**: Checkout `main` + overlay `content` → build → deploy to Cloudflare Pages
 - **URL**: https://docs.comapeo.app
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# Translation Improvements Progress Tracker

Date started: 2026-02-19
Branch: `feat/translation-parity-improvements`
Worktree: `.worktrees/translation-parity-improvements`

## Goal

Ensure auto-translation output is structurally equivalent across `en`, `pt`, and `es` markdown for every translated page family.

Allowed differences:

- media asset URLs/files/paths (images and other media links)

Required exact parity (after excluding media only):

- emojis (presence, position, and semantics)
- markdown structure and formatting
- heading hierarchy and order
- admonitions/callouts
- list structure and ordering
- code fences and inline code tokens
- non-media link structure and target parity
- frontmatter key set consistency (translated values allowed for text fields)

## Current Snapshot (Baseline)

Baseline check date: 2026-02-19

Verified failing family:

- Root page: `2331b081-62d5-80a1-810e-dbb15a2e0f68`
- EN: `docs/gathering-the-right-equipment-for-comapeo.md`
- PT: `i18n/pt/docusaurus-plugin-content-docs/current/reunindo-o-equipamento-certo-para-o-comapeo.md`
- ES: `i18n/es/docusaurus-plugin-content-docs/current/nueva-pgina.md`

Observed mismatches:

- heading levels/count/order mismatch
- list nesting/ordering mismatch
- non-media link mismatch
- ES body effectively empty vs EN/PT content

## Success Criteria

The translation pipeline is considered fixed only when all criteria pass:

1. For sampled and targeted translated families, parity checks pass for EN/PT/ES after media normalization.
2. No generated PT/ES markdown files are empty when EN source contains non-empty content.
3. Emoji conversion and placement is preserved across locales.
4. Markdown formatting survives translation roundtrip without structural loss.
5. CI/local test suite contains automated parity checks that fail on regressions.
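
Criterion 2 above is the easiest one to check mechanically. The following is a minimal sketch, assuming each family's markdown has already been read into strings; `Family`, `bodyOf`, and `emptyTranslations` are hypothetical names, and "empty" is taken to mean that nothing but whitespace remains once a leading YAML frontmatter block is dropped.

```typescript
type Locale = "en" | "pt" | "es";
type Family = Record<Locale, string>;

// Returns the markdown body with a leading YAML frontmatter block removed.
function bodyOf(markdown: string): string {
  return markdown.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "").trim();
}

// Lists the translated locales whose body is empty although the EN body is not.
function emptyTranslations(family: Family): Locale[] {
  if (bodyOf(family.en) === "") return [];
  return (["pt", "es"] as const).filter((locale) => bodyOf(family[locale]) === "");
}
```

A family like the baseline failure, where ES holds frontmatter only, would be reported as `["es"]`, which the guardrails in this tracker treat as a critical failure.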

## Verification Workflow (Repeatable)

### Step 1: Generate translation-child outputs from Notion

Targeted family:

```bash
bun run notion:fetch-auto-translation-children -- --page-id <root_page_id>
```

Batch mode:

```bash
bun run notion:fetch-auto-translation-children
```

This writes:

- `.cache/auto-translation-children-comparison.md`

### Step 2: Build EN/PT/ES file triplets

Source of truth:

- `.cache/auto-translation-children-comparison.md`

Select rows where EN/PT/ES all exist.

### Step 3: Run parity comparison (media-normalized)

Comparison must validate:

- headings
- lists
- admonitions/callouts
- code fences and inline code
- non-media links
- frontmatter key set
- emoji retention

Current implementation status:

- manual/subagent-driven parity analysis exists
- automated parity script and tests still needed (tracked in research + implementation backlog)
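
One way the Step 3 comparison could be automated is to reduce each file to a locale-independent structural signature and compare the signatures. The following is a minimal sketch, assuming line-based tokenization is good enough for a first pass; `MEDIA_EXT`, `structuralSignature`, and `parityMismatches` are hypothetical names, not the parity script in this repository. It covers headings, list nesting, code fences, admonitions, and non-media links, and leaves inline code, frontmatter keys, and emoji retention out.

```typescript
// Assumption: a link counts as media when it is an image or points at one of these extensions.
const MEDIA_EXT = /\.(png|jpe?g|gif|svg|webp|mp4|webm|pdf)(\?.*)?$/i;

// Reduces markdown to structural tokens that should match across locales.
function structuralSignature(markdown: string): string[] {
  const tokens: string[] = [];
  let inFence = false;
  for (const line of markdown.split(/\r?\n/)) {
    const fence = line.match(/^\s*```(\S*)/);
    if (fence) {
      // Record the opening fence with its language; the closing fence only toggles state.
      if (!inFence) tokens.push(`fence:${fence[1]}`);
      inFence = !inFence;
      continue;
    }
    if (inFence) continue; // code content is not structure

    const heading = line.match(/^(#{1,6})\s+\S/);
    if (heading) tokens.push(`h${heading[1].length}`);

    const item = line.match(/^(\s*)([-*+]|\d+\.)\s+\S/);
    if (item) tokens.push(`${/\d/.test(item[2]) ? "ol" : "ul"}@${item[1].length}`);

    const admonition = line.match(/^:::(\w+)/);
    if (admonition) tokens.push(`admonition:${admonition[1]}`);

    // Keep link targets, skipping images and other media (the allowed difference).
    const linkPattern = /(!?)\[[^\]]*\]\(([^)\s]+)[^)]*\)/g;
    let link: RegExpExecArray | null;
    while ((link = linkPattern.exec(line)) !== null) {
      const isImage = link[1] === "!";
      if (!isImage && !MEDIA_EXT.test(link[2])) tokens.push(`link:${link[2]}`);
    }
  }
  return tokens;
}

// Compares two signatures position by position and describes each difference.
function parityMismatches(reference: string[], candidate: string[]): string[] {
  const issues: string[] = [];
  const length = Math.max(reference.length, candidate.length);
  for (let i = 0; i < length; i++) {
    if (reference[i] !== candidate[i]) {
      issues.push(`#${i}: expected ${reference[i] ?? "(nothing)"}, got ${candidate[i] ?? "(nothing)"}`);
    }
  }
  return issues;
}
```

Comparing token sequences rather than rendered text lets translated wording differ freely while heading order, list depth, and link targets still have to line up. A real checker would more likely walk a parsed markdown tree than scan lines.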

### Step 4: Record results in this tracker

For each run, append:

- date/time
- command used
- families compared
- pass/fail counts
- failure categories
- links to changed code/tests
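
Appending a run by hand invites inconsistent rows, so a small formatter can help. This is a minimal sketch, assuming the six Run Log columns used in this tracker (Date, Scope, Compared Families, Pass, Fail, Notes); `RunLogEntry` and `formatRunLogRow` are hypothetical names, and the check that pass plus fail equals the number of compared families is an assumption about how the counts relate.

```typescript
interface RunLogEntry {
  date: string; // e.g. "2026-02-19"
  scope: string;
  compared: number;
  pass: number;
  fail: number;
  notes: string;
}

// Formats one entry as a markdown table row for the Run Log.
function formatRunLogRow(entry: RunLogEntry): string {
  // Assumption: every compared family is counted as either a pass or a fail.
  if (entry.pass + entry.fail !== entry.compared) {
    throw new Error("pass + fail must equal compared families");
  }
  // Escape pipes so free-text cells cannot break the table layout.
  const cell = (value: string | number): string => String(value).replace(/\|/g, "\\|");
  const cells = [entry.date, entry.scope, entry.compared, entry.pass, entry.fail, entry.notes];
  return `| ${cells.map(cell).join(" | ")} |`;
}
```

The returned row can be appended directly under the existing Run Log table.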

## Run Log

| Date       | Scope                                                | Compared Families | Pass | Fail | Notes                                                      |
| ---------- | ---------------------------------------------------- | ----------------: | ---: | ---: | ---------------------------------------------------------- |
| 2026-02-19 | Targeted root `2331b081-62d5-80a1-810e-dbb15a2e0f68` |                 1 |    0 |    1 | ES output empty/partial mismatch; structural parity failed |

## Implementation Backlog (Pipeline Hardening)

- [ ] Add deterministic parity checker script for EN/PT/ES triplets (media-normalized).
- [ ] Add Vitest coverage for parity checker with golden fixtures.
- [ ] Harden `scripts/notion-translate/markdownToNotion.ts` against block loss.
- [ ] Harden markdown generation path (`scripts/notion-fetch/*`) for locale consistency.
- [ ] Add CI gate for parity regressions on sampled families.

## Guardrails

- Do not patch generated markdown by hand.
- Fix generation code and rerun pipeline.
- Treat empty translated markdown as critical failure.
- Preserve existing media exception only; no broader exceptions.
