Commit 805515d

Merge pull request #17176 from ethereum/i18n/import/2026-01-27T15-06-08-vi
i18n: automated Crowdin translation import (vi)
2 parents e94b46b + 2f7bfa4 commit 805515d

311 files changed, +56837 -974 lines changed

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
1+
---
2+
title: "Translation href Sync Issues - Links Corrupted During Crowdin Translation"
3+
date: "2026-02-17"
4+
category: "integration-issues"
5+
tags:
6+
- translation
7+
- i18n
8+
- crowdin
9+
- link-integrity
10+
- glossary
11+
- html-structure
12+
- json-translations
13+
component: "src/intl/ translation JSON files"
14+
severity: "high"
15+
symptoms:
16+
- "Glossary links rendering as plain text instead of clickable anchors"
17+
- "Crowdin numbered placeholders (<0>, <1>) appearing in rendered content"
18+
- "Links pointing to wrong glossary entries"
19+
- "Duplicate nested <a> tags causing malformed HTML"
20+
- "Extra links present in translations that don't exist in English"
21+
---
22+
23+
# Translation href Sync Issues
24+
25+
## Problem
26+
27+
Translation PRs imported from Crowdin frequently contain corrupted `<a href="...">` tags in JSON translation files (`src/intl/{locale}/*.json`). The canonical English JSON files embed HTML links inside translation string values (e.g., `<a href="/glossary/#validator">validator</a>`), and those tags must come back from Crowdin with their `href` values unchanged. Five categories of errors show up in the imported strings:
28+
29+
1. **Placeholder substitution**: `<a href="/glossary/#defi">DeFi</a>` becomes `<0>DeFi</0>` (Crowdin numbered placeholder)
30+
2. **Link removal**: `<a href="/glossary/#key">keys</a>` becomes plain text `khoa`
31+
3. **Wrong targets**: `<a href="/glossary/#node">Node</a>` becomes `<a href="/glossary/#validator">tuy chon</a>`
32+
4. **Nested/duplicate tags**: `<a href="..."><a href="...">text</a>`
33+
5. **Extra links added**: Links present in translation but absent in English canonical
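
The five categories can be told apart mechanically by comparing the anchors in the English and translated values for one key. The following is a minimal sketch, not code from the repository: `classify` and its category labels are hypothetical names, and it assumes every `href` is double-quoted. Because it compares sets of hrefs, it does not notice a link that is merely repeated.

```python
import re

HREF = re.compile(r'href="([^"]*)"')
PLACEHOLDER = re.compile(r'</?\d+>')           # Crowdin numbered placeholder: <0>, </0>, <1>, ...
NESTED = re.compile(r'<a [^>]*>\s*<a [^>]*>')  # an <a> opened directly inside another <a>


def classify(en_value, translated_value):
    """Return the set of error categories found for one key (hypothetical helper)."""
    en_hrefs = HREF.findall(en_value)
    tr_hrefs = HREF.findall(translated_value)
    missing = set(en_hrefs) - set(tr_hrefs)  # in English, not in the translation
    extra = set(tr_hrefs) - set(en_hrefs)    # in the translation, not in English
    issues = set()
    if en_hrefs and PLACEHOLDER.search(translated_value):
        issues.add('placeholder')
    if NESTED.search(translated_value):
        issues.add('nested')
    if missing and extra:
        issues.add('wrong-target')
    elif missing and 'placeholder' not in issues:
        issues.add('removed')
    elif extra:
        issues.add('extra')
    return issues
```

For example, `classify('<a href="/glossary/#key">keys</a>', 'khoa')` reports `{'removed'}`, while the `<0>DeFi</0>` case reports `{'placeholder'}` rather than a plain removal.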
34+
35+
## First Occurrence
36+
37+
PR #17176 (Vietnamese translations) - 13 href issues across 4 files:
38+
- `src/intl/vi/page-roadmap.json` (6 issues)
39+
- `src/intl/vi/page-staking.json` (6 issues)
40+
- `src/intl/vi/glossary-tooltip.json` (1 issue)
41+
- `src/intl/vi/glossary.json` (1 issue)
42+
43+
## Investigation
44+
45+
### Step 1: Identify changed files
46+
```bash
47+
git diff dev --name-only -- 'src/intl/vi/**/*.json'
48+
```
49+
50+
### Step 2: Automated comparison script
51+
For each changed JSON file, flatten the JSON, extract all `href="..."` values from both the English (`src/intl/en/`) and translated versions, and compare using symmetric set difference:
52+
53+
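```python
import json
import re


def extract_urls(value):
    """Return every href="..." value found in a translation string."""
    return re.findall(r'href="([^"]*)"', value)


def flatten(data, prefix=''):
    """Flatten nested dicts/lists into {dotted.key: string_value}."""
    items = {}
    if isinstance(data, dict):
        for k, v in data.items():
            nk = f'{prefix}.{k}' if prefix else k
            if isinstance(v, (dict, list)):
                items.update(flatten(v, nk))
            elif isinstance(v, str):
                items[nk] = v
    elif isinstance(data, list):
        for i, v in enumerate(data):
            nk = f'{prefix}[{i}]'
            if isinstance(v, (dict, list)):
                items.update(flatten(v, nk))
            elif isinstance(v, str):
                items[nk] = v
    return items


# One possible implementation of the per-key comparison described in Step 2.
def compare_hrefs(en_path, translated_path):
    """Yield (key, hrefs missing from translation, extra hrefs) per mismatched key."""
    with open(en_path, encoding='utf-8') as f:
        en = flatten(json.load(f))
    with open(translated_path, encoding='utf-8') as f:
        translated = flatten(json.load(f))
    for key, en_value in en.items():
        if key not in translated:
            continue  # keys absent from the translation are not href issues
        en_urls = set(extract_urls(en_value))
        tr_urls = set(extract_urls(translated[key]))
        if en_urls ^ tr_urls:  # symmetric set difference
            yield key, en_urls - tr_urls, tr_urls - en_urls


# Also check for: nested <a> tags, Crowdin placeholders (<0>, <1>)
```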
80+
81+
### Step 3: Cross-check patterns
82+
- Nested anchors: `re.search(r'<a [^>]*><a [^>]*>', value)`
83+
- Crowdin placeholders where EN has real links: `re.search(r'<\d+>', vi_value)` when `re.search(r'<a href=', en_value)` is true
84+
85+
## Root Cause
86+
87+
1. **Crowdin editor behavior**: When translators restructure sentences, Crowdin converts `<a href="...">` tags into numbered placeholders (`<0>`, `<1>`) automatically
88+
2. **Translator misunderstanding**: Translators don't realize HTML href values must remain unchanged
89+
3. **Copy-paste errors**: Manual editing creates duplicate/nested anchor tags
90+
4. **No JSON href validation**: The post-import sanitizer (`src/scripts/i18n/post_import_sanitize.ts`) validates hrefs in Markdown files but performs zero href checking on JSON translation values
91+
92+
## Solution
93+
94+
For each affected key:
95+
96+
1. Read the English canonical value from `src/intl/en/[file].json`
97+
2. Read the translated value from `src/intl/{locale}/[file].json`
98+
3. Restore the exact `<a href="...">` structure from English while keeping translated display text
99+
4. Remove any extra links not present in English
100+
5. Fix nested `<a>` tags by removing duplicates
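
Step 3 can be partly automated for the placeholder case. The sketch below is illustrative only: `restore_placeholders` is a hypothetical helper, and it rests on the assumption that placeholder `<N>` corresponds to the N-th `<a ...>` tag in the English string, counting from zero. That mapping is not confirmed by the source and would not hold if the English value contains other tags, so any output still needs human review.

```python
import re

ANCHOR_OPEN = re.compile(r'<a [^>]*>')
PLACEHOLDER_PAIR = re.compile(r'<(\d+)>(.*?)</\1>', re.DOTALL)


def restore_placeholders(en_value, translated_value):
    """Swap <N>text</N> for the N-th English <a ...> tag, keeping the translated text."""
    en_tags = ANCHOR_OPEN.findall(en_value)

    def swap(match):
        index = int(match.group(1))
        if index >= len(en_tags):
            return match.group(0)  # no matching English anchor: leave for manual review
        return f'{en_tags[index]}{match.group(2)}</a>'

    return PLACEHOLDER_PAIR.sub(swap, translated_value)
```

Only the opening tag is copied from English; the display text between the placeholders stays exactly as the translator wrote it.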
101+
102+
### Example fix
103+
104+
**English** (`page-staking.json`):
105+
```json
106+
"page-staking-section-comparison-pools-rewards-li3": "Liquidity tokens can be held in your own wallet, used in <a href=\"/glossary/#defi\">DeFi</a> and sold..."
107+
```
108+
109+
**Vietnamese BEFORE** (link removed):
110+
```json
111+
"page-staking-section-comparison-pools-rewards-li3": "Token thanh khoản được lưu trữ trong ví riêng của bạn, được sử dụng trong DeFi và bán đi..."
112+
```
113+
114+
**Vietnamese AFTER** (link restored):
115+
```json
116+
"page-staking-section-comparison-pools-rewards-li3": "Token thanh khoản được lưu trữ trong ví riêng của bạn, được sử dụng trong <a href=\"/glossary/#defi\">DeFi</a> và bán đi..."
117+
```
118+
119+
## Prevention
120+
121+
### Priority 1: Extend the sanitizer for JSON href validation
122+
123+
The post-import sanitizer at `src/scripts/i18n/post_import_sanitize.ts` already has robust href validation for Markdown (`fixTranslatedHrefs`, lines 232-401). The `processJsonFile` function (lines 1273-1306) only does BOM normalization, smart quote replacement, and JSON parse validation. It performs zero href checking.
124+
125+
Add a `validateJsonHrefs` step to `processJsonFile` that:
126+
- Loads the corresponding English JSON file
127+
- Extracts `href="..."` values from both EN and translated strings per key
128+
- Flags missing, extra, wrong, nested, or placeholder hrefs
129+
- Auto-fixes unambiguous cases (single mismatch per key)
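
The "single mismatch per key" auto-fix could work roughly as follows. This is a Python sketch of the logic only, since the real `validateJsonHrefs` would be written in TypeScript inside the sanitizer; `autofix_single_mismatch` is a hypothetical name. It assumes links appear in the same order in both strings and declines to act otherwise.

```python
import re

HREF = re.compile(r'href="([^"]*)"')


def autofix_single_mismatch(en_value, translated_value):
    """Return (value, changed). Rewrites only when exactly one href differs."""
    en_hrefs = HREF.findall(en_value)
    tr_matches = list(HREF.finditer(translated_value))
    if len(en_hrefs) != len(tr_matches):
        return translated_value, False  # missing or extra links: needs a human
    diffs = [(en, m) for en, m in zip(en_hrefs, tr_matches) if en != m.group(1)]
    if len(diffs) != 1:
        return translated_value, False  # nothing to fix, or ambiguous
    expected, match = diffs[0]
    start, end = match.span(1)  # position of the wrong href value itself
    return translated_value[:start] + expected + translated_value[end:], True
```

Replacing by match position rather than by string search keeps the fix on the right anchor when the same href appears more than once in a value.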
130+
131+
### Priority 2: CI validation gate
132+
133+
Add a GitHub Actions check on PRs touching `src/intl/` that fails when href count mismatches, Crowdin placeholders, or nested anchors are detected. This should be a required status check on `dev` branch protection.
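
The core of such a gate might look like the sketch below, which checks the three failure conditions named above on already-flattened `{key: string}` dictionaries. `gate` is a hypothetical helper, not an existing script in the repository; a CI step would call it for each changed file and exit non-zero when the returned list is not empty.

```python
import re

ANCHOR = re.compile(r'<a [^>]*href="[^"]*"')
PLACEHOLDER = re.compile(r'</?\d+>')
NESTED = re.compile(r'<a [^>]*>\s*<a [^>]*>')


def gate(en, translated):
    """Return failure messages for two flat {key: string} dicts (hypothetical helper)."""
    failures = []
    for key, value in translated.items():
        en_value = en.get(key)
        if en_value is None:
            continue  # key has no English counterpart; out of scope for this check
        en_count = len(ANCHOR.findall(en_value))
        tr_count = len(ANCHOR.findall(value))
        if en_count != tr_count:
            failures.append(f'{key}: {tr_count} links, English has {en_count}')
        if en_count and PLACEHOLDER.search(value):
            failures.append(f'{key}: Crowdin placeholder in place of a link')
        if NESTED.search(value):
            failures.append(f'{key}: nested <a> tags')
    return failures
```

Counting anchors, rather than comparing href sets, also catches a link that was duplicated with the correct target.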
134+
135+
### Priority 3: Crowdin configuration
136+
137+
- Set JSON files to treat `<a href="...">` as protected tag pairs
138+
- Enable built-in "Tags mismatch" and "Broken URLs" QA checks
139+
- Add custom placeholder patterns for `href="[^"]*"` as non-editable tokens
140+
141+
### Priority 4: Reviewer checklist
142+
143+
When reviewing any PR touching `src/intl/`:
144+
- [ ] Anchor tag count parity per JSON key (EN vs translated)
145+
- [ ] No Crowdin numbered placeholders in output
146+
- [ ] No nested `<a>` tags
147+
- [ ] All `href="..."` values unchanged from English
148+
- [ ] No extra or missing links vs English
149+
150+
## Related Files
151+
152+
- `src/scripts/i18n/post_import_sanitize.ts` - Post-import sanitizer (needs JSON href support)
153+
- `src/scripts/i18n/lib/workflows/sanitization.ts` - Sanitization workflow runner
154+
- `.claude/commands/review-translations.md` - Translation review slash command
155+
- `.github/workflows/claude-review-translations.yml` - CI translation review workflow
156+
- `docs/header-ids.md` - Related: header IDs must also not be translated
157+
- `docs/solutions/translation-review/crowdin-import-review-vietnamese-pr-17176.md` - Full PR #17176 review post-mortem
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
1+
---
2+
title: Fix double-escaping of backslash-escaped angle brackets in MDX sanitizer
3+
date: 2026-02-28
4+
category: logic-errors
5+
component: Translation post-import sanitizer
6+
tags:
7+
- regex
8+
- mdx
9+
- escaping
10+
- i18n
11+
- sanitizer
12+
- translation
13+
- angle-brackets
14+
severity: high
15+
recurring: true
16+
languages_affected:
17+
- vi
18+
- cs
19+
- fr
20+
- ru
21+
files_modified:
22+
- src/scripts/i18n/post_import_sanitize.ts
23+
- tests/unit/sanitizer/standalone-fixes.spec.ts
24+
- public/content/translations/cs/developers/tutorials/how-to-mint-an-nft/index.md
25+
- public/content/translations/fr/developers/tutorials/how-to-mint-an-nft/index.md
26+
- public/content/translations/ru/developers/docs/networking-layer/portal-network/index.md
27+
- public/content/translations/ru/developers/tutorials/reverse-engineering-a-contract/index.md
28+
- public/content/translations/vi/developers/docs/networking-layer/portal-network/index.md
29+
- public/content/translations/vi/developers/tutorials/how-to-mint-an-nft/index.md
30+
- public/content/translations/vi/developers/tutorials/reverse-engineering-a-contract/index.md
31+
---
32+
33+
# Fix double-escaping of backslash-escaped angle brackets in MDX sanitizer
34+
35+
## Problem Symptom
36+
37+
Translated markdown files contained `\&lt;1 GB RAM` instead of the correct `\<1 GB RAM`. This rendered as literal `\&lt;` in the browser rather than the intended `<` character. The pattern appeared across multiple languages (vi, cs, fr, ru) in files like `portal-network/index.md`, `how-to-mint-an-nft/index.md`, and `reverse-engineering-a-contract/index.md`.
38+
39+
The English source uses `\<1` as a valid MDX backslash escape for `<` before digits. After running the sanitizer, translations gained the extra `&lt;` entity, producing a double-escape.
40+
41+
## Root Cause Analysis
42+
43+
The `escapeMdxAngleBrackets` function in `src/scripts/i18n/post_import_sanitize.ts` (line 1558) used the following regex:
44+
45+
```typescript
// BUGGY:
parts[i] = parts[i].replace(/(?<!&lt|&)<(\d)/g, (_, digit) => {
  fixCount++
  return `&lt;${digit}`
})
```
52+
53+
The negative lookbehind `(?<!&lt|&)` excluded:
54+
- `&lt` -- already HTML-entity-escaped (e.g., `&lt;1`)
55+
- `&` -- ampersand prefix
56+
57+
It did **NOT** exclude `\` (backslash). When MDX content contained a valid backslash escape like `\<1 GB RAM`, the regex matched the `<` because `\` was not in the lookbehind. The replacement transformed `\<1` into `\&lt;1` -- a double-escape.
58+
59+
This bug was introduced because backslash-escaping of `<` is a less common MDX pattern than entity-escaping. The original regex only anticipated the two most common preceding sequences (`&lt` and `&`) and missed the third valid escape prefix, the backslash.
60+
61+
## Working Solution
62+
63+
### The Fix
64+
65+
**File:** `src/scripts/i18n/post_import_sanitize.ts` (line 1558)
66+
67+
Added `\\` to the negative lookbehind:
68+
69+
```typescript
// BEFORE (buggy):
parts[i] = parts[i].replace(/(?<!&lt|&)<(\d)/g, (_, digit) => {

// AFTER (fixed):
parts[i] = parts[i].replace(/(?<!&lt|&|\\)<(\d)/g, (_, digit) => {
```
76+
77+
The `\\` in the regex source represents a literal `\` character in the lookbehind, so the pattern now reads: "match `<` followed by a digit, but NOT if preceded by `&lt`, `&`, or `\`".
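
The difference between the two patterns can be reproduced outside the sanitizer. The sketch below is a Python port for illustration only, since the sanitizer itself is TypeScript. Python's `re` module requires fixed-width lookbehinds, so the alternation is split into consecutive negative lookbehinds, which reject the same positions. The `escape` helper is a hypothetical stand-in for the replacement loop.

```python
import re

# Python ports of the buggy and fixed patterns (illustrative, not the sanitizer's code).
BUGGY = re.compile(r'(?<!&lt)(?<!&)<(\d)')
FIXED = re.compile(r'(?<!&lt)(?<!&)(?<!\\)<(\d)')


def escape(pattern, text):
    """Replace each matched '<digit' with '&lt;digit'; return (new_text, fix_count)."""
    return pattern.subn(lambda m: f'&lt;{m.group(1)}', text)
```

With the input `\<1 GB RAM`, `escape(BUGGY, ...)` performs one replacement and produces the double-escape, while `escape(FIXED, ...)` leaves the text untouched with a fix count of 0.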
78+
79+
### Tests Added
80+
81+
Two new unit tests in `tests/unit/sanitizer/standalone-fixes.spec.ts`:
82+
83+
```typescript
test("does not escape < that is already backslash-escaped", () => {
  const input =
    "Accessible to resource-constrained devices (\\<1 GB RAM, \\<100 MB disk space, 1 CPU)"
  const { content, fixCount } = escapeMdxAngleBrackets(input)
  expect(content).toBe(input)
  expect(fixCount).toBe(0)
})

test("does not escape backslash-escaped < before single digit", () => {
  const input = "do the same in \\<10 minutes"
  const { content, fixCount } = escapeMdxAngleBrackets(input)
  expect(content).toBe(input)
  expect(fixCount).toBe(0)
})
```
99+
100+
All 131 tests pass (129 existing + 2 new).
101+
102+
### Translation Files Repaired
103+
104+
Reverted `\&lt;` back to `\<` in 7 files across 4 languages:
105+
106+
| Language | File |
|----------|------|
| cs | `developers/tutorials/how-to-mint-an-nft/index.md` |
| fr | `developers/tutorials/how-to-mint-an-nft/index.md` |
| ru | `developers/docs/networking-layer/portal-network/index.md` |
| ru | `developers/tutorials/reverse-engineering-a-contract/index.md` |
| vi | `developers/docs/networking-layer/portal-network/index.md` |
| vi | `developers/tutorials/how-to-mint-an-nft/index.md` |
| vi | `developers/tutorials/reverse-engineering-a-contract/index.md` |
115+
116+
## Prevention Strategies
117+
118+
### 1. Lookbehind Completeness Checklist
119+
120+
For every negative lookbehind `(?<!...)` in the sanitizer, verify coverage of **all** escape character families:
121+
- `\` (backslash -- markdown/MDX escape)
122+
- `&` and `&lt;`, `&amp;`, `&#` (HTML entities)
123+
- `` ` `` (backtick -- code context)
124+
125+
### 2. Edge Case Test Matrix
126+
127+
Test these input patterns for any angle bracket escaping function:
128+
129+
| Input | Expected | Risk |
|-------|----------|------|
| `\<1` | unchanged | Backslash escape |
| `&lt;1` | unchanged | Entity escape |
| `&<1` | unchanged | Ampersand prefix |
| `<1` | `&lt;1` | Bare angle bracket |
| `\<10\<20` | unchanged | Multiple escapes |
| `\\<1` | `\\&lt;1` | Double backslash (literal `\` + bare `<`) |
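
The matrix lends itself to a table-driven check. The sketch below runs each row against a Python port of the fixed pattern, with the lookbehind split into fixed-width pieces because Python's `re` requires that. `run_matrix` is a hypothetical helper. Under this port, the double-backslash row is reported as a mismatch: a one-character lookbehind still sees a `\` immediately before `<` and leaves `\\<1` unchanged. Satisfying that row would presumably require counting consecutive backslashes, which the fix described in this document does not do.

```python
import re

# Python port of the fixed pattern (assumed to match the TypeScript regex's behavior).
FIXED = re.compile(r'(?<!&lt)(?<!&)(?<!\\)<(\d)')

# (input, expected) pairs copied from the matrix.
MATRIX = [
    (r'\<1', r'\<1'),
    ('&lt;1', '&lt;1'),
    ('&<1', '&<1'),
    ('<1', '&lt;1'),
    (r'\<10\<20', r'\<10\<20'),
    (r'\\<1', r'\\&lt;1'),
]


def run_matrix():
    """Return (input, expected, actual) for every row whose output differs."""
    failures = []
    for source, expected in MATRIX:
        actual = FIXED.sub(lambda m: f'&lt;{m.group(1)}', source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures
```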
137+
138+
### 3. Process Improvements
139+
140+
- **Dry-run diff viewer:** Run sanitizer in dry-run mode before committing, showing before/after for every file so reviewers can spot double-escaping visually.
141+
- **Regex audit:** Periodically grep for `(?<!` in sanitizer files and verify each lookbehind covers backslash escapes.
142+
- **Cross-language regression:** When a pattern is found in one language, scan all other languages for the same pattern before closing the issue.
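
The regex audit can go one step beyond a plain grep. The sketch below scans source text for negative lookbehinds and reports those whose body has no escaped-backslash alternative. `audit` is a hypothetical helper, and the pattern is deliberately simple: it assumes each lookbehind sits on one line and contains no nested groups.

```python
import re

# Matches "(?<!...)" and captures the body; escaped characters are consumed in pairs.
LOOKBEHIND = re.compile(r'\(\?<!((?:[^()\\]|\\.)*)\)')


def audit(source):
    """Yield (line_number, lookbehind_body) for lookbehinds lacking a backslash alternative."""
    for number, line in enumerate(source.splitlines(), start=1):
        for match in LOOKBEHIND.finditer(line):
            body = match.group(1)
            if r'\\' not in body:  # no escaped backslash inside the lookbehind
                yield number, body
```

Run over the two versions of the replace call shown earlier, this flags the buggy `(?<!&lt|&)` and stays silent on the fixed `(?<!&lt|&|\\)`.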
143+
144+
## Related Documentation
145+
146+
- [Crowdin Translation Sanitizer MDX Fence Bugs](../build-errors/crowdin-translation-sanitizer-mdx-fence-bugs.md) -- Patterns 12-15 covering `escapeMdxAngleBrackets` bugs
147+
- [Post-Import Sanitizer Regex Bugs: Whitespace Handling](./post-import-sanitizer-regex-bugs-whitespace-handling.md) -- Sibling regex bugs in the same sanitizer
148+
- [Sanitizer Test Research](../integration-issues/sanitizer-test-research.md) -- Comprehensive pattern catalog (Patterns 1-16)
149+
- [Known Patterns](~/.claude/translation-review/known-patterns.md) -- Pattern 6: Double-Escaping in MDX
150+
151+
## Commits
152+
153+
- `e6fa15813e` -- fix(i18n): fix backslash-escape double-encoding (regex fix + 2 tests + 7 file repairs)
154+
- `dd44c06a36` -- fix(i18n): review vi translations PR #17176 (bulk Vietnamese translation fixes)
