Commit 9dd4956

author
bram
committed
Fixed issue with whitespaces
1 parent 8d2e346 commit 9dd4956

3 files changed

+327
-11
lines changed

docs/usage.md

Lines changed: 80 additions & 0 deletions
@@ -426,6 +426,86 @@ gpt-po-translator --folder ./locales --lang de --no-ai-comment

---

## Whitespace Handling in Translations

### Overview

The tool automatically preserves leading and trailing whitespace from `msgid` entries in translations. While **best practice** is to handle whitespace in your UI framework rather than in translation strings, the tool ensures that any existing whitespace patterns are maintained exactly.

### How Whitespace Preservation Works

The tool uses a three-step process to reliably preserve whitespace:

1. **Detection and Warning**
   When processing PO files, the tool scans for entries with leading or trailing whitespace and logs a warning:
   ```
   WARNING: Found 3 entries with leading/trailing whitespace in messages.po
   Whitespace will be preserved in translations, but ideally should be handled in your UI framework.
   ```

2. **Before Sending to AI** (Single and Bulk Mode)
   To prevent the AI from being confused by or accidentally modifying whitespace, the tool:
   - Strips all leading/trailing whitespace from texts
   - Stores the original whitespace pattern
   - Sends only the clean text content to the AI

   For example, if `msgid` is `" Incorrect"` (with leading space), the AI receives only `"Incorrect"`.

3. **After Receiving Translation**
   Once the AI returns the translation, the tool:
   - Extracts the original whitespace pattern from the source `msgid`
   - Applies that exact pattern to the translated `msgstr`
   - Ensures the output matches the input whitespace structure

   So `" Incorrect"` → AI translates `"Incorrect"` → Result: `" Incorreto"` (leading space preserved).
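
The three steps above can be sketched as one round trip. This is a minimal illustration under stated assumptions, not the tool's actual code: `split_whitespace`, `translate_preserving_whitespace`, and the `fake_translate` stand-in for the provider call are hypothetical names introduced here.

```python
from typing import Callable, Tuple


def split_whitespace(text: str) -> Tuple[str, str, str]:
    """Return (leading whitespace, core text, trailing whitespace)."""
    core = text.strip()
    if not core:
        # Empty or whitespace-only input: there is no core text to translate.
        return text, "", ""
    leading = text[:len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return leading, core, trailing


def translate_preserving_whitespace(text: str, translate: Callable[[str], str]) -> str:
    """Strip, translate the core text, then re-apply the original whitespace."""
    leading, core, trailing = split_whitespace(text)
    if not core:
        return text
    return leading + translate(core).strip() + trailing


def fake_translate(core: str) -> str:
    # Hypothetical stand-in for the AI provider call.
    return {"Incorrect": "Incorreto"}.get(core, core)


result = translate_preserving_whitespace(" Incorrect", fake_translate)
```

Because the translator callable only ever receives the stripped core text, the leading space in the result comes from the source string alone.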
461+
462+
### Examples
463+
464+
| Original msgid | AI Receives | AI Returns | Final msgstr |
465+
|----------------|-------------|------------|--------------|
466+
| `" Hello"` | `"Hello"` | `"Bonjour"` | `" Bonjour"` |
467+
| `"World "` | `"World"` | `"Monde"` | `"Monde "` |
468+
| `" Hi "` | `"Hi"` | `"Salut"` | `" Salut "` |
469+
| `"\tTab"` | `"Tab"` | `"Onglet"` | `"\tOnglet"` |
470+
471+
### Why This Approach Is Reliable

This implementation is **robust** because:
- **The AI never sees the leading/trailing whitespace**, so it can't strip or modify it
- **Whitespace is managed entirely in code**, not reliant on AI behavior
- **Works consistently across all providers** (OpenAI, Anthropic, Azure, DeepSeek)
- **Handles edge cases**: empty strings, whitespace-only strings, mixed whitespace types (spaces, tabs, newlines)
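
To make the edge cases concrete, the sketch below mirrors the restore logic that this commit adds to `validate_translation` (whitespace-only originals are returned unchanged). The helper name `restore_whitespace` is hypothetical, and the retry checks from the real method are left out.

```python
def restore_whitespace(original: str, translated: str) -> str:
    """Apply the leading/trailing whitespace of `original` to `translated`."""
    if not original.strip():
        # Empty or whitespace-only source: return it untouched.
        return original
    leading = original[:len(original) - len(original.lstrip())]
    trailing = original[len(original.rstrip()):]
    # Any whitespace the model added around its answer is discarded first.
    return leading + translated.strip() + trailing


cases = {
    "empty": restore_whitespace("", "ignored"),
    "whitespace_only": restore_whitespace(" \t\n", "ignored"),
    "mixed": restore_whitespace("\t Hi \n", "Salut"),
    "model_added_space": restore_whitespace("Hi", "  Salut "),
}
```

Note that `str.strip()` with no arguments removes every character Python classifies as whitespace, so tabs, newlines, and non-breaking spaces at the edges are all captured in the preserved pattern.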

### Single vs. Bulk Mode

- **Single Mode**: Each text is stripped before sending to AI, then whitespace is restored after receiving the translation
- **Bulk Mode**: Entire batches are stripped before sending to AI (JSON array of clean texts), then whitespace is restored to each translation individually

Both modes use the same preservation logic, ensuring consistent behavior.
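
As a rough sketch of the bulk path, assuming the model replies with a JSON array in the same order as the input: the function names `build_bulk_payload` and `restore_bulk` and the hard-coded `reply` are illustrative only and do not come from the tool.

```python
import json
from typing import List


def build_bulk_payload(texts: List[str]) -> str:
    """Serialize whitespace-stripped texts as the JSON array sent to the model."""
    return json.dumps([t.strip() for t in texts])


def restore_bulk(originals: List[str], response_json: str) -> List[str]:
    """Re-apply each original's leading/trailing whitespace to its translation."""
    translations = json.loads(response_json)
    if len(translations) != len(originals):
        raise ValueError("translation count does not match input count")
    restored = []
    for original, translated in zip(originals, translations):
        if not original.strip():
            # Whitespace-only source: keep it as-is.
            restored.append(original)
            continue
        leading = original[:len(original) - len(original.lstrip())]
        trailing = original[len(original.rstrip()):]
        restored.append(leading + translated.strip() + trailing)
    return restored


originals = [" Hello", "World ", " Hi "]
payload = build_bulk_payload(originals)
# Hypothetical model reply for the payload above.
reply = '["Bonjour", "Monde", "Salut"]'
restored = restore_bulk(originals, reply)
```

Pairing by position is what makes per-item restoration possible, which is why the length check runs before any whitespace is re-applied.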

### Best Practices

1. **Avoid whitespace in msgid when possible**
   Whitespace in translation strings can cause formatting issues. Instead, handle spacing in your UI layer:
   ```python
   # Bad - whitespace inside the translatable string (produces msgid " Settings")
   print(_(" Settings"))

   # Good - whitespace in code
   print(f" {_('Settings')}")
   ```

2. **If whitespace is unavoidable**
   The tool will preserve it automatically. Use verbose mode to log examples of entries that contain whitespace:
   ```bash
   gpt-po-translator --folder ./locales --lang fr -vv
   ```

3. **Review whitespace warnings**
   When the tool warns about whitespace entries, consider refactoring your code to move the whitespace out of the translation strings.

---

## Behind the Scenes: API Calls and Post-Processing

- **Provider-Specific API Calls:**

python_gpt_po/services/translation_service.py

Lines changed: 68 additions & 11 deletions
@@ -206,6 +206,11 @@ def perform_translation_without_validation(
             target_language: str,
             detail_language: Optional[str] = None) -> str:
         """Performs translation without validation for single words or short phrases."""
+        # Extract leading/trailing whitespace from original
+        leading_ws = text[:len(text) - len(text.lstrip())]
+        trailing_ws = text[len(text.rstrip()):]
+        text_stripped = text.strip()
+
         # Use the detailed language name if provided, otherwise use the short code
         target_lang_text = detail_language if detail_language else target_language

@@ -216,7 +221,7 @@ def perform_translation_without_validation(
         )

         return self.validate_translation(text, self.perform_translation(
-            prompt + text, target_language, is_bulk=False, detail_language=detail_language
+            prompt + text_stripped, target_language, is_bulk=False, detail_language=detail_language
         ), target_language)

     @staticmethod
@@ -231,6 +236,7 @@ def get_translation_prompt(target_language: str, is_bulk: bool, detail_language:
                 "Provide only the translations in a JSON array format, maintaining the original order. "
                 "Each translation should be concise and direct, without explanations or additional context. "
                 "Keep special characters, placeholders, and formatting intact. "
+                "Do NOT add or remove any leading/trailing whitespace - translate only the text content. "
                 "If a term should not be translated (like 'URL' or technical terms), keep it as is. "
                 "Example format: [\"Translation 1\", \"Translation 2\", ...]\n\n"
                 "Texts to translate:\n"
@@ -253,15 +259,21 @@ def perform_translation(
         """Performs the actual translation using the selected provider's API."""
         logging.debug("Translating to '%s' via %s API", target_language, self.config.provider.value)
         prompt = self.get_translation_prompt(target_language, is_bulk, detail_language)
-        content = prompt + (json.dumps(texts) if is_bulk else texts)
+
+        # For bulk mode, strip whitespace before sending to AI
+        if is_bulk:
+            stripped_texts = [text.strip() for text in texts]
+            content = prompt + json.dumps(stripped_texts)
+        else:
+            content = prompt + texts

         try:
             # Get the response text from the provider
             response_text = self._get_provider_response(content)

             # Process the response according to bulk mode
             if is_bulk:
-                return self._process_bulk_response(response_text, texts, target_language)
+                return self._process_bulk_response(response_text, texts, target_language, stripped_texts)
             return self.validate_translation(texts, response_text, target_language)

         except Exception as e:
@@ -280,8 +292,20 @@ def _get_provider_response(self, content: str) -> str:
             return ""
         return provider_instance.translate(self.config.provider_clients, self.config.model, content)

-    def _process_bulk_response(self, response_text: str, original_texts: List[str], target_language: str) -> List[str]:
-        """Process a bulk translation response."""
+    def _process_bulk_response(
+            self,
+            response_text: str,
+            original_texts: List[str],
+            target_language: str,
+            stripped_texts: Optional[List[str]] = None) -> List[str]:
+        """Process a bulk translation response.
+
+        Args:
+            response_text: The raw response from the AI provider
+            original_texts: The original texts WITH whitespace
+            target_language: Target language code
+            stripped_texts: The stripped texts that were sent to AI (for validation)
+        """
         try:
             # Clean the response text for formatting issues
             clean_response = self._clean_json_response(response_text)
@@ -386,9 +410,19 @@ def _clean_json_response(self, response_text: str) -> str:

     def validate_translation(self, original: str, translated: str, target_language: str) -> str:
         """Validates the translation and retries if necessary."""
+        # Extract leading/trailing whitespace from original
+        original_stripped = original.strip()
+        if not original_stripped:
+            # If original is all whitespace, preserve it as-is
+            return original
+
+        leading_ws = original[:len(original) - len(original.lstrip())]
+        trailing_ws = original[len(original.rstrip()):]
+
+        # Strip the translation for validation
         translated = translated.strip()

-        if len(translated.split()) > 2 * len(original.split()) + 1:
+        if len(translated.split()) > 2 * len(original_stripped.split()) + 1:
             logging.debug("Translation too verbose (%d words), retrying", len(translated.split()))
             return self.retry_long_translation(original, target_language)

@@ -397,10 +431,16 @@ def validate_translation(self, original: str, translated: str, target_language:
             logging.debug("Translation contains explanation, retrying")
             return self.retry_long_translation(original, target_language)

-        return translated
+        # Restore original whitespace
+        return leading_ws + translated + trailing_ws

     def retry_long_translation(self, text: str, target_language: str) -> str:
         """Retries translation for long or explanatory responses."""
+        # Extract leading/trailing whitespace from original
+        leading_ws = text[:len(text) - len(text.lstrip())]
+        trailing_ws = text[len(text.rstrip()):]
+        text_stripped = text.strip()
+
         prompt = (
             f"Translate this text concisely from English to {target_language}. "
             "Provide only the direct translation without any explanation or additional context. "
@@ -410,15 +450,16 @@ def retry_long_translation(self, text: str, target_language: str) -> str:
         )

         try:
-            content = prompt + text
-            retried_translation = self._get_provider_response(content)
+            content = prompt + text_stripped
+            retried_translation = self._get_provider_response(content).strip()

-            if len(retried_translation.split()) > 2 * len(text.split()) + 1:
+            if len(retried_translation.split()) > 2 * len(text_stripped.split()) + 1:
                 logging.debug("Retry still too verbose, skipping")
                 return ""  # Return empty string instead of English text

             logging.debug("Retry successful")
-            return retried_translation
+            # Restore original whitespace
+            return leading_ws + retried_translation + trailing_ws

         except Exception as e:
             logging.debug("Retry failed: %s", str(e)[:100])
@@ -732,6 +773,22 @@ def _prepare_translation_request(self, po_file, po_file_path, file_lang, detail_
         texts = [entry.msgid for entry in entries]
         detail_lang = detail_languages.get(file_lang) if detail_languages else None

+        # Check for and warn about whitespace in msgid
+        whitespace_entries = [
+            text for text in texts
+            if text and (text != text.strip())
+        ]
+        if whitespace_entries:
+            logging.warning(
+                "Found %d entries with leading/trailing whitespace in %s. "
+                "Whitespace will be preserved in translations, but ideally should be handled in your UI framework.",
+                len(whitespace_entries),
+                po_file_path
+            )
+            if logging.getLogger().isEnabledFor(logging.DEBUG):
+                for text in whitespace_entries[:3]:  # Show first 3 examples
+                    logging.debug(" Example: %s", repr(text))
+
         return TranslationRequest(
             po_file=po_file,
             entries=entries,
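
The detection step added in the hunk above can be tried in isolation. The following is a simplified sketch, assuming plain string inputs rather than PO entries; `find_whitespace_entries` and `warn_about_whitespace` are hypothetical helper names, and only the filter condition and log calls are taken from the diff.

```python
import logging
from typing import List


def find_whitespace_entries(texts: List[str]) -> List[str]:
    """Return non-empty texts that carry leading or trailing whitespace."""
    return [t for t in texts if t and t != t.strip()]


def warn_about_whitespace(texts: List[str], po_file_path: str) -> int:
    """Log one warning for the file and return how many entries were flagged."""
    entries = find_whitespace_entries(texts)
    if entries:
        logging.warning(
            "Found %d entries with leading/trailing whitespace in %s.",
            len(entries),
            po_file_path,
        )
        # Up to three examples, emitted only when DEBUG logging is enabled.
        for text in entries[:3]:
            logging.debug(" Example: %r", text)
    return len(entries)


count = warn_about_whitespace([" Settings", "Save", "", "Name: ", "   "], "messages.po")
```

Under this filter an empty `msgid` is skipped, while a whitespace-only string such as `"   "` is still counted.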
