Start from scratch, document txt epub docx working pdf not working because need libreoffice, and llm local#184

Draft
Balrog57 wants to merge 11 commits into XapaJIaMnu:master from Balrog57:master
Conversation


@Balrog57 Balrog57 commented Feb 2, 2026

translateLocally Enhanced - Documentation

A fork of translateLocally with document processing and AI-powered translation improvement.

What's New in This Fork

This enhanced version of translateLocally adds professional document translation capabilities and AI-powered post-editing while maintaining full compatibility with the original project.

Major Improvements Over Original

| Feature | Original translateLocally | This Enhanced Fork |
| --- | --- | --- |
| Document support | Plain text only (stdin/stdout) | DOCX, EPUB, PDF, TXT with structure preservation |
| Large file handling | 10 MB hard limit | Automatic splitting for unlimited file sizes |
| Translation quality | Marian neural translation only | Optional AI improvement via local/cloud LLMs |
| Structure preservation | N/A | Full XHTML/HTML/XML structure preservation |
| GUI document processing | Not available | File menu, progress dialogs, real-time updates |
| EPUB support | Not available | Professional e-book translation with DOM preservation |
| AI integration | Not available | 5 providers (Ollama, LM Studio, OpenAI, Claude, Gemini) |
| Progress tracking | Basic console output | Real-time GUI progress bars with segment tracking |

Key Technical Innovations

1. Document Processing Architecture

  • DocumentSplitter: Intelligent segmentation of documents into ≤8MB chunks

    • TXT: Paragraph-based splitting with structure preservation
    • DOCX: ZIP/XML parsing via libarchive
    • EPUB: XHTML parsing with full DOM preservation
    • PDF: LibreOffice conversion pipeline
  • DocumentMerger: Reconstruction with perfect structure preservation

    • EPUB: Word-based text replacement algorithm preserving all HTML tags, CSS classes, and formatting
    • DOCX: Full ZIP archive rebuilding with metadata preservation
    • Archive formats: Maintains non-text content (images, stylesheets, metadata)
  • DocumentProcessor: High-level orchestration API

    • Simple workflow: open() → getSegments() → setTranslatedSegments() → save()
    • Thread-safe design for GUI integration
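The open/getSegments/setTranslatedSegments/save workflow can be sketched as follows. This is a minimal Python illustration of the API shape only, not the fork's C++ code; all names are hypothetical.

```python
class DocumentProcessor:
    """Illustrative sketch of the high-level orchestration API."""

    def __init__(self):
        self._segments = []
        self._translated = []

    def open(self, text):
        # Extract translatable segments (here: plain paragraphs).
        self._segments = [p for p in text.split("\n\n") if p]

    def get_segments(self):
        return list(self._segments)

    def set_translated_segments(self, translated):
        # Translations must map one-to-one onto the extracted segments.
        if len(translated) != len(self._segments):
            raise ValueError("segment count mismatch")
        self._translated = list(translated)

    def save(self):
        # Reassemble the document from the translated segments.
        return "\n\n".join(self._translated)
```

A caller opens a document, translates the segments externally, hands them back, and saves:

```python
proc = DocumentProcessor()
proc.open("Hello.\n\nWorld.")
proc.set_translated_segments(["Bonjour.", "Monde."])
result = proc.save()   # "Bonjour.\n\nMonde."
```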

2. AI-Powered Translation Improvement

  • LLMInterface: Unified abstraction for 5 AI providers

  • Synchronized Chunking: 2000-character chunks (~600-700 tokens) with source/translation alignment to prevent text mismatches

  • Sequential Processing: maxConcurrent=1 for local LLM stability

  • Optimized Prompts: Engineered to minimize verbosity and reasoning artifacts
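The synchronized chunking idea above can be sketched in a few lines: aligned (source, translation) line pairs are grouped so that each chunk's translated text stays under the character budget, keeping the matching source context in the same LLM request. This is an illustrative Python sketch, not the fork's C++ implementation.

```python
def synchronized_chunks(source_lines, translated_lines, max_chars=2000):
    """Group aligned (source, translation) line pairs into chunks whose
    translated text stays under max_chars, preserving alignment."""
    chunks, cur_src, cur_tgt, size = [], [], [], 0
    for src, tgt in zip(source_lines, translated_lines):
        # Flush the current chunk before it would exceed the budget.
        if cur_tgt and size + len(tgt) > max_chars:
            chunks.append(("\n".join(cur_src), "\n".join(cur_tgt)))
            cur_src, cur_tgt, size = [], [], 0
        cur_src.append(src)
        cur_tgt.append(tgt)
        size += len(tgt) + 1   # +1 for the joining newline
    if cur_tgt:
        chunks.append(("\n".join(cur_src), "\n".join(cur_tgt)))
    return chunks
```

Because source and translation lines are only ever moved together, a chunk can never contain a translation whose source sits in a different request, which is what prevents the text mismatches mentioned above.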

3. GUI Integration

  • File Menu: "Open Document" and "Save Translation" actions with keyboard shortcuts
  • Settings Dialog: New "AI Improvement" tab with provider configuration
  • Document Translation Dialog: Real-time progress bars for translation and AI improvement phases
  • Worker Thread Architecture: Non-blocking UI during long-running translations

4. Structure Preservation Strategy

  • EPUB XHTML Preservation: Paragraph-by-paragraph replacement approach

    • Preserves paragraph structure: <p>, heading tags <h1>-<h6>
    • Maintains CSS classes, stylesheet links, and document-level formatting
    • Keeps chapter structure, TOC, metadata, cover images
    • Trade-off: Inline formatting (<b>, <i>, <span>) within paragraphs is removed to ensure clean text replacement without mangling
  • DOCX Structure Preservation: Similar paragraph-level approach

    • Preserves paragraph properties and overall document structure
    • Maintains document metadata, images, and non-text content
    • Simplifies text content within each paragraph for reliable translation
    • Trade-off: Complex inline formatting may be simplified, but paragraph-level structure remains intact
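The paragraph-by-paragraph strategy described above can be illustrated with a small regex-based sketch: extract the plain text of each `<p>`/`<h1>`-`<h6>` block (stripping inline tags), then substitute translations back while keeping the surrounding tags and attributes intact. This is a simplified Python model assuming non-nested paragraph tags, not the fork's actual parser.

```python
import re

# Match an opening <p>/<h1>-<h6> tag, its content, and the matching close.
PARA_RE = re.compile(r"(<(p|h[1-6])\b[^>]*>)(.*?)(</\2>)", re.S)
TAG_RE = re.compile(r"<[^>]+>")   # any inline tag inside a paragraph

def extract_paragraphs(xhtml):
    """Plain text of each paragraph, inline tags (<b>, <i>, ...) stripped."""
    return [TAG_RE.sub("", m.group(3)) for m in PARA_RE.finditer(xhtml)]

def replace_paragraphs(xhtml, translations):
    """Swap each paragraph's content for its translation, keeping the
    original tags, CSS classes, and attributes untouched."""
    it = iter(translations)
    return PARA_RE.sub(lambda m: m.group(1) + next(it) + m.group(4), xhtml)
```

Note how the trade-off appears directly in the code: `TAG_RE` discards `<b>`/`<i>`/`<span>` inside a paragraph, but the paragraph tag itself, including its `class` attribute, survives replacement unchanged.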

5. Settings Persistence

New configuration parameters in QSettings:

llmEnabled          // Enable/disable AI improvement
llmProvider         // "Ollama", "LM Studio", "OpenAI", "Claude", "Gemini"
llmUrl              // Provider endpoint URL
llmModel            // Model identifier
openaiApiKey        // OpenAI authentication
claudeApiKey        // Claude authentication
geminiApiKey        // Gemini authentication
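On Linux, QSettings stores these keys in a plain INI file, so they can also be read or scripted outside the app. A minimal Python sketch (using the keys listed above; the INI content mirrors the manual configuration example later in this document):

```python
import configparser

# Example content of ~/.config/translateLocally/translateLocally.conf
ini = """
[General]
llmEnabled=true
llmProvider=Ollama
llmUrl=http://localhost:11434
llmModel=mistral
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini)
enabled = cfg.getboolean("General", "llmEnabled")
provider = cfg.get("General", "llmProvider")
```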

Files Added to Original Codebase

Core Document Processing (8 files):

GUI Integration (3 files):

Modified Files:

Total Addition: ~2,200 lines of new code

Compatibility with Original

This fork maintains 100% backward compatibility with the original translateLocally:

  • All original CLI commands work unchanged
  • Existing models and settings are preserved
  • Original GUI functionality intact
  • New features are purely additive (opt-in via --ai-improve flag or GUI actions)
  • Same build system, dependencies, and distribution model

Table of Contents

Quick Start

Basic Document Translation

Translate a Word document from English to French:

translateLocally -m en-fr-tiny -i document.docx -o document_fr.docx

With AI Improvement

Improve translation quality using a local AI model (requires Ollama or LM Studio):

translateLocally -m en-fr-tiny -i document.docx -o document_fr.docx --ai-improve

GUI Document Translation

  1. File → Open Document (Ctrl+O)
  2. Select DOCX, EPUB, PDF, or TXT file
  3. Choose output path
  4. Click "Start Translation"
  5. Watch real-time progress bars for translation and AI improvement

Supported Formats

TXT (Plain Text)

  • Splitting strategy: By paragraphs
  • Structure preservation: Maintains paragraph breaks and whitespace
  • Maximum segment: 8 MB per segment
  • Best for: Simple text files, articles, books
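The paragraph-based splitting strategy can be sketched as follows: paragraphs are accumulated into a segment until adding the next one would push it past the size limit, so paragraph breaks are never cut mid-way. This is an illustrative Python sketch of the idea, not the fork's C++ code.

```python
def split_paragraphs(text, max_bytes=8 * 1024 * 1024):
    """Accumulate paragraphs into segments of at most max_bytes (UTF-8),
    never splitting inside a paragraph."""
    segments, current = [], ""
    for para in text.split("\n\n"):
        candidate = current + ("\n\n" if current else "") + para
        if current and len(candidate.encode("utf-8")) > max_bytes:
            segments.append(current)   # close the full segment
            current = para             # start a new one with this paragraph
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Joining the segments back with blank lines reproduces the original text, which is what makes the later merge step lossless for TXT.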

DOCX (Microsoft Word)

  • Processing method: ZIP archive extraction and XML parsing
  • Structure preservation: Full formatting, styles, embedded images, metadata
  • Technical details: Parses word/document.xml via libarchive
  • Best for: Formatted documents, reports, letters

EPUB (E-books)

  • Processing method: ZIP archive extraction and XHTML parsing with paragraph detection
  • Structure preservation: Document-level structure maintained (chapters, headings, paragraphs, CSS stylesheets, metadata, cover images)
  • Technical details: Parses content XHTML files via libarchive, detects paragraph boundaries (<p>, <h1>-<h6>), preserves paragraph tags while replacing text content
  • Trade-off: Inline formatting within paragraphs (<b>, <i>, <em>, <strong>) is removed to ensure correct translation without text mangling
  • Quality: Clean, readable output with reliable paragraph structure - suitable for e-books where content matters more than inline formatting
  • Best for: E-books, novels, articles, documentation where paragraph structure is more important than bold/italic formatting

PDF (Portable Document Format)

  • Processing method: PDF → DOCX conversion via LibreOffice, then DOCX workflow
  • Requirements: LibreOffice must be installed with soffice.exe in PATH
  • Limitations: Complex layouts may not convert perfectly
  • Best for: Simple PDF documents, text-heavy PDFs

Document Translation

How It Works

  1. Document Split: Large documents are automatically split into segments of max 8 MB

    • This bypasses the internal 10 MB processing limit
    • Segments maintain original structure identifiers for correct reconstruction
  2. Translation: Each segment is translated using the selected Marian model

    • Progress is shown for each segment
    • Original formatting is preserved
  3. Document Reconstruction: Translated segments are merged back

    • Original structure and metadata are maintained
    • Archive-based formats (DOCX, EPUB) preserve all non-text content
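The reconstruction step can be sketched as a lookup by structure identifier: each segment keeps the identifier it was given at split time, so translated segments can come back in any order and still land in the right place. An illustrative Python sketch with hypothetical identifiers:

```python
def merge_segments(template_ids, translated):
    """Reassemble a document from translated segments keyed by the
    structure identifiers recorded at split time."""
    by_id = dict(translated)   # {identifier: translated text}
    return "\n\n".join(by_id[i] for i in template_ids)

# Segments may arrive out of order; the identifiers restore the layout.
ids = ["seg_0", "seg_1"]
result = merge_segments(ids, [("seg_1", "World!"), ("seg_0", "Hello,")])
```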

Example: Translating Large Documents

For documents larger than 10 MB, automatic splitting ensures smooth processing:

# Translate a 50 MB book from Spanish to English
translateLocally -m es-en-base -i libro_grande.epub -o big_book.epub

The document will be automatically split into ~7 segments, translated individually, and reassembled.

AI-Powered Improvement

What Is AI Improvement?

AI improvement uses large language models (LLMs) to:

  • Fix awkward machine translation phrasing
  • Improve naturalness and fluency
  • Correct context-dependent errors
  • Maintain consistent terminology

Supported AI Providers

Local AI (Recommended for Privacy)

Ollama (Free, runs locally)

  1. Install Ollama: https://ollama.com/download
  2. Pull a model: ollama pull mistral
  3. Configure in translateLocally:
    • Provider: Ollama
    • URL: http://localhost:11434
    • Model: mistral

LM Studio (Free, runs locally)

  1. Install LM Studio: https://lmstudio.ai/
  2. Load a model and start the local server
  3. Configure in translateLocally:
    • Provider: LM Studio
    • URL: http://localhost:1234
    • Model: Your loaded model name

Cloud AI (Requires API Keys)

OpenAI (GPT-3.5, GPT-4)

  • Provider: OpenAI
  • URL: https://api.openai.com
  • Model: gpt-4o-mini (recommended) or gpt-4o
  • Requires: OpenAI API key

Claude (Anthropic)

  • Provider: Claude
  • URL: https://api.anthropic.com
  • Model: claude-3-5-sonnet-20241022 (recommended)
  • Requires: Claude API key

Gemini (Google)

  • Provider: Gemini
  • URL: https://generativelanguage.googleapis.com
  • Model: gemini-1.5-flash or gemini-1.5-pro
  • Requires: Gemini API key

How AI Improvement Works

  1. Machine Translation: Marian translates the text first
  2. Synchronized Chunking: Translation is split into 2000-character chunks (~600-700 tokens) with source/translation alignment maintained to prevent text mismatches
  3. AI Refinement: Each chunk is sent to the AI for improvement with real-time progress updates
  4. Sequential Processing: Chunks are processed one at a time for local LLM stability
  5. Structure Preservation: For EPUBs, HTML structure is maintained while only text content is improved
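The refinement loop in steps 3-4 can be sketched as a strictly sequential pass over the chunks, one blocking provider call at a time. This is an illustrative Python sketch; `refine` stands in for whatever provider call is configured (Ollama, LM Studio, OpenAI, ...), and all names are hypothetical.

```python
def improve_translation(chunks, refine):
    """Refine each (source, translation) chunk with the LLM, one chunk
    at a time (the sequential, maxConcurrent=1 behaviour described
    above), reporting progress as it goes."""
    improved = []
    for i, (src, tgt) in enumerate(chunks, start=1):
        improved.append(refine(src, tgt))   # blocking call, no concurrency
        print(f"AI improving chunk {i}/{len(chunks)}...")
    return improved
```

Running chunks sequentially trades speed for stability: a single local model never sees overlapping requests, which matters for small LLM servers.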

Usage Example

# Translate with local Ollama
translateLocally -m en-de-base -i report.docx -o bericht.docx --ai-improve

# The workflow:
# 1. Splits report.docx if needed
# 2. Translates with Marian (en-de-base)
# 3. Refines with Ollama mistral
# 4. Reconstructs bericht.docx

Configuration

Settings Location

LLM settings are stored in:

  • Windows: HKEY_CURRENT_USER\Software\translateLocally\translateLocally
  • Linux: ~/.config/translateLocally/translateLocally.conf
  • macOS: ~/Library/Preferences/com.translateLocally.translateLocally.plist

Available Settings

| Setting | Description | Example |
| --- | --- | --- |
| llmEnabled | Enable AI improvement | true or false |
| llmProvider | AI provider name | Ollama, LM Studio, OpenAI, Claude, Gemini |
| llmUrl | Provider endpoint | http://localhost:11434 |
| llmModel | Model identifier | mistral, gpt-4o-mini, claude-3-5-sonnet-20241022 |
| openaiApiKey | OpenAI API key | sk-... |
| claudeApiKey | Claude API key | sk-ant-... |
| geminiApiKey | Gemini API key | AI... |

Manual Configuration Example

For Ollama with Mistral model, add to settings file:

[General]
llmEnabled=true
llmProvider=Ollama
llmUrl=http://localhost:11434
llmModel=mistral

GUI Configuration

  1. Open translateLocally
  2. Go to Settings (Edit → Preferences)
  3. Navigate to "AI Improvement" tab
  4. Configure:
    • Enable AI improvement checkbox
    • Select provider from dropdown
    • Enter server URL (for local providers)
    • Select or enter model name
    • Enter API key (for cloud providers)
  5. Click "Test Connection" to verify
  6. Click "Apply" to save

Advanced Usage

Batch Processing Multiple Documents

Translate all DOCX files in a directory:

for file in *.docx; do
  translateLocally -m en-fr-tiny -i "$file" -o "${file%.docx}_fr.docx"
done

Pivot Translation with AI

Translate Spanish → English → German with AI improvement at each step:

# Step 1: Spanish to English with AI
translateLocally -m es-en-base -i documento.txt -o document_en.txt --ai-improve

# Step 2: English to German with AI
translateLocally -m en-de-base -i document_en.txt -o dokument.txt --ai-improve

Choosing the Right Model Size

For speed:

  • Use tiny models: Fast but lower quality
  • Example: en-fr-tiny

For quality:

  • Use base models: Slower but better quality
  • Example: en-fr-base

For AI improvement:

  • Start with tiny Marian + AI improvement for best speed/quality balance
  • The AI will fix most tiny model errors

Testing AI Improvement

Compare translations with and without AI:

# Without AI
translateLocally -m en-fr-tiny -i test.txt -o test_fr_noai.txt

# With AI
translateLocally -m en-fr-tiny -i test.txt -o test_fr_ai.txt --ai-improve

# Compare the outputs
diff test_fr_noai.txt test_fr_ai.txt

Important Notes on Formatting Preservation

What Is Preserved

Document structure: Chapters, sections, table of contents
Paragraph boundaries: Headings (<h1>-<h6>), paragraphs (<p>) maintain proper structure
Document-level formatting: Stylesheets, CSS classes, fonts, page layout
Non-text content: Images, cover art, metadata, embedded files
Archive integrity: DOCX and EPUB ZIP structure fully maintained

What Is Not Preserved

Inline text formatting: Bold, italic, underline, font changes within paragraphs
Spans and inline styles: <b>, <i>, <em>, <strong>, <span> tags within text
Complex inline structures: Nested formatting, hyperlinks within text (document structure links preserved)

Why This Limitation Exists

This is a deliberate technical choice to ensure translation quality:

Problem with inline formatting preservation:
When attempting to preserve inline formatting (e.g., keeping <b>bold words</b> bold), we encountered severe issues:

  • Word-sticking: Translated words appeared without spaces ("desmodules" instead of "des modules")
  • Misaligned formatting: Wrong words received bold/italic (e.g., <b>simple test</b> became <b>document de</b>)
  • Duplicate/missing text: When word counts differed between languages, text appeared twice or disappeared
  • Mangled structure: Sentences split mid-word, tags broken (<it instead of <i>)

Current solution:

  • Replace entire paragraph content as a unit
  • Guarantees correct, readable translations without text corruption
  • Maintains document readability and structure integrity
  • Trade-off: Inline formatting within paragraphs is removed
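The misalignment problem is easy to demonstrate. The tiny Python sketch below (hypothetical, for illustration only) shows why naively reusing the source's bold positions fails once word counts differ between languages, which is exactly the failure mode the paragraph-level replacement avoids:

```python
def naive_bold_map(source_words, bold_positions, target_words):
    # Reapply bold at the same word indices as in the source text.
    return ["<b>%s</b>" % w if i in bold_positions else w
            for i, w in enumerate(target_words)]

src = "a simple test".split()          # "test" (index 2) is bold
tgt = "un document de test".split()    # French needs four words, not three
result = naive_bold_map(src, {2}, tgt) # bolds "de" instead of "test"
```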

Best for:

  • E-books where content readability matters most
  • Documents where paragraph structure is more important than bold/italic
  • Translations where accuracy is critical

Not ideal for:

  • Documents with critical formatting (e.g., legal documents with specific bolded clauses)
  • Presentations with heavily formatted text
  • Marketing materials where visual formatting is essential

For these cases, consider manual post-processing to re-apply formatting based on the original document.

Troubleshooting

Document Processing Issues

Problem: "Failed to open document"

  • Cause: Unsupported format or corrupted file
  • Solution: Verify file format, try opening in original application first

Problem: "Segment too large" error

  • Cause: Single segment (e.g., huge table) exceeds 8 MB
  • Solution: Manually split the document into smaller files

Problem: PDF processing fails

  • Cause: LibreOffice not installed or not in PATH
  • Solution: Install LibreOffice and ensure soffice.exe is accessible

Problem: DOCX/EPUB structure corrupted after translation

  • Cause: Complex document with unusual formatting (fixed in commit 99e292c)
  • Solution: Latest version uses paragraph-by-paragraph approach that prevents text mangling, word-sticking, and duplicate text issues. Structure is preserved at paragraph level (headings, paragraphs maintain proper boundaries). Note: Inline formatting (<b>, <i>) within paragraphs is removed as a trade-off for correct text replacement.

AI Improvement Issues

Problem: AI improvement not working

  • Cause: LLM provider not running or incorrectly configured
  • Solution:
    • For Ollama: Verify with ollama list and ollama ps
    • For LM Studio: Check local server is started
    • For cloud APIs: Verify API key is correct

Problem: "Connection refused" error

  • Cause: Local LLM not running
  • Solution: Start Ollama or LM Studio before running translateLocally

Problem: AI makes translation worse

  • Cause: Wrong model or prompt configuration
  • Solution:
    • Try a different model (e.g., switch from mistral to llama3)
    • For cloud APIs, use recommended models (GPT-4o-mini, Claude Sonnet)

Problem: Slow AI processing

  • Cause: Local LLM running on CPU without acceleration
  • Solution:
    • Use smaller models (mistral instead of llama3:70b)
    • Consider cloud APIs for speed
    • Reduce concurrent processing in code

Problem: API rate limiting errors

  • Cause: Cloud provider rate limits exceeded
  • Solution:
    • Wait and retry
    • Use local LLM for unlimited processing
    • Upgrade API tier if needed

General Troubleshooting

Enable debug output:

translateLocally -m en-fr-tiny -i test.txt --debug

This will show detailed information about:

  • Document splitting process
  • Segment sizes and count
  • Translation progress
  • AI provider requests/responses
  • Error stack traces

Check dependencies:

# Verify LibreOffice installation
soffice --version

# Verify Ollama is running
curl http://localhost:11434/api/tags

Performance Optimization

Document Processing

  • TXT files: Fastest, no archive overhead
  • DOCX/EPUB: Moderate, requires ZIP extraction
  • PDF: Slowest, requires LibreOffice conversion

Recommendation: Convert PDFs to DOCX manually before batch processing

AI Improvement

  • Local LLMs: Privacy-focused, unlimited usage, slower
  • Cloud APIs: Fast, pay-per-use, requires internet

Recommendation: Use local LLM for sensitive documents, cloud for speed

Model Selection

| Quality Level | Speed | Model | Example Use Case |
| --- | --- | --- | --- |
| Fast | ⚡⚡⚡ | en-fr-tiny + AI | Quick drafts, chat |
| Balanced | ⚡⚡ | en-fr-base, no AI | Standard documents |
| High | ⚡ | en-fr-base + AI | Publication-quality |

Examples Gallery

Example 1: Academic Paper

# Translate research paper with AI improvement for academic quality
translateLocally -m en-es-base -i paper.docx -o articulo.docx --ai-improve

Example 2: Novel Translation

# Translate e-book preserving chapter structure
translateLocally -m en-fr-base -i novel.epub -o roman.epub --ai-improve

Example 3: Business Report

# Fast translation for internal document
translateLocally -m en-de-tiny -i report.docx -o bericht.docx

Example 4: Multilingual Batch

# Translate to 3 languages with AI
for lang in fr de es; do
  translateLocally -m en-${lang}-tiny -i source.txt -o output_${lang}.txt --ai-improve
done

FAQ

Q: Can I use AI improvement without internet?
A: Yes! Install Ollama or LM Studio for completely offline AI improvement.

Q: How much does cloud AI cost?
A: Varies by provider. GPT-4o-mini costs ~$0.15 per million input tokens. A 10,000-word document costs roughly $0.002.

Q: Does AI improvement work with all language pairs?
A: Yes, but quality depends on the AI model's training. English, French, German, Spanish, Italian, Portuguese, Chinese, and Japanese typically work best.

Q: Can I disable AI improvement for specific documents?
A: Yes, simply omit the --ai-improve flag or uncheck the option in the GUI.

Q: What happens to images in DOCX/EPUB files?
A: Images and other binary content are preserved unchanged in the output document.

Q: Can I translate password-protected PDFs?
A: No, remove password protection first.

Q: Is my text sent to the internet when using local LLMs?
A: No, local LLMs (Ollama, LM Studio) process everything on your machine.

Q: Are these features compatible with the original translateLocally?
A: Yes! This is a 100% backward-compatible fork. All original functionality works unchanged. New features are purely additive.

Q: Can I contribute these features back to the original project?
A: Contributions are welcome! Submit pull requests to the upstream repository at https://github.com/XapaJIaMnu/translateLocally

Support and Contributing

For issues, feature requests, or contributions:

License

Document processing and AI improvement features are part of translateLocally and follow the same license as the original project.

Credits

  • Original translateLocally: Developed by XapaJIaMnu and contributors
  • Bergamot Translator: Mozilla's browser-based translation project
  • Marian NMT: Fast neural machine translation framework
  • This Fork: Document processing and AI improvement features

Balrog57 and others added 7 commits February 1, 2026 11:57
Features:
- Document translation support (DOCX, EPUB, PDF, TXT)
- Automatic splitting for files >10MB into 8MB segments
- Document structure and formatting preservation
- AI-powered translation refinement via LLM providers:
  * Local: Ollama, LM Studio
  * Cloud: OpenAI, Claude, Gemini
- New CLI flag: --ai-improve

Implementation:
- DocumentSplitter: Segments documents by format
- DocumentMerger: Reconstructs translated documents
- DocumentProcessor: High-level workflow orchestrator
- LLMInterface: Unified AI provider abstraction
- Settings: Added 7 LLM configuration parameters

Documentation:
- CLAUDE.md: Developer guide with architecture notes
- DOCUMENT_PROCESSING.md: Complete user guide with examples
- Implementation_Roadmap.md: Feature implementation plan

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Features:
- File menu with Open Document, Save Translation, and Exit actions
- DocumentTranslationDialog with progress tracking for translation and AI improvement
- Settings tab for LLM configuration (Ollama, LM Studio, OpenAI, Claude, Gemini)
- Connection testing and model discovery for local LLM providers
- Worker thread pattern for non-blocking UI during document translation

LLM improvements:
- Synchronized chunking (2000 chars) to prevent source/translation mismatches
- Proportional splitting of long lines to maintain alignment
- Ultra-short prompts optimized for 32k context models
- Enhanced error reporting with detailed HTTP status and response bodies

Tested with:
- LM Studio (qwen/qwen3-4b-2507)
- EPUB translation (Overlord Vol. 16, 7.3MB) successfully completed
- CLI and GUI modes both functional

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit addresses two critical issues with document translation:

1. EPUB Structure Preservation (CRITICAL FIX)
   - DocumentSplitter now stores original XHTML content alongside extracted text
   - DocumentMerger uses new replaceTextInXhtml() to preserve HTML DOM structure
   - Only text nodes are replaced; all HTML tags, CSS links, classes preserved
   - Before: Structure destroyed, CSS lost, paragraphs merged
   - After: Full structure preserved (<h1>, <h2>, <p> tags, main.css link, etc.)

2. GUI Progress Bar Improvements
   - DocumentTranslationDialog now connects to LLMInterface::verificationProgress
   - Real-time chunk-level updates during AI improvement
   - Status shows "AI improving segment X of Y (chunk Z/W)..."
   - Before: Progress bar froze during LLM processing
   - After: Smooth real-time updates showing actual progress

Technical Details:
- Added Segment.originalXhtml field to preserve structure
- Implemented word-based proportional text replacement algorithm
- Maintains source/translation synchronization across chunks
- AI improvement remains disabled by default (llmEnabled=false)

Files Modified:
- src/DocumentSplitter.h/cpp: Store original XHTML
- src/DocumentMerger.h/cpp: Preserve structure during merge
- src/DocumentTranslationDialog.cpp: Connect progress signals
- CLAUDE.md: Updated documentation

Tested with:
- Overlord Vol. 16 EPUB (7.3MB, 24 segments)
- Structure fully preserved (headings, CSS, paragraphs)
- Progress bar updates in real-time during LLM processing

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Delete Implementation_Roadmap.md (no longer needed, features complete)
- Rename DOCUMENT_PROCESSING.md → DOCUMENTATION.md
- Add comprehensive "What's New in This Fork" section highlighting:
  * Document processing (DOCX, EPUB, PDF, TXT) with auto-splitting
  * AI-powered improvement (5 providers: Ollama, LM Studio, OpenAI, Claude, Gemini)
  * Full XHTML/HTML structure preservation for EPUB
  * Real-time GUI progress tracking
  * Synchronized chunking algorithm (2000 chars)
  * Complete file list with line counts
- Maintain full user guide with quick start, configuration, troubleshooting
- Emphasize 100% backward compatibility with original translateLocally
- Add comparison table: Original vs Enhanced Fork

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Remove .claude/settings.local.json (local configuration, should not be in repo)
- Update .gitignore to exclude:
  * .claude/ directory (Claude Code local settings)
  * SESSION_LOG.md and *.patch (temporary files)
  * Test files (*_test.*, test_*.*, *_translated.*)

This ensures only source code and documentation are tracked in the repository.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
EPUB fixes:
- DocumentSplitter now detects paragraph boundaries (<p>, <h1>-<h6>) and preserves them with newlines, matching DOCX behavior
- Previously concatenated all text with spaces, losing structure
- Simplified replaceTextInXhtml to replace entire paragraph content while keeping tag structure
- Trade-off: inline formatting (<b>, <i>) lost but translations correct

DOCX improvements:
- Added replaceTextInWordXml helper for paragraph-by-paragraph text replacement
- Preserves paragraph properties and structure while simplifying text content
- Prevents word-sticking, formatting misalignment, and duplicate text issues

Both formats now use consistent paragraph-level approach for reliable document reconstruction.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CLAUDE.md updates:
- Clarified that paragraph-by-paragraph approach is used for both EPUB and DOCX
- Updated implementation details to explain trade-off: inline formatting removed but paragraph structure preserved
- Added technical reasoning in Known Limitations section

DOCUMENTATION.md updates:
- Updated "Structure Preservation Strategy" section with accurate description of current approach
- Revised EPUB format description to clarify limitations and benefits
- Fixed troubleshooting entry about structure corruption with commit reference
- Added comprehensive "Important Notes on Formatting Preservation" section explaining:
  * What is preserved (document structure, paragraphs, metadata, images)
  * What is not preserved (inline formatting like bold/italic within paragraphs)
  * Technical reasoning behind the design choice
  * Examples of problematic behaviors from word-by-word approach
  * Use case recommendations

These updates provide transparency about the deliberate trade-off made to ensure
correct, readable translations without text mangling, word-sticking, or duplicate text issues.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Balrog57 and others added 4 commits February 2, 2026 15:34
This commit implements PDF translation using Poppler C++ API as an
alternative to LibreOffice conversion, providing faster and more
reliable PDF text extraction.

Key changes:

- CMakeLists.txt: Add Poppler detection and linking for vcpkg
  * Auto-detect Poppler C++ library in vcpkg installations
  * Link with poppler-cpp.lib and poppler.lib
  * Define HAVE_POPPLER when available

- DocumentSplitter.cpp/h: Implement Poppler-based PDF extraction
  * New splitPdfWithPoppler() function using Poppler C++ API
  * Extract text page-by-page with UTF-8 conversion
  * Use "pdf_poppler_" identifier prefix for Poppler segments
  * Fallback to LibreOffice when Poppler unavailable

- DocumentProcessor.cpp: Smart PDF output handling
  * Detect Poppler vs LibreOffice PDF processing
  * Output TXT for Poppler (text-only extraction)
  * Output DOCX for LibreOffice (preserves structure)

Benefits:
- No LibreOffice dependency required
- Faster PDF processing (direct text extraction)
- Qt version independent (uses C++ API, not Qt wrapper)
- Works with existing LLM improvement pipeline

Tested with A2-English-test-with-answers.pdf:
- Basic translation: working
- AI-improved translation: excellent quality

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…Zip Slip), performance optimizations (word count), and UI/UX improvements. Cleaned up temporary files.