Start from scratch, document txt epub docx working pdf not working because need libreoffice, and llm local #184
Draft
Balrog57 wants to merge 11 commits into XapaJIaMnu:master
Features:
- Document translation support (DOCX, EPUB, PDF, TXT)
- Automatic splitting for files >10MB into 8MB segments
- Document structure and formatting preservation
- AI-powered translation refinement via LLM providers:
  * Local: Ollama, LM Studio
  * Cloud: OpenAI, Claude, Gemini
- New CLI flag: --ai-improve

Implementation:
- DocumentSplitter: Segments documents by format
- DocumentMerger: Reconstructs translated documents
- DocumentProcessor: High-level workflow orchestrator
- LLMInterface: Unified AI provider abstraction
- Settings: Added 7 LLM configuration parameters

Documentation:
- CLAUDE.md: Developer guide with architecture notes
- DOCUMENT_PROCESSING.md: Complete user guide with examples
- Implementation_Roadmap.md: Feature implementation plan

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Features:
- File menu with Open Document, Save Translation, and Exit actions
- DocumentTranslationDialog with progress tracking for translation and AI improvement
- Settings tab for LLM configuration (Ollama, LM Studio, OpenAI, Claude, Gemini)
- Connection testing and model discovery for local LLM providers
- Worker thread pattern for non-blocking UI during document translation

LLM improvements:
- Synchronized chunking (2000 chars) to prevent source/translation mismatches
- Proportional splitting of long lines to maintain alignment
- Ultra-short prompts optimized for 32k context models
- Enhanced error reporting with detailed HTTP status and response bodies

Tested with:
- LM Studio (qwen/qwen3-4b-2507)
- EPUB translation (Overlord Vol. 16, 7.3MB) successfully completed
- CLI and GUI modes both functional

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit addresses two critical issues with document translation:

1. EPUB Structure Preservation (CRITICAL FIX)
- DocumentSplitter now stores original XHTML content alongside extracted text
- DocumentMerger uses new replaceTextInXhtml() to preserve HTML DOM structure
- Only text nodes are replaced; all HTML tags, CSS links, classes preserved
- Before: Structure destroyed, CSS lost, paragraphs merged
- After: Full structure preserved (<h1>, <h2>, <p> tags, main.css link, etc.)

2. GUI Progress Bar Improvements
- DocumentTranslationDialog now connects to LLMInterface::verificationProgress
- Real-time chunk-level updates during AI improvement
- Status shows "AI improving segment X of Y (chunk Z/W)..."
- Before: Progress bar froze during LLM processing
- After: Smooth real-time updates showing actual progress

Technical Details:
- Added Segment.originalXhtml field to preserve structure
- Implemented word-based proportional text replacement algorithm
- Maintains source/translation synchronization across chunks
- AI improvement remains disabled by default (llmEnabled=false)

Files Modified:
- src/DocumentSplitter.h/cpp: Store original XHTML
- src/DocumentMerger.h/cpp: Preserve structure during merge
- src/DocumentTranslationDialog.cpp: Connect progress signals
- CLAUDE.md: Updated documentation

Tested with:
- Overlord Vol. 16 EPUB (7.3MB, 24 segments)
- Structure fully preserved (headings, CSS, paragraphs)
- Progress bar updates in real-time during LLM processing

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Delete Implementation_Roadmap.md (no longer needed, features complete)
- Rename DOCUMENT_PROCESSING.md → DOCUMENTATION.md
- Add comprehensive "What's New in This Fork" section highlighting:
  * Document processing (DOCX, EPUB, PDF, TXT) with auto-splitting
  * AI-powered improvement (5 providers: Ollama, LM Studio, OpenAI, Claude, Gemini)
  * Full XHTML/HTML structure preservation for EPUB
  * Real-time GUI progress tracking
  * Synchronized chunking algorithm (2000 chars)
  * Complete file list with line counts
- Maintain full user guide with quick start, configuration, troubleshooting
- Emphasize 100% backward compatibility with original translateLocally
- Add comparison table: Original vs Enhanced Fork

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Remove .claude/settings.local.json (local configuration, should not be in repo)
- Update .gitignore to exclude:
  * .claude/ directory (Claude Code local settings)
  * SESSION_LOG.md and *.patch (temporary files)
  * Test files (*_test.*, test_*.*, *_translated.*)

This ensures only source code and documentation are tracked in the repository.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
EPUB fixes:
- DocumentSplitter now detects paragraph boundaries (<p>, <h1>-<h6>) and preserves them with newlines, matching DOCX behavior
- Previously concatenated all text with spaces, losing structure
- Simplified replaceTextInXhtml to replace entire paragraph content while keeping tag structure
- Trade-off: inline formatting (<b>, <i>) lost but translations correct

DOCX improvements:
- Added replaceTextInWordXml helper for paragraph-by-paragraph text replacement
- Preserves paragraph properties and structure while simplifying text content
- Prevents word-sticking, formatting misalignment, and duplicate text issues

Both formats now use consistent paragraph-level approach for reliable document reconstruction.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CLAUDE.md updates:
- Clarified that paragraph-by-paragraph approach is used for both EPUB and DOCX
- Updated implementation details to explain trade-off: inline formatting removed but paragraph structure preserved
- Added technical reasoning in Known Limitations section

DOCUMENTATION.md updates:
- Updated "Structure Preservation Strategy" section with accurate description of current approach
- Revised EPUB format description to clarify limitations and benefits
- Fixed troubleshooting entry about structure corruption with commit reference
- Added comprehensive "Important Notes on Formatting Preservation" section explaining:
  * What is preserved (document structure, paragraphs, metadata, images)
  * What is not preserved (inline formatting like bold/italic within paragraphs)
  * Technical reasoning behind the design choice
  * Examples of problematic behaviors from word-by-word approach
  * Use case recommendations

These updates provide transparency about the deliberate trade-off made to ensure correct, readable translations without text mangling, word-sticking, or duplicate text issues.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements PDF translation using Poppler C++ API as an alternative to LibreOffice conversion, providing faster and more reliable PDF text extraction.

Key changes:
- CMakeLists.txt: Add Poppler detection and linking for vcpkg
  * Auto-detect Poppler C++ library in vcpkg installations
  * Link with poppler-cpp.lib and poppler.lib
  * Define HAVE_POPPLER when available
- DocumentSplitter.cpp/h: Implement Poppler-based PDF extraction
  * New splitPdfWithPoppler() function using Poppler C++ API
  * Extract text page-by-page with UTF-8 conversion
  * Use "pdf_poppler_" identifier prefix for Poppler segments
  * Fallback to LibreOffice when Poppler unavailable
- DocumentProcessor.cpp: Smart PDF output handling
  * Detect Poppler vs LibreOffice PDF processing
  * Output TXT for Poppler (text-only extraction)
  * Output DOCX for LibreOffice (preserves structure)

Benefits:
- No LibreOffice dependency required
- Faster PDF processing (direct text extraction)
- Qt version independent (uses C++ API, not Qt wrapper)
- Works with existing LLM improvement pipeline

Tested with A2-English-test-with-answers.pdf:
- Basic translation: working
- AI-improved translation: excellent quality

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…Zip Slip), performance optimizations (word count), and UI/UX improvements. Cleaned up temporary files.
translateLocally Enhanced - Documentation
A fork of translateLocally with document processing and AI-powered translation improvement.
What's New in This Fork
This enhanced version of translateLocally adds professional document translation capabilities and AI-powered post-editing while maintaining full compatibility with the original project.
Major Improvements Over Original
Key Technical Innovations
1. Document Processing Architecture
DocumentSplitter: Intelligent segmentation of documents into ≤8MB chunks
DocumentMerger: Reconstruction that preserves paragraph-level structure (inline formatting is dropped)
DocumentProcessor: High-level orchestration API
open() → getSegments() → setTranslatedSegments() → save()
2. AI-Powered Translation Improvement
LLMInterface: Unified abstraction for 5 AI providers
Synchronized Chunking: 2000-character chunks (~600-700 tokens) with source/translation alignment to prevent text mismatches
Sequential Processing: maxConcurrent=1 for local LLM stability
Optimized Prompts: Engineered to minimize verbosity and reasoning artifacts
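The synchronized chunking idea can be illustrated with a small Python sketch. It is not the fork's C++ implementation: it assumes the source and the translation are already aligned line by line, groups line pairs so that neither side of a chunk exceeds the 2000-character limit quoted above, and leaves out the proportional splitting of over-long lines. The function name `synchronized_chunks` is hypothetical.

```python
from typing import List, Tuple

MAX_CHUNK_CHARS = 2000  # chunk size quoted in the text


def synchronized_chunks(
    source_lines: List[str],
    translated_lines: List[str],
    max_chars: int = MAX_CHUNK_CHARS,
) -> List[Tuple[str, str]]:
    """Group aligned line pairs into (source, translation) chunks.

    Assumption: line i of the source corresponds to line i of the translation.
    A single pair longer than max_chars is emitted as its own oversized chunk.
    """
    if len(source_lines) != len(translated_lines):
        raise ValueError("source and translation must have the same number of lines")

    chunks: List[Tuple[str, str]] = []
    src_buf: List[str] = []
    tgt_buf: List[str] = []
    size = 0  # upper bound on the length of either joined buffer

    for src, tgt in zip(source_lines, translated_lines):
        pair_size = max(len(src), len(tgt))
        # Flush before the chunk would exceed the limit on either side.
        if src_buf and size + pair_size > max_chars:
            chunks.append(("\n".join(src_buf), "\n".join(tgt_buf)))
            src_buf, tgt_buf, size = [], [], 0
        src_buf.append(src)
        tgt_buf.append(tgt)
        size += pair_size + 1  # +1 for the joining newline

    if src_buf:
        chunks.append(("\n".join(src_buf), "\n".join(tgt_buf)))
    return chunks
```

Because both buffers are flushed together, every chunk holds the same lines on the source and translation side, which is the property that prevents mismatched text from being sent to the LLM.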
3. GUI Integration
4. Structure Preservation Strategy
EPUB XHTML Preservation: Paragraph-by-paragraph replacement approach
Paragraph-level tags are preserved: <p>, heading tags <h1>-<h6>
Inline formatting (<b>, <i>, <span>) within paragraphs is removed to ensure clean text replacement without mangling
DOCX Structure Preservation: Similar paragraph-level approach
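As a rough illustration of the paragraph-level trade-off, the following Python sketch replaces the content of each paragraph or heading element with one translated string, keeping the tag and its attributes but dropping inline children. It only mirrors the idea behind the fork's C++ `replaceTextInXhtml`; the function name `replace_paragraph_text` and the one-translation-per-element assumption are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize XHTML without ns0: prefixes

PARAGRAPH_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def replace_paragraph_text(xhtml: str, translated: List[str]) -> str:
    """Replace the content of each paragraph/heading with translated text.

    Assumption: `translated` holds exactly one string per paragraph-level
    element, in document order.
    """
    root = ET.fromstring(xhtml)
    targets = [
        elem for elem in root.iter()
        if isinstance(elem.tag, str) and local_name(elem.tag) in PARAGRAPH_TAGS
    ]
    if len(targets) != len(translated):
        raise ValueError("paragraph count does not match translation count")

    for elem, text in zip(targets, translated):
        # Drop inline children (<b>, <i>, <span>, ...) together with their tails.
        for child in list(elem):
            elem.remove(child)
        elem.text = text  # tag name and attributes (class, id) are untouched
    return ET.tostring(root, encoding="unicode")
```

For example, `<p>A <b>bold</b> word.</p>` becomes `<p>Un mot en gras.</p>`: the paragraph survives, the bold markup does not.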
5. Settings Persistence
New configuration parameters in QSettings:
Files Added to Original Codebase
Core Document Processing (8 files):
GUI Integration (3 files):
Modified Files:
--ai-improve flag
Total Addition: ~2,200 lines of new code
Compatibility with Original
This fork maintains 100% backward compatibility with the original translateLocally:
--ai-improve flag or GUI actions)
Table of Contents
Quick Start
Basic Document Translation
Translate a Word document from English to French:
With AI Improvement
Improve translation quality using a local AI model (requires Ollama or LM Studio):
GUI Document Translation
Supported Formats
TXT (Plain Text)
DOCX (Microsoft Word)
word/document.xml via libarchive
EPUB (E-books)
Detects paragraph boundaries (<p>, <h1>-<h6>), preserves paragraph tags while replacing text content
Inline formatting (<b>, <i>, <em>, <strong>) is removed to ensure correct translation without text mangling
PDF (Portable Document Format)
soffice.exe in PATH
Document Translation
How It Works
Document Split: Large documents are automatically split into segments of max 8 MB
Translation: Each segment is translated using the selected Marian model
Document Reconstruction: Translated segments are merged back
Example: Translating Large Documents
For documents larger than 10 MB, automatic splitting ensures smooth processing:
# Translate a 50 MB book from Spanish to English
translateLocally -m es-en-base -i libro_grande.epub -o big_book.epub
The document will be automatically split into ~7 segments, translated individually, and reassembled.
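The "~7 segments" figure follows from the thresholds quoted in this guide (split above 10 MB, segments of at most 8 MB). A small Python sketch of the arithmetic, assuming a purely size-based split; the real splitter cuts along format boundaries such as chapters or paragraphs, so actual counts may differ slightly:

```python
import math

MB = 1024 * 1024
SPLIT_THRESHOLD = 10 * MB  # files above this size are split (from the text)
SEGMENT_SIZE = 8 * MB      # maximum segment size (from the text)


def expected_segments(file_size_bytes: int) -> int:
    """Estimate the number of segments for a file of the given size."""
    if file_size_bytes <= SPLIT_THRESHOLD:
        return 1  # small files are processed in one piece
    return math.ceil(file_size_bytes / SEGMENT_SIZE)


print(expected_segments(50 * MB))  # 50 / 8 = 6.25, rounded up to 7
```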
AI-Powered Improvement
What Is AI Improvement?
AI improvement uses large language models (LLMs) to:
Supported AI Providers
Local AI (Recommended for Privacy)
Ollama (Free, runs locally)
Pull a model: ollama pull mistral
Provider: Ollama
URL: http://localhost:11434
Model: mistral
LM Studio (Free, runs locally)
Provider: LM Studio
URL: http://localhost:1234
Cloud AI (Requires API Keys)
OpenAI (GPT-3.5, GPT-4)
Provider: OpenAI
URL: https://api.openai.com
Model: gpt-4o-mini (recommended) or gpt-4o
Claude (Anthropic)
Provider: Claude
URL: https://api.anthropic.com
Model: claude-3-5-sonnet-20241022 (recommended)
Gemini (Google)
Provider: Gemini
URL: https://generativelanguage.googleapis.com
Model: gemini-1.5-flash or gemini-1.5-pro
How AI Improvement Works
Usage Example
Configuration
Settings Location
LLM settings are stored in:
Windows: HKEY_CURRENT_USER\Software\translateLocally\translateLocally
Linux: ~/.config/translateLocally/translateLocally.conf
macOS: ~/Library/Preferences/com.translateLocally.translateLocally.plist
Available Settings
llmEnabled: true or false
llmProvider: Ollama, LM Studio, OpenAI, Claude, Gemini
llmUrl: e.g. http://localhost:11434
llmModel: e.g. mistral, gpt-4o-mini, claude-3-5-sonnet-20241022
openaiApiKey: sk-...
claudeApiKey: sk-ant-...
geminiApiKey: AI...
Manual Configuration Example
For Ollama with Mistral model, add to settings file:
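As a hedged illustration, the Python sketch below renders an INI fragment from the setting names listed under Available Settings. The `[General]` group is an assumption based on how Qt's QSettings INI format usually stores ungrouped keys; check the actual settings file before copying anything, since the fork may place these keys elsewhere. The helper name `render_llm_settings` is hypothetical.

```python
import configparser
import io


def render_llm_settings(provider: str, url: str, model: str,
                        enabled: bool = True, group: str = "General") -> str:
    """Render LLM settings as INI text (the group name is an assumption)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep camelCase keys instead of lowercasing them
    parser[group] = {
        "llmEnabled": "true" if enabled else "false",
        "llmProvider": provider,
        "llmUrl": url,
        "llmModel": model,
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Ollama with the Mistral model, using the URL and model name from this guide.
print(render_llm_settings("Ollama", "http://localhost:11434", "mistral"))
```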
GUI Configuration
Advanced Usage
Batch Processing Multiple Documents
Translate all DOCX files in a directory:
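One way to script a batch run is sketched below in Python. It only builds the command lines, using the `-m`, `-i`, `-o`, and `--ai-improve` flags shown elsewhere in this guide; the `_translated` output suffix and the helper names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def build_command(model: str, src: Path, dst: Path,
                  ai_improve: bool = False) -> List[str]:
    """Build one translateLocally invocation as an argument list."""
    cmd = ["translateLocally", "-m", model, "-i", str(src), "-o", str(dst)]
    if ai_improve:
        cmd.append("--ai-improve")
    return cmd


def plan_batch(paths: Iterable[Path], model: str,
               suffix: str = "_translated",
               ai_improve: bool = False) -> List[List[str]]:
    """Return one command per DOCX file, skipping earlier outputs."""
    commands = []
    for src in sorted(paths):
        if src.suffix.lower() != ".docx" or src.stem.endswith(suffix):
            continue  # not a DOCX, or already an output of a previous run
        dst = src.with_name(src.stem + suffix + src.suffix)
        commands.append(build_command(model, src, dst, ai_improve))
    return commands
```

To run the plan, pass `Path("docs").glob("*.docx")` as `paths` and hand each returned list to `subprocess.run(cmd, check=True)`.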
Pivot Translation with AI
Translate Spanish → English → German with AI improvement at each step:
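A two-step pivot can be expressed as two chained invocations in which the output of the first step is the input of the second. The Python sketch below only assembles the commands; the model names `es-en-base` and `en-de-tiny` appear elsewhere in this guide, while the file names and the helper `pivot_commands` are hypothetical.

```python
from typing import List


def pivot_commands(src_file: str, pivot_file: str, out_file: str,
                   first_model: str, second_model: str,
                   ai_improve: bool = True) -> List[List[str]]:
    """Build the two translateLocally commands for a pivot translation."""
    steps = [
        (first_model, src_file, pivot_file),   # e.g. Spanish -> English
        (second_model, pivot_file, out_file),  # e.g. English -> German
    ]
    commands = []
    for model, inp, out in steps:
        cmd = ["translateLocally", "-m", model, "-i", inp, "-o", out]
        if ai_improve:
            cmd.append("--ai-improve")  # AI improvement at each step
        commands.append(cmd)
    return commands


for cmd in pivot_commands("informe.docx", "report_en.docx", "bericht.docx",
                          "es-en-base", "en-de-tiny"):
    print(" ".join(cmd))
```

Running AI improvement on the intermediate English text as well as the final German text doubles the LLM time, so it may be worth disabling it for the first step when speed matters.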
Choosing the Right Model Size
For speed:
tiny models: Fast but lower quality
Example: en-fr-tiny
For quality:
base models: Slower but better quality
Example: en-fr-base
For AI improvement:
tiny Marian model + AI improvement for best speed/quality balance
Testing AI Improvement
Compare translations with and without AI:
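Once both outputs exist as plain text, the standard-library `difflib` module gives a quick view of what the AI pass changed. This is a sketch for TXT output; the labels `without_ai` and `with_ai` are placeholders, and DOCX or EPUB outputs would need their text extracted first.

```python
import difflib
from typing import List


def compare_translations(baseline: str, improved: str) -> List[str]:
    """Return a unified diff between the plain and AI-improved translations."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        improved.splitlines(),
        fromfile="without_ai",
        tofile="with_ai",
        lineterm="",
    ))


def similarity(baseline: str, improved: str) -> float:
    """Ratio in [0, 1]; 1.0 means the AI pass changed nothing."""
    return difflib.SequenceMatcher(None, baseline, improved).ratio()


plain = "Bonjour le monde\nCeci est un test"
improved = "Bonjour le monde\nCeci est un essai"
print("\n".join(compare_translations(plain, improved)))
```

A similarity close to 1.0 suggests the LLM made only light edits; a low value is a cue to read the diff carefully before trusting the improved version.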
Important Notes on Formatting Preservation
What Is Preserved
✅ Document structure: Chapters, sections, table of contents
✅ Paragraph boundaries: Headings (<h1>-<h6>), paragraphs (<p>) maintain proper structure
✅ Document-level formatting: Stylesheets, CSS classes, fonts, page layout
✅ Non-text content: Images, cover art, metadata, embedded files
✅ Archive integrity: DOCX and EPUB ZIP structure fully maintained
What Is Not Preserved
❌ Inline text formatting: Bold, italic, underline, font changes within paragraphs
❌ Spans and inline styles: <b>, <i>, <em>, <strong>, <span> tags within text
❌ Complex inline structures: Nested formatting, hyperlinks within text (document structure links preserved)
Why This Limitation Exists
This is a deliberate technical choice to ensure translation quality:
Problem with inline formatting preservation:
When attempting to preserve inline formatting (e.g., keeping <b>bold words</b> bold), we encountered severe issues:
Formatting misalignment (<b>simple test</b> became <b>document de</b>)
Malformed tags (<it instead of <i>)
Current solution:
Best for:
Not ideal for:
For these cases, consider manual post-processing to re-apply formatting based on the original document.
Troubleshooting
Document Processing Issues
Problem: "Failed to open document"
Problem: "Segment too large" error
Problem: PDF processing fails
soffice.exe is accessible
Problem: DOCX/EPUB structure corrupted after translation
Complex document with unusual formatting: FIXED in commit 99e292c
Inline formatting (<b>, <i>) within paragraphs is removed as a trade-off for correct text replacement.
AI Improvement Issues
Problem: AI improvement not working
ollama list and ollama ps
Problem: "Connection refused" error
Problem: AI makes translation worse
mistral → llama3)
Problem: Slow AI processing
mistral instead of llama3:70b)
Problem: API rate limiting errors
General Troubleshooting
Enable debug output:
This will show detailed information about:
Check dependencies:
Performance Optimization
Document Processing
Recommendation: Convert PDFs to DOCX manually before batch processing
AI Improvement
Recommendation: Use local LLM for sensitive documents, cloud for speed
Model Selection
en-fr-tiny + AI
en-fr-base, no AI
en-fr-base + AI
Examples Gallery
Example 1: Academic Paper
# Translate research paper with AI improvement for academic quality
translateLocally -m en-es-base -i paper.docx -o articulo.docx --ai-improve
Example 2: Novel Translation
# Translate e-book preserving chapter structure
translateLocally -m en-fr-base -i novel.epub -o roman.epub --ai-improve
Example 3: Business Report
# Fast translation for internal document
translateLocally -m en-de-tiny -i report.docx -o bericht.docx
Example 4: Multilingual Batch
FAQ
Q: Can I use AI improvement without internet?
A: Yes! Install Ollama or LM Studio for completely offline AI improvement.
Q: How much does cloud AI cost?
A: Varies by provider. GPT-4o-mini costs ~$0.15 per million input tokens. A 10,000-word document (about 13,000 tokens) costs roughly $0.002 in input tokens; output tokens are billed separately.
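The estimate can be reproduced with a few lines of Python. The price is the figure quoted in the answer; the ratio of about 4/3 tokens per English word is a common rule of thumb and an assumption here, and the calculation covers input tokens only.

```python
INPUT_PRICE_PER_MILLION_USD = 0.15  # GPT-4o-mini input price quoted above
TOKENS_PER_WORD = 4 / 3             # rule-of-thumb assumption for English text


def estimate_input_cost(words: int,
                        price_per_million: float = INPUT_PRICE_PER_MILLION_USD,
                        tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Estimate the input-token cost in USD for a document of `words` words."""
    tokens = words * tokens_per_word
    return tokens * price_per_million / 1_000_000


# 10,000 words -> about 13,333 tokens -> about $0.002
print(f"${estimate_input_cost(10_000):.4f}")
```

The AI improvement step sends both the source and the draft translation and receives a rewritten translation back, so the real bill for a document will be higher than this input-only figure.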
Q: Does AI improvement work with all language pairs?
A: Yes, but quality depends on the AI model's training. English, French, German, Spanish, Italian, Portuguese, Chinese, and Japanese typically work best.
Q: Can I disable AI improvement for specific documents?
A: Yes, simply omit the --ai-improve flag or uncheck the option in the GUI.
Q: What happens to images in DOCX/EPUB files?
A: Images and other binary content are preserved unchanged in the output document.
Q: Can I translate password-protected PDFs?
A: No, remove password protection first.
Q: Is my text sent to the internet when using local LLMs?
A: No, local LLMs (Ollama, LM Studio) process everything on your machine.
Q: Are these features compatible with the original translateLocally?
A: Yes! This is a 100% backward-compatible fork. All original functionality works unchanged. New features are purely additive.
Q: Can I contribute these features back to the original project?
A: Contributions are welcome! Submit pull requests to the upstream repository at https://github.com/XapaJIaMnu/translateLocally
Support and Contributing
For issues, feature requests, or contributions:
License
Document processing and AI improvement features are part of translateLocally and follow the same license as the original project.
Credits