PDF Processing Toolkit

Caution

USE AT YOUR OWN RISK. These tools are NOT production-ready and are intended solely for personal book translation purposes. They are experimental, unstable, and may cause failures, data loss, or unpredictable results. I provide NO GUARANTEE of their stability, reliability, or functionality. By using these tools, you accept full responsibility for any damages, errors, or consequences that may arise, including but not limited to system crashes, corrupted files, or complete loss of data. Proceed with extreme caution, as no support or liability will be provided under any circumstances.

A Python toolkit for converting PDFs to translated content with OCR support, cleaning, and proofreading. Designed for processing documents like memoirs, books, or academic papers that need accurate translation.

Setup

Important: Python 3.13 is required.

Install dependencies:

pip install pymupdf python-dotenv pdftext marker-pdf rich

# Possibly also needed: openai (python-dotenv is already listed above; pathlib is part of the standard library)

Create .env file:

OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o
OPENAI_BASE_URL=https://api.openai.com/v1
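
As a rough illustration of how these three variables might be consumed, here is a minimal sketch. It assumes the values are already in the process environment (for example after python-dotenv's load_dotenv() has run). The Settings class and load_settings helper are hypothetical names, not taken from the scripts, and the defaults simply mirror the sample file above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three values from the .env file."""
    api_key: str
    model: str
    base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing early on a missing key."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key or api_key == "your_key_here":
        raise ValueError("OPENAI_API_KEY is missing or still set to the placeholder")
    return Settings(
        api_key=api_key,
        # Defaults copy the sample .env; they are assumptions, not confirmed
        # behavior of the toolkit.
        model=env.get("OPENAI_MODEL", "gpt-4o"),
        base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
    )
```

Passing a plain dict instead of os.environ makes the helper easy to try without touching real credentials.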

Create prompt files: prompt.txt for translation, prompt.enhance.txt for proofreading.

Processing Flow

PDF → Extract Pages → Clean Content → Fix Hyphens → Translate → Proofread → HTML

pdf_to_markdown.py - Converts PDF pages to numbered markdown files using OCR. Extracts images and handles both digital and scanned PDFs. Filters images smaller than 170x170px to remove artifacts.

cleaner.py - Removes OCR artifacts, LaTeX formulas ( $...$ patterns), and page numbers. Organizes content into <article> and <footer> sections based on <sup> tag detection.

hyphen_fix.py - Fixes words split by a hyphen across page boundaries. Analyzes sentence context and uses AI to select the correct continuation from the next 3 files. Moves sentence fragments between files when needed.

markdown_translator.py - Translates content using a sliding context window. Includes the previous N pages as context in each translation request to maintain consistency across the document.

proofreader.py - Proofreads translated content in configurable batches. Wraps each page in <div class="page"> tags and processes multiple pages in a single API call for efficiency.

Extract Pages - pdf_to_markdown.py

Converts PDF pages to numbered markdown files using the Marker library with OCR support.

Purpose: Extract text and images from PDF documents, handling both digital and scanned content.

Algorithm:

Page Extraction: Creates single-page PDF for each page using PyMuPDF
OCR Processing: Uses Marker library to convert PDF to markdown with configurable OCR settings
Image Filtering: Extracts all images, then filters out artifacts smaller than 170x170px
Image Naming: Prefixes remaining images with page_N_ to prevent filename conflicts
Content Cleanup: Removes markdown references to filtered images from text
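
The filtering, naming, and cleanup steps can be sketched as follows. This is an illustrative approximation, not the script's actual code: split_images and rewrite_references are hypothetical helpers, image sizes are passed in as plain tuples rather than read from files, and "smaller than 170x170px" is interpreted here as "either side under 170 pixels".

```python
import re

MIN_SIDE = 170  # pixels; threshold taken from the description above


def split_images(
    sizes: dict[str, tuple[int, int]], page: int
) -> tuple[dict[str, str], list[str]]:
    """Return (kept, dropped); kept maps the old name to its page-prefixed name."""
    kept: dict[str, str] = {}
    dropped: list[str] = []
    for name, (width, height) in sizes.items():
        # Assumption: an image counts as an artifact if either side is too small.
        if width < MIN_SIDE or height < MIN_SIDE:
            dropped.append(name)
        else:
            kept[name] = f"page_{page}_{name}"
    return kept, dropped


def rewrite_references(markdown: str, kept: dict[str, str], dropped: list[str]) -> str:
    """Remove markdown references to dropped images and rename the kept ones."""
    for name in dropped:
        markdown = re.sub(r"!\[[^\]]*\]\(" + re.escape(name) + r"\)", "", markdown)
    for old, new in kept.items():
        markdown = markdown.replace(f"({old})", f"({new})")
    return markdown
```

For example, with a 400x300 figure and a 12x12 speck on page 3, only the figure survives and its reference becomes page_3_<name>.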

Usage:

# Basic conversion
python pdf_to_markdown.py memoir.pdf ./output/

# Force OCR for scanned documents
python pdf_to_markdown.py memoir.pdf ./output/ --force-ocr

# Enhanced AI processing
python pdf_to_markdown.py memoir.pdf ./output/ --force-ocr --use-llm

# Merge all pages into single file
python pdf_to_markdown.py memoir.pdf ./output/ --merge

Output: Creates numbered files 1.md, 2.md, 3.md, etc. with extracted images in the same directory.

Clean Content - cleaner.py

Removes OCR artifacts and organizes content into structured sections.

Purpose: Clean up OCR output and standardize document structure.

Algorithm:

Fragment Removal: Removes hardcoded patterns such as "&</sup>lt;sup>" that are listed in FRAGMENTS_TO_REMOVE
LaTeX Cleanup: Removes mathematical formulas matching $...$ pattern using regex
Content Sectioning:
- Finds first line starting with <sup> tag
- Splits content: everything before that line = main content, that line and everything after it = footnotes
- Wraps main content in <article> tags
- Wraps footnotes in <footer> tags
Page Number Removal: Detects and removes trailing page numbers (digits or $digits$ )
Whitespace Trimming: Removes empty lines and trailing spaces inside article sections
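
A minimal sketch of the formula removal, sectioning, and page-number steps is shown below. It is an approximation under stated assumptions: clean_page is a hypothetical function, the page number is assumed to be a digits-only last line of the main text, and the exact output layout (line breaks, the --- separator) is modeled on the example transformation rather than confirmed from the script. Fragment removal is omitted.

```python
import re

# $...$ on a single line. Note this would also match currency such as "$5 and $6".
LATEX_INLINE = re.compile(r"\$[^$\n]*\$")


def clean_page(text: str) -> str:
    """Strip formulas and a trailing page number, then split into article/footer."""
    text = LATEX_INLINE.sub("", text)
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse gaps left by removed formulas
    lines = text.splitlines()
    # Footnotes are assumed to start at the first line beginning with <sup>.
    split_at = next(
        (i for i, line in enumerate(lines) if line.lstrip().startswith("<sup>")),
        len(lines),
    )
    body = [line.rstrip() for line in lines[:split_at] if line.strip()]
    notes = [line.rstrip() for line in lines[split_at:] if line.strip()]
    # Assumption: a page number is a digits-only final line of the main text.
    # A "$12$" page number was already removed as a formula above.
    if body and re.fullmatch(r"\d+", body[-1].strip()):
        body.pop()
    result = "<article>\n" + "\n".join(body) + "\n</article>"
    if notes:
        result += "\n---\n<footer>\n" + "\n".join(notes) + "\n</footer>"
    return result
```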

Usage:

# Clean all markdown files in directory
python cleaner.py ./output/

# Process specific directory
python cleaner.py /path/to/markdown/files/

Example transformation:

Before: "Some text $x^2 + y = z$ more text&</sup>lt;sup>1</sup> footnote"
After:  "<article>Some text more text</article>---<footer><sup>1</sup> footnote</footer>"

Fix Hyphens - hyphen_fix.py

Fixes words broken across page boundaries by hyphens.

Purpose: Reconstruct words split by page breaks while maintaining document flow.

Algorithm:

Hyphen Detection: Scans all files to find <article> sections ending with hyphen
Candidate Analysis: For each hyphen file, examines next 3 files for potential continuations
Context Extraction: Shows user/AI the original phrase ending with hyphen
Decision Making:
- Manual mode: User selects correct continuation file
- AI mode: Sends options to LLM for automatic selection
Sentence Boundary Detection: Finds sentence start by scanning backwards for ., ?, !
Content Transfer:
- Extracts sentence fragment from source file
- Combines with first word from target file
- Updates both files with reconstructed content
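
The sentence boundary detection and content transfer steps might look roughly like this. The function name is hypothetical, and the sketch assumes the whole unfinished sentence moves to the target file, which is one reading of the steps above. It does not handle abbreviations such as "e.g." that contain a period.

```python
def move_hyphenated_fragment(source: str, target: str) -> tuple[str, str]:
    """Move the unfinished sentence from source to target, joining the split word."""
    source = source.rstrip()
    if not source.endswith("-"):
        return source, target  # nothing to fix
    # Scan backwards for the last sentence terminator; start is 0 if none is found.
    start = max(source.rfind(ch) for ch in ".?!") + 1
    fragment = source[start:-1].lstrip()  # unfinished sentence without the hyphen
    new_source = source[:start].rstrip()
    # The target begins with the rest of the broken word, so join without a space.
    new_target = fragment + target.lstrip()
    return new_source, new_target
```

With a source ending in "To jest przykład-" and a target starting with "owo", the target would begin with "To jest przykładowo" and the source would end at the previous sentence.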

Usage:

# Manual review mode
python hyphen_fix.py ./output/

# Automatic AI mode
python hyphen_fix.py ./output/ --use-llm

Example fix:

Before: File 5.md ends with "przykład-", File 6.md starts with "owo"
After: File 5.md ends with "przykład", File 6.md starts with "przykładowo"

Translate - markdown_translator.py

Translates content using sliding context window for consistency.

Purpose: Provide contextually accurate translations by considering surrounding pages.

Context Window Algorithm:

Window Construction: For page N, includes pages N-P to N-1 as context (P = window size)
Bidirectional Mode: When enabled, also includes pages N+1 to N+P
Context Formatting: Labels context as "Previous Page -1", "Previous Page -2", etc.
Translation Request: Combines context + current page in single API call
Consistency Maintenance: AI uses context to maintain terminology and style consistency
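
The window construction above can be sketched in a few lines. The helper names are hypothetical; the "Previous Page -N" label comes from the description, while the "Next Page +N" and "Current Page" labels and the bracketed layout are assumptions made for illustration.

```python
def context_pages(page: int, window: int, total: int, bidirectional: bool = False) -> list[int]:
    """Return the page numbers used as context for `page` (1-based)."""
    before = [p for p in range(page - window, page) if p >= 1]
    after = []
    if bidirectional:
        after = [p for p in range(page + 1, page + window + 1) if p <= total]
    return before + after


def build_prompt(pages: dict[int, str], page: int, window: int, bidirectional: bool = False) -> str:
    """Combine labeled context pages and the current page into one request body."""
    parts = []
    for p in context_pages(page, window, max(pages), bidirectional):
        if p not in pages:
            continue  # skip gaps in the numbering
        offset = p - page
        # "Next Page" is an assumed label for the bidirectional case.
        label = f"Previous Page {offset}" if offset < 0 else f"Next Page +{offset}"
        parts.append(f"[{label}]\n{pages[p]}")
    parts.append(f"[Current Page]\n{pages[page]}")
    return "\n\n".join(parts)
```

Pages near the start or end of the document simply get a shorter window, since out-of-range page numbers are dropped.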

Usage:

# Basic translation to Russian
python markdown_translator.py ./output/ --lang ru

# With context window
python markdown_translator.py ./output/ --lang ru --context-window 2

# Bidirectional context
python markdown_translator.py ./output/ --lang ru --bidirectional

# Specific pages
python markdown_translator.py ./output/ --lang ru --pages "1-10,15,20-25"

Context Example (context-window 2):

Page 5: Uses pages 3-4 as context
Page 6: Uses pages 4-5 as context
Bidirectional: Page 5 uses pages 3-4 and 6-7

Output: Creates translated files 1.ru.md, 2.ru.md, 3.ru.md, etc.
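
A --pages value such as "1-10,15,20-25" mixes ranges and single pages. One plausible way to expand it is sketched below. parse_pages is a hypothetical helper, and the script's real parsing rules (for example how it treats duplicates or reversed ranges) are not documented here.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1-3,5" into a sorted list of unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```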

Proofread - proofreader.py

Proofreads translated content in batches and generates HTML output.

Purpose: Improve translation quality and convert to publication-ready format.

Batch Processing Algorithm:

File Collection: Gathers all .ru.md files in numerical order
Batch Creation: Groups consecutive pages into batches of specified size
Page Wrapping: Wraps each page content in <div class="page" data-page="N"> tags
Batch Assembly: Combines all pages in batch into single request
API Optimization: Processes entire batch in one API call instead of individual calls
Response Processing: Extracts and reassembles individual page results
HTML Generation: Creates complete HTML document with CSS styling
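
The batching, wrapping, and response-splitting steps can be sketched as below. The function names are hypothetical, and the sketch assumes the model returns each page inside the same <div class="page" data-page="N"> wrapper it received. The API call itself and the HTML generation are left out.

```python
import re

# Non-greedy match; this would break if a page itself contained a nested </div>.
PAGE_DIV = re.compile(r'<div class="page" data-page="(\d+)">(.*?)</div>', re.DOTALL)


def make_batches(page_numbers: list[int], size: int) -> list[list[int]]:
    """Group page numbers, in numerical order, into batches of at most `size`."""
    ordered = sorted(page_numbers)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def wrap_batch(pages: dict[int, str], batch: list[int]) -> str:
    """Wrap each page of a batch in a page div and join them into one request."""
    return "\n".join(
        f'<div class="page" data-page="{n}">\n{pages[n]}\n</div>' for n in batch
    )


def split_response(response: str) -> dict[int, str]:
    """Recover individual pages from a batched response, keyed by page number."""
    return {int(n): body.strip() for n, body in PAGE_DIV.findall(response)}
```

With a batch size of 10, make_batches over 100 pages yields 10 batches, which matches the one-call-per-batch idea described above.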

Usage:

# Basic proofreading
python proofreader.py ./output/

# Custom batch size
python proofreader.py ./output/ --batch 10

# Specific pages and output file
python proofreader.py ./output/ --pages "1-50" --output part1.html

Efficiency Example (batch size 10):

100 pages = 10 API calls instead of 100
Faster processing and lower API costs
Maintains formatting consistency within batches

Output: Clean HTML file ready for printing or digital reading.

Complete Usage Example

Process a Polish memoir to Russian:

# 1. Extract PDF pages
python pdf_to_markdown.py memoir.pdf ./output/ --force-ocr

# 2. Clean content
python cleaner.py ./output/

# 3. Fix broken words
python hyphen_fix.py ./output/ --use-llm

# 4. Translate with context
python markdown_translator.py ./output/ --lang ru --context-window 2

# 5. Proofread and generate HTML
python proofreader.py ./output/ --output memoir_final.html
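
The five steps can also be chained from a small driver script. This is only a sketch: pipeline_commands and run_pipeline are hypothetical helpers, the script names and flags are copied from the commands above, and the driver assumes it runs from the directory that contains the scripts.

```python
import subprocess
import sys


def pipeline_commands(
    pdf: str,
    out_dir: str,
    lang: str = "ru",
    window: int = 2,
    html: str = "memoir_final.html",
) -> list[list[str]]:
    """Return the five commands from the example above as argument lists."""
    py = sys.executable  # run each step with the current interpreter
    return [
        [py, "pdf_to_markdown.py", pdf, out_dir, "--force-ocr"],
        [py, "cleaner.py", out_dir],
        [py, "hyphen_fix.py", out_dir, "--use-llm"],
        [py, "markdown_translator.py", out_dir, "--lang", lang,
         "--context-window", str(window)],
        [py, "proofreader.py", out_dir, "--output", html],
    ]


def run_pipeline(pdf: str, out_dir: str) -> None:
    """Run the steps in order and stop at the first one that fails."""
    for cmd in pipeline_commands(pdf, out_dir):
        print("Running:", " ".join(cmd[1:]))
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("memoir.pdf", "./output/") would then execute the same sequence as the shell commands. Because check=True raises on a non-zero exit code, a failed extraction does not feed broken input into later steps.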

Reviewing with Obsidian

Use Obsidian to review the markdown files. Point it to your output directory as a vault. The file explorer shows all numbered pages, and you can compare original and translated versions side by side in split view.

Troubleshooting

Poor OCR quality: Add --use-llm to the PDF conversion step (pdf_to_markdown.py)

Broken translations: Increase --context-window or add --bidirectional

Memory issues: Use smaller --pages ranges or reduce --batch size

Debug mode: Add --debug to any tool to see AI prompts and responses

Repository: a-bashtannik/trans-tools