Releases: oomol-lab/pdf-craft
v1.0.12
This release focuses on text quality in long-form documents: better cross-page paragraph merging and context-aware punctuation normalization for Chinese text.
What's Changed
Features
- Chinese punctuation normalization in #351
- Added punctuation normalization in chapter generation flow before internal level analysis
- Converts ASCII punctuation to full-width punctuation in Han text context
- Applies to chapter content, references, and asset title/caption while keeping raw asset body content unchanged
- Handles normalization across segmented text content and HTML tag boundaries
- Fixes #310
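The context rule above can be illustrated with a minimal sketch. It is not the project's implementation: the mapping table, the Han range check, and the function names are assumptions made for illustration, and it ignores segmented content and HTML tag boundaries, which the release notes say the real feature handles.

```python
import re

# Hypothetical mapping for illustration; the real feature may cover more marks.
ASCII_TO_FULLWIDTH = {
    ",": "，",
    ".": "。",
    "?": "？",
    "!": "！",
    ":": "：",
    ";": "；",
}

# Basic CJK Unified Ideographs block only (an assumption, not the full Han range).
_HAN = re.compile(r"[\u4e00-\u9fff]")


def is_han(ch: str) -> bool:
    """Return True when ch is a single Han character."""
    return bool(ch) and _HAN.match(ch) is not None


def normalize_punctuation(text: str) -> str:
    """Convert ASCII punctuation to full-width only next to Han characters."""
    out = []
    for i, ch in enumerate(text):
        replacement = ASCII_TO_FULLWIDTH.get(ch)
        if replacement is None:
            out.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Leave punctuation alone in purely Latin or numeric context.
        if is_han(prev_ch) or is_han(next_ch):
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)
```

Checking neighbours on both sides keeps English sentences and decimal numbers such as 3.14 untouched while still converting marks that sit between Han characters.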
Bug Fixes
- Fixed false negatives in cross-page paragraph merging in #349 & #348
- Prioritized hyphenated-word continuation checks before numbering checks
- Removed over-strict uppercase/Latin start heuristics that split natural paragraphs
- Added real-world regression coverage for previously split continuous paragraphs
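A rough sketch of why the order of checks matters when deciding whether a paragraph continues across a page break. The function name, the regular expressions, and the terminal-punctuation set are hypothetical; the project's actual heuristics are not shown in these notes.

```python
import re

# Hypothetical patterns for illustration only.
_NUMBERING = re.compile(r"^(?:\d+[.)]|[(（]\d+[)）]|[IVXLC]+\.)\s")
_HYPHEN_BREAK = re.compile(r"\w-$")
_TERMINAL = ".!?。！？"


def should_merge(prev_tail: str, next_head: str) -> bool:
    """Decide whether next_head continues the paragraph ending in prev_tail."""
    prev_tail = prev_tail.rstrip()
    next_head = next_head.lstrip()
    if not prev_tail or not next_head:
        return False
    # 1. A trailing hyphen means the word or range continues on the next page.
    #    This runs first so that "1914-" + "1918. ..." is not mistaken for a list item.
    if _HYPHEN_BREAK.search(prev_tail):
        return True
    # 2. A numbered item starts a new paragraph.
    if _NUMBERING.match(next_head):
        return False
    # 3. Otherwise merge when the previous text did not end a sentence.
    #    No uppercase or Latin-start check: a capitalized word can continue a sentence.
    return prev_tail[-1] not in _TERMINAL
```

With the numbering check first, the "1918." continuation would be split off as if it were list item 1918, which is the kind of false negative the fix describes.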
Full Changelog: v1.0.11...v1.0.12
v1.0.11
This release adds support for GitHub-Flavored Markdown (GFM) table rendering, enhances error handling for interruption scenarios, and includes important bug fixes and dependency updates.
What's Changed
Features
- GFM Table Support: Added intelligent conversion of HTML tables to GitHub-Flavored Markdown format in #345
  - Simple tables are automatically converted to clean GFM pipe table syntax
  - Complex tables (with colspan, rowspan, or multiple tbody sections) fall back to HTML format to preserve structure
  - Prevents data loss from unsupported table features in GFM format
  - Added comprehensive test coverage for various table scenarios
  - New dependency: `markdownify` library for table conversion
- Enhanced InterruptedError API: Added public properties to `InterruptedError` for better error introspection in #346
  - New `kind` property exposes the interruption type (abort or token limit exceeded)
  - New `metering` property provides direct access to OCR token usage data
  - `OCRTokensMetering` is now exported from the public API for convenience
  - Enables users to programmatically handle different interruption scenarios and track resource consumption
Bug Fixes
- Fixed Error Propagation: Corrected handling of critical error types during page extraction in #343
  - `AbortError` and `TokenLimitError` now propagate correctly instead of being wrapped in `OCRError`
  - Ensures interruption signals are properly received and handled by calling code
  - Prevents masking of user-initiated abort operations and token limit violations
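The propagation fix follows a common pattern: re-raise interruption signals before the catch-all wrapper sees them. The sketch below uses local stand-in exception classes and a hypothetical `extract_page` helper; it does not reproduce pdf-craft's internal code.

```python
# Local stand-ins for illustration; the real classes live inside pdf-craft.
class OCRError(Exception):
    pass


class AbortError(Exception):
    pass


class TokenLimitError(Exception):
    pass


def extract_page(recognize, page):
    """Run recognize(page), wrapping ordinary failures but not interruptions."""
    try:
        return recognize(page)
    except (AbortError, TokenLimitError):
        # Interruption signals must reach the caller unchanged.
        raise
    except Exception as error:
        # Everything else is reported as an OCR failure, keeping the cause.
        raise OCRError(f"failed to extract page {page}") from error
```

The order of the `except` clauses is the whole point: Python tries them top to bottom, so the specific clause has to come before `except Exception`.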
Dependencies
- EPUB Generator Update: Upgraded `epub-generator` dependency to fix MathML property declaration bug in #344
  - Fixes Moskize91/epub-generator#22: OPF files incorrectly declared the `mathml` property when LaTeX-to-MathML conversion failed, causing EPUBCheck validation failures
  - EPUB files now pass validation by only declaring MathML properties when actual MathML content exists
Migration Notes
InterruptedError Changes
If you're catching InterruptedError exceptions, you can now access detailed information about the interruption:
from pdf_craft import transform_markdown, InterruptedError, InterruptedKind
try:
    transform_markdown(
        pdf_path="input.pdf",
        markdown_path="output.md",
    )
except InterruptedError as error:
    # New in v1.0.11: Access interruption details
    if error.kind == InterruptedKind.ABORT:
        print("User aborted the operation")
    elif error.kind == InterruptedKind.TOKEN_LIMIT_EXCEEDED:
        print(f"Token limit exceeded: {error.metering.input_tokens} input tokens used")
    # Access token usage statistics
    print(f"Total tokens: {error.metering.input_tokens + error.metering.output_tokens}")
Table Rendering
Tables in your PDF documents will now be converted to GFM format when possible, making them more readable in markdown viewers. Complex tables will automatically fall back to HTML to preserve their structure.
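To make the "simple table or fall back to HTML" rule concrete, here is a minimal standard-library sketch. The release notes say the project uses `markdownify` for the conversion; this example deliberately avoids it and uses `html.parser` instead, so the class and function names are hypothetical and the behavior is only an approximation of the rule described above.

```python
from html.parser import HTMLParser


class _TableScanner(HTMLParser):
    """Collect cell text and note features that GFM pipe tables cannot express."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self.tbody_count = 0
        self.has_spans = False
        self._cell = None       # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.tbody_count += 1
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            # Conservative: any colspan/rowspan attribute counts as complex.
            if any(name in ("colspan", "rowspan") for name, _ in attrs):
                self.has_spans = True
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self.rows:
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text.replace("|", "\\|"))
            self._cell = None


def table_to_markdown(html: str) -> str:
    """Return a GFM pipe table for simple tables, else the original HTML."""
    scanner = _TableScanner()
    scanner.feed(html)
    scanner.close()
    widths = {len(row) for row in scanner.rows}
    is_simple = (
        not scanner.has_spans
        and scanner.tbody_count <= 1
        and len(widths) == 1
        and 0 not in widths
    )
    if not is_simple:
        return html  # keep the structure rather than lose data
    header, *body = scanner.rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Returning the input unchanged in the complex case is what prevents data loss: a pipe table has no way to represent merged cells.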
Full Changelog: v1.0.10...v1.0.11
v1.0.10
This release simplifies the table of contents (TOC) extraction API by replacing enum-based modes with a boolean flag, while adding LLM-powered chapter title analysis capabilities for improved TOC hierarchy detection.
What's Changed
Breaking Changes
- Simplified TOC API: Replaced the `TocExtractionMode` enum with a simpler `toc_assumed` boolean parameter in #341
  - Removed `toc_mode` parameter from `transform_markdown()` and `transform_epub()` functions
  - Removed `TocExtractionMode` from public API exports
  - Introduced `toc_assumed` boolean flag to control TOC detection behavior
Features
- LLM-Powered Chapter Title Analysis: Added support for LLM-based analysis of chapter titles to enhance TOC extraction accuracy in #341
  - Automatically analyzes chapter title hierarchies when `toc_llm` is configured
  - Provides more accurate chapter level detection for complex book structures
  - Falls back to standard analysis when the LLM is unavailable or encounters errors
Improvements
- Enhanced Error Handling: Added robust error handling for LLM-based analysis with automatic recovery mechanisms in #341
- Better error diagnostics for LLM analysis failures
- Graceful degradation when LLM analysis fails, ensuring conversion continues successfully
Migration Guide
If you were using toc_mode in previous versions, update your code as follows:
Previous API (v1.0.9 and earlier)
from pdf_craft import transform_markdown, transform_epub, TocExtractionMode
# For Markdown conversion
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_mode=TocExtractionMode.NO_TOC_PAGE,  # Old parameter
)
# For EPUB conversion
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.AUTO_DETECT,  # Old parameter
)
New API (v1.0.10)
from pdf_craft import transform_markdown, transform_epub
# For Markdown conversion (assumes no TOC pages by default)
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_assumed=False,  # New boolean parameter (default: False)
)
# For EPUB conversion (assumes TOC pages exist)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # New boolean parameter
)
Migration Mapping
| Old `toc_mode` Value | New `toc_assumed` Value |
|---|---|
| `TocExtractionMode.NO_TOC_PAGE` | `False` |
| `TocExtractionMode.AUTO_DETECT` | `True` |
| `TocExtractionMode.LLM_ENHANCED` | `True` (with `toc_llm` configured) |
LLM-Enhanced TOC Extraction
To use LLM-powered chapter title analysis:
from pdf_craft import transform_epub, BookMeta, LLM
# Configure LLM for TOC enhancement
toc_llm = LLM(
    key="your-api-key",
    url="https://api.openai.com/v1",
    model="gpt-4",
    token_encoding="cl100k_base",
)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # Enable TOC detection
    toc_llm=toc_llm,  # Enable LLM-powered analysis
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
Notes
- The `toc_assumed` parameter defaults to `False` for Markdown conversion and `True` for EPUB conversion (maintaining backward-compatible behavior)
- LLM-powered chapter title analysis is optional and automatically falls back to standard analysis if not configured or if errors occur
- The new API is simpler and more intuitive, reducing the cognitive load of choosing between multiple enum values
Full Changelog: v1.0.9...v1.0.10
v1.0.9
This release introduces enhanced table of contents (TOC) extraction capabilities using LLM-powered analysis, enabling more accurate chapter structure detection and hierarchy recognition.
What's Changed
Features
- LLM-Powered TOC Level Extraction: Implemented LLM-based analysis to automatically extract and recognize hierarchical levels in the table of contents, improving chapter structure accuracy in #336
  - Resolves #268
- Enhanced TOC Page Processing: Modified the TOC detection algorithm to pass all identified TOC pages to the LLM for comprehensive analysis, rather than processing them individually in #338
  - Improves the accuracy of chapter hierarchy detection
  - Provides better context for LLM analysis by including all TOC pages
Refactoring
- LLM Analyzer Refactoring: Refactored `llm_analyser.py` to improve code maintainability and extensibility in #339
Background
Previously, pdf-craft used statistical analysis to detect TOC pages and extract chapter structure. While effective for basic cases, this approach had limitations in accurately determining chapter hierarchies and handling complex TOC layouts. This release introduces LLM-powered analysis to better understand TOC structure and extract hierarchical information.
How It Works
The new TOC extraction process:
- Identify TOC Pages: Uses statistical analysis to detect which pages contain table of contents
- Collect All TOC Pages: Gathers all identified TOC pages for comprehensive analysis
- LLM Analysis: Passes all TOC pages to an LLM to extract chapter titles and their hierarchical levels
- Structure Generation: Uses the extracted hierarchy information to build accurate EPUB navigation structure
This approach combines the efficiency of statistical detection with the semantic understanding capabilities of LLMs, resulting in more accurate chapter organization in the final output.
Usage
The TOC extraction improvements are automatically applied when using the appropriate toc_mode:
from pdf_craft import transform_epub, BookMeta, TocExtractionMode
# Use AUTO_DETECT for statistical analysis (default for EPUB)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.AUTO_DETECT,
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
# Use LLM_ENHANCED for LLM-powered extraction (requires toc_llm configuration)
from pdf_craft import LLM
toc_llm = LLM(
    key="your-api-key",
    url="https://api.openai.com/v1",
    model="gpt-4",
    token_encoding="cl100k_base",
)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.LLM_ENHANCED,
    toc_llm=toc_llm,
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
Notes
- Important: When using `TocExtractionMode.LLM_ENHANCED`, the `toc_llm` parameter must be configured. The conversion will fail if `toc_llm` is not provided.
- This feature is most beneficial for books with complex chapter hierarchies
- The statistical TOC page detection remains as the first step, with LLM analysis enhancing the extraction quality
Full Changelog: v1.0.8...v1.0.9
v1.0.8
This release brings enhanced error handling flexibility, improved OCR text quality, and important security fixes.
What's Changed
Features
- Enhanced Error Handling: The `ignore_pdf_errors` and `ignore_ocr_errors` parameters now accept custom checker functions in addition to boolean flags, enabling more granular control over error suppression in #323
- Improved OCR Text Quality: Implemented n-gram detection to automatically filter out repetitive character sequences that indicate neural text degradation in #330
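One simple way to flag degenerate, looping OCR output with character n-grams is to measure how few distinct n-grams a passage contains. The sketch below is an assumption about the general idea, not the detector added in #330; the function names, the n-gram size, and the thresholds are all hypothetical.

```python
def distinct_ngram_ratio(text: str, n: int = 4) -> float:
    """Share of character n-grams in text that are distinct (1.0 = no repeats)."""
    total = len(text) - n + 1
    if total <= 0:
        return 1.0
    grams = {text[i:i + n] for i in range(total)}
    return len(grams) / total


def looks_degenerate(text: str, n: int = 4, threshold: float = 0.2,
                     min_length: int = 40) -> bool:
    """Flag long passages whose n-grams are almost all repeats."""
    # Short strings are skipped: a few repeated words are normal in real text.
    return len(text) >= min_length and distinct_ngram_ratio(text, n) < threshold
```

Ordinary prose keeps the ratio close to 1.0, while a model stuck repeating the same few characters drives it toward zero, so a low threshold separates the two cases.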
Security
- Security Fix: Upgraded `pypdf` from `^6.4.1` to `^6.6.0` to address the CVE-2026-22691 vulnerability in #329
  - Fixes an issue where malicious PDFs could cause long-running processes when processing invalid startxref entries
  - Resolves #328
Other
- Code formatting improvements in #331
- README image link update by @alwaysmavs in #324
Example Usage
Custom Error Handling with Functions
from pdf_craft import transform_markdown, OCRError
def should_ignore_ocr_error(error: OCRError) -> bool:
    # Only ignore specific types of OCR errors
    return error.kind == "recognition_failed"
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    ignore_ocr_errors=should_ignore_ocr_error,  # Pass custom function
)
Traditional Boolean Error Handling (Still Supported)
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    ignore_ocr_errors=True,  # Simple boolean flag
)
API Changes
The following parameters have been enhanced to accept both boolean values and callable functions:
- `ignore_pdf_errors`: `bool | Callable[[PDFError], bool]`
- `ignore_ocr_errors`: `bool | Callable[[OCRError], bool]`
This change is fully backward compatible - existing code using boolean values will continue to work without modifications.
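A parameter typed as "boolean or checker function" is usually resolved with one small helper. The sketch below shows that pattern under stated assumptions: `should_ignore` and the `ErrorChecker` alias are hypothetical names, and the library's internal handling may differ.

```python
from typing import Callable, Union

# Hypothetical alias mirroring the documented parameter type.
ErrorChecker = Union[bool, Callable[[Exception], bool]]


def should_ignore(setting: ErrorChecker, error: Exception) -> bool:
    """Resolve a bool-or-callable setting for one concrete error."""
    if isinstance(setting, bool):
        # Plain flag: the same answer for every error.
        return setting
    # Checker function: let the caller decide per error.
    return bool(setting(error))
```

Checking for `bool` first is what keeps existing `True`/`False` callers working unchanged.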
Full Changelog: v1.0.7...v1.0.8
v1.0.7
This release adds support for including cover images in both Markdown and EPUB conversions, enhancing the output format options.
What's Changed
Features
- Cover Image Support: Added `includes_cover` parameter to both `transform_markdown` and `transform_epub` functions, allowing you to include the PDF's cover page as an image in the output in #319
  - For Markdown conversion: The cover image is saved to the images folder and can be referenced in your document
  - For EPUB conversion: The cover image is properly embedded in the EPUB file structure
  - Default value is `False` for Markdown (to maintain backward compatibility) and `True` for EPUB
Example Usage
Markdown with Cover
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
    includes_cover=True,  # Include cover image
)
EPUB with Cover
from pdf_craft import transform_epub, BookMeta
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    includes_cover=True,  # Include cover image (default)
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)
Full Changelog: v1.0.6...v1.0.7
v1.0.6
This release brings significant improvements to PDF rendering control, text quality, and error handling capabilities.
What's Changed
Features
- Flexible DPI Control: Added `dpi` parameter to control PDF page rendering resolution (default: 300 DPI), allowing you to balance image quality against file size in #315
- Automatic Image Size Optimization: Introduced `max_page_image_file_size` parameter that automatically adjusts DPI when generated images exceed specified size limits, preventing overly large output files in #315
- Resilient OCR Processing: Added `ignore_ocr_errors` parameter to continue processing when OCR recognition fails on individual pages, instead of stopping the entire conversion in #314
- Improved Text Quality: Automatically removes Unicode surrogate characters from OCR-extracted text and PDF metadata (title, authors, publisher, etc.), ensuring cleaner output and better compatibility with downstream tools in #316
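The interaction between `dpi` and `max_page_image_file_size` can be pictured as a retry loop that lowers the resolution until the rendered page fits. The sketch below is only an illustration of that idea: `render_within_limit`, the `render` callback, and `min_dpi` are hypothetical, and the library's actual adjustment strategy is not documented here.

```python
from typing import Callable, Optional, Tuple


def render_within_limit(
    render: Callable[[int], bytes],
    dpi: int = 300,
    max_size: Optional[int] = None,
    min_dpi: int = 72,
) -> Tuple[int, bytes]:
    """Render at dpi, lowering the resolution until the image fits max_size."""
    data = render(dpi)
    while max_size is not None and len(data) > max_size and dpi > min_dpi:
        # Pixel count grows with dpi squared, so scale by the square root
        # of the size ratio, and always step down by at least 1.
        scale = (max_size / len(data)) ** 0.5
        dpi = max(min_dpi, min(dpi - 1, int(dpi * scale)))
        data = render(dpi)
    return dpi, data
```

The loop stops at `min_dpi` even if the image is still too large, so a tight limit degrades quality only down to a floor instead of looping forever.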
Dependencies
- Updated `epub-generator` to 0.1.6
Example Usage
from pdf_craft import transform_markdown
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    dpi=300,  # Control rendering resolution
    max_page_image_file_size=5242880,  # 5MB limit per page
    ignore_ocr_errors=True,  # Continue on OCR failures
)
Full Changelog: v1.0.5...v1.0.6
v1.0.5
Release v1.0.5
What's Changed
Bug Fixes
- GPU memory overflow: Fix out-of-memory errors on RTX 3060 (12GB VRAM) by upgrading doc-page-extractor dependency to optimize model loading sequence (#309, fixes #305)
- TOC detection: Improve table of contents detection accuracy by ensuring page indexes are consecutive sequences within the first 17% of document and adding _TOC_SCORE_MIN_RATIO limitation (#311, #313)
- Content processing: Fix content override issue (#312)
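The "consecutive page indexes within the first 17% of the document" constraint can be sketched as follows. This is a guess at the shape of the rule, not the project's code: the function name, the zero-based page indexes, and picking the longest run are assumptions, and the `_TOC_SCORE_MIN_RATIO` limit is not modelled because its semantics are not described in these notes.

```python
from typing import List


def pick_toc_pages(candidates: List[int], total_pages: int,
                   max_ratio: float = 0.17) -> List[int]:
    """Keep the longest run of consecutive candidate pages near the front."""
    limit = total_pages * max_ratio
    # Only pages in the leading part of the document can be TOC pages.
    early = sorted({p for p in candidates if p < limit})
    best: List[int] = []
    run: List[int] = []
    for page in early:
        if run and page == run[-1] + 1:
            run.append(page)
        else:
            run = [page]
        if len(run) > len(best):
            best = list(run)
    return best
```

Requiring a consecutive run filters out stray body pages that happen to mention several chapter titles.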
Full Changelog: v1.0.4...v1.0.5
v1.0.4
Release v1.0.4
What's New
🎯 Table of Contents Detection and Smart Removal
pdf-craft now automatically detects and removes table of contents pages from the final output, preventing duplicate TOC content in generated EPUB files. The system uses statistical analysis to identify TOC pages by matching chapter titles against page content, then intelligently excludes these pages while preserving the navigation structure.
Related: #268
Key features:
- Automatic TOC page detection using Aho-Corasick substring matching
- Hierarchical TOC level analysis for improved chapter organization
- XML-based TOC storage for better performance and flexibility
- New `toc_assumed` parameter to control TOC detection behavior (default: `True` for EPUB, `False` for Markdown)
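The statistical idea of matching chapter titles against page content can be sketched with plain substring checks. Note the substitution: the release uses Aho-Corasick matching (via `pyahocorasick`) for efficiency, while this illustration uses Python's `in` operator to stay dependency-free. The function names and the `min_ratio` threshold are hypothetical.

```python
from typing import List


def score_pages(pages: List[str], titles: List[str]) -> List[int]:
    """Count, for each page, how many chapter titles appear in its text."""
    return [sum(1 for title in titles if title and title in page) for page in pages]


def detect_toc_pages(pages: List[str], titles: List[str],
                     min_ratio: float = 0.5) -> List[int]:
    """Return indexes of pages that mention a large share of all titles."""
    if not titles:
        return []
    scores = score_pages(pages, titles)
    return [i for i, score in enumerate(scores) if score / len(titles) >= min_ratio]
```

A body page normally contains only its own chapter title, whereas a TOC page lists many of them, which is what the score separates.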
📝 Raw HTML Tag Support in Markdown
Full support for CommonMark-compliant raw HTML tags in Markdown output. DeepSeek OCR often generates HTML tags (like <sup> for superscripts) when processing scanned books - these are now properly preserved and rendered in both Markdown and EPUB formats.
Related: #283
Supported tags include:
- Inline tags: `<sup>`, `<sub>`, `<mark>`, `<u>`, `<kbd>`
- Block-level tags: `<div>`, `<center>`, `<details>`, `<summary>`
- Automatic safety filtering and attribute validation
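Safety filtering of raw HTML is commonly done with a tag allowlist plus an attribute allowlist. The sketch below shows that approach using only the standard library. The tag set is copied from the list above, but the attribute set, the class name, and the `sanitize` function are assumptions and do not describe the project's actual filter.

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"sup", "sub", "mark", "u", "kbd", "div", "center", "details", "summary"}
ALLOWED_ATTRS = {"class", "id", "title"}  # hypothetical attribute allowlist


class _Sanitizer(HTMLParser):
    """Re-emit allowed tags and attributes; drop every other tag, keep its text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return
        kept = ""
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                kept += ' {}="{}"'.format(name, html.escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is always escaped so it cannot reintroduce markup.
        self.out.append(html.escape(data, quote=False))


def sanitize(fragment: str) -> str:
    """Return fragment with only allowlisted tags and attributes kept."""
    parser = _Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

An allowlist fails closed: any tag or attribute nobody thought about is dropped rather than passed through.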
📊 Enhanced Table Rendering
Tables are now rendered in native HTML format for both Markdown and EPUB outputs, providing better structure and readability. Asset metadata now supports structured titles and captions for equations, images, and tables.
📖 PDF Metadata Extraction
Automatically extracts book metadata (title, authors, publisher, ISBN, etc.) from PDF files and uses it to populate EPUB metadata. No need to manually specify book information when the PDF already contains it.
📰 Multi-Column Layout Detection
Improved handling of multi-column layouts (common in academic papers and magazines) through histogram valley detection and coefficient-of-variation splitting. Layouts are now correctly grouped by column segments before processing.
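Histogram valley detection can be sketched as projecting layout boxes onto the x-axis and looking for an empty gap between populated regions. This is a simplified illustration under stated assumptions: the function name, the bin count, and the valley threshold are hypothetical, and the coefficient-of-variation splitting mentioned above is not modelled.

```python
from typing import List, Optional, Tuple


def find_column_split(boxes: List[Tuple[float, float]], page_width: float,
                      bins: int = 100, valley_ratio: float = 0.1) -> Optional[float]:
    """Return the x position of the widest gap between columns, or None."""
    # Histogram of how many boxes cover each horizontal bin.
    hist = [0] * bins
    for x0, x1 in boxes:
        lo = max(0, int(x0 / page_width * bins))
        hi = min(bins, int(x1 / page_width * bins) + 1)
        for b in range(lo, hi):
            hist[b] += 1
    peak = max(hist, default=0)
    if peak == 0:
        return None
    threshold = peak * valley_ratio
    # Ignore the page margins: only look between the first and last busy bins.
    first = next(i for i, v in enumerate(hist) if v > threshold)
    last = bins - 1 - next(i for i, v in enumerate(reversed(hist)) if v > threshold)
    best_start, best_len, start = None, 0, None
    for i in range(first, last + 1):
        if hist[i] <= threshold:
            if start is None:
                start = i
            if i - start + 1 > best_len:
                best_start, best_len = start, i - start + 1
        else:
            start = None
    if best_start is None:
        return None
    # Middle of the valley, converted back to page coordinates.
    return (best_start + best_len / 2) / bins * page_width
```

Boxes whose horizontal center falls left or right of the returned position can then be grouped into separate column segments before further processing.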
🐛 Bug Fixes
- Fixed PIL crash on invalid bounding boxes: Added validation and normalization for layout bounding boxes to prevent crashes when cropping images with invalid coordinates (#295)
- Fixed DeepSeek OCR center tag handling: Ignored alignment tags (`<center>`, `<left>`, `<right>`) generated by DeepSeek OCR that aren't needed in the output (#307)
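Bounding-box validation before cropping usually means ordering the corners, clamping them to the image, and rejecting empty boxes. The helper below is a minimal sketch of that idea; `normalize_bbox` is a hypothetical name and the project's actual checks may differ.

```python
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]


def normalize_bbox(bbox: Tuple[float, float, float, float],
                   width: int, height: int) -> Optional[BBox]:
    """Return a crop-safe (left, top, right, bottom) box, or None if empty."""
    x0, y0, x1, y1 = bbox
    # Corners may arrive swapped; order them first.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    # Clamp to the image so the crop never reaches outside it.
    left = max(0, min(width, int(round(left))))
    right = max(0, min(width, int(round(right))))
    top = max(0, min(height, int(round(top))))
    bottom = max(0, min(height, int(round(bottom))))
    if right - left < 1 or bottom - top < 1:
        return None  # nothing left to crop
    return left, top, right, bottom
```

Callers can skip the layout entirely when `None` comes back instead of handing an invalid box to the image library.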
🔧 Improvements
- Refined layout joining logic: Improved paragraph merging across page boundaries with better handling of override assets and line continuation (#287, #288)
- Updated dependencies:
  - Upgraded `doc-page-extractor` from 1.0.10 to 1.0.11 (#289)
  - Upgraded `epub-generator` from 0.1.2 to 0.1.5
  - Added `pyahocorasick` 2.2.0 for efficient substring matching
- CI/CD enhancements: Added merge-build workflow for automated builds on main branch pushes (#289)
📚 Documentation
- Updated README with new `toc_assumed` parameter documentation (#304)
- Refreshed documentation images with hosted assets
🔄 API Changes
New Parameters
- `toc_assumed` parameter in `transform_markdown()` and `transform_epub()`:
  - When `True`: Attempts to locate and extract TOC from PDF to build document structure
  - When `False`: Generates TOC based on document headings only
  - Default: `True` for EPUB, `False` for Markdown
New Exports
PDFDocumentMetadata: Dataclass for PDF metadata extraction
🙏 Contributors
Thanks to everyone who contributed to this release!
📦 Installation
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft==1.0.4
For detailed installation instructions, see the Installation Guide.
Full Changelog: v1.0.3...v1.0.4
v1.0.3
Release v1.0.3
What's Changed
License Improvements
- Removed PyMuPDF (fitz) Dependency: Replaced PyMuPDF (AGPL-3.0) with Poppler for PDF parsing and rendering, maintaining pdf-craft's MIT license compatibility
  - pdf-craft now uses Poppler via `pdf2image` (MIT) for all PDF operations
  - This change ensures the entire project remains under the permissive MIT license
New Features
- Custom PDF Handler Support: Added `pdf_handler` parameter to `predownload_models()`, `transform_markdown()`, and `transform_epub()` functions, allowing users to customize the PDF rendering implementation
- Poppler Integration: Migrated to Poppler (via `pdf2image`) for PDF parsing and rendering, providing better compatibility and control
- New Public APIs: Exported `PDFHandler`, `PDFDocument`, `DefaultPDFHandler`, and `DefaultPDFDocument` for advanced customization
- RENDERED Event: Added `OCREventKind.RENDERED` event to track PDF page rendering progress
Breaking Changes
ignore_fitz_errors → ignore_pdf_errors
- Update your code: `transform_markdown(..., ignore_pdf_errors=True)` instead of `ignore_fitz_errors=True`
- Update your code: `transform_epub(..., ignore_pdf_errors=True)` instead of `ignore_fitz_errors=True`
FitzError → PDFError
- Update your exception handling code accordingly
Dependencies
- New Requirement: Poppler must be installed separately for PDF parsing
  - Ubuntu/Debian: `sudo apt-get install poppler-utils`
  - macOS: `brew install poppler`
  - Windows: Download from oschwartz10612/poppler-windows
  - See Installation Guide for details
Bug Fixes
- Upgraded doc-page-extractor to fix bugs (#280)
Migration Guide
If you're upgrading from v1.0.2, please:
- Install Poppler following the Installation Guide
- Update parameter names in your code:
  # Before (v1.0.2)
  transform_markdown(..., ignore_fitz_errors=True)
  # After (v1.0.3)
  transform_markdown(..., ignore_pdf_errors=True)
- Update exception handling if you catch `FitzError`:
  # Before (v1.0.2)
  from pdf_craft import FitzError
  # After (v1.0.3)
  from pdf_craft import PDFError
Full Changelog
Full Changelog: v1.0.2...v1.0.3