Releases: oomol-lab/pdf-craft

v1.0.12

02 Mar 06:45
bc0a844

This release focuses on text quality in long-form documents: better cross-page paragraph merging and context-aware punctuation normalization for Chinese text.

What's Changed

Features

  • Chinese punctuation normalization in #351
    • Added punctuation normalization in chapter generation flow before internal level analysis
    • Converts ASCII punctuation to full-width punctuation in Han text context
    • Applies to chapter content, references, and asset title/caption while keeping raw asset body content unchanged
    • Handles normalization across segmented text content and HTML tag boundaries
    • Fixes #310

Bug Fixes

  • Fixed false negatives in cross-page paragraph merging in #349 & #348
    • Prioritized hyphenated-word continuation checks before numbering checks
    • Removed over-strict uppercase/Latin start heuristics that split natural paragraphs
    • Added real-world regression coverage for previously split continuous paragraphs
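The check-ordering fix can be sketched as follows; the function name and regexes are illustrative, not the library's code. The point is that the hyphenation test must run first, so a continuation like `hyphen-` followed by `1) ated` is not misread as a new numbered item:

```python
import re

def should_merge(prev_line: str, next_line: str) -> bool:
    """Illustrative ordering of cross-page merge checks (not pdf-craft's code)."""
    # 1. A word hyphenated across the page break always continues the paragraph.
    if prev_line.rstrip().endswith("-"):
        return True
    # 2. Only then does a leading list number mark a new paragraph.
    if re.match(r"^\s*\d+[.)]\s", next_line):
        return False
    # No uppercase/Latin-start heuristic: natural paragraphs often
    # resume with a capitalized word after a page break.
    return True
```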

Full Changelog: v1.0.11...v1.0.12

v1.0.11

07 Feb 07:44
99a8a29

This release adds support for GitHub-Flavored Markdown (GFM) table rendering, enhances error handling for interruption scenarios, and includes important bug fixes and dependency updates.

What's Changed

Features

  • GFM Table Support: Added intelligent conversion of HTML tables to GitHub-Flavored Markdown format in #345

    • Simple tables are automatically converted to clean GFM pipe table syntax
    • Complex tables (with colspan, rowspan, or multiple tbody sections) gracefully fall back to HTML format to preserve structure
    • Prevents data loss from unsupported table features in GFM format
    • Added comprehensive test coverage for various table scenarios
    • New dependency: markdownify library for table conversion
  • Enhanced InterruptedError API: Added public properties to InterruptedError for better error introspection in #346

    • New kind property exposes the interruption type (abort or token limit exceeded)
    • New metering property provides direct access to OCR token usage data
    • OCRTokensMetering is now exported from the public API for convenience
    • Enables users to programmatically handle different interruption scenarios and track resource consumption
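The table fallback logic described under GFM Table Support can be sketched as below. pdf-craft performs the actual conversion with the markdownify library; this regex-based converter is only a hedged illustration of the simple-vs-complex decision:

```python
import re

def html_table_to_gfm(html: str) -> str:
    """Illustrative sketch: convert a simple HTML table to a GFM pipe
    table, falling back to raw HTML when GFM cannot represent it."""
    low = html.lower()
    # colspan/rowspan and multiple tbody sections have no GFM equivalent.
    if "colspan" in low or "rowspan" in low or low.count("<tbody") > 1:
        return html
    rows = re.findall(r"<tr[^>]*>(.*?)</tr>", html, re.S | re.I)
    cells = [re.findall(r"<t[hd][^>]*>(.*?)</t[hd]>", r, re.S | re.I) for r in rows]
    lines = ["| " + " | ".join(c.strip() for c in row) + " |" for row in cells]
    if lines:
        # GFM requires a delimiter row after the header row.
        lines.insert(1, "|" + "---|" * len(cells[0]))
    return "\n".join(lines)
```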

Bug Fixes

  • Fixed Error Propagation: Corrected handling of critical error types during page extraction in #343
    • AbortError and TokenLimitError now propagate correctly instead of being wrapped in OCRError
    • Ensures interruption signals are properly received and handled by calling code
    • Prevents masking of user-initiated abort operations and token limit violations
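The corrected propagation pattern looks roughly like this. The exception class names mirror the release notes, but the function body is an illustrative sketch, not the library's implementation:

```python
class AbortError(Exception): ...
class TokenLimitError(Exception): ...
class OCRError(Exception): ...

def extract_page(run):
    """Sketch of the fix: interruption signals pass through unwrapped."""
    try:
        return run()
    except (AbortError, TokenLimitError):
        # User aborts and token-limit violations must reach the caller as-is.
        raise
    except Exception as exc:
        # Everything else is a genuine OCR failure and gets wrapped.
        raise OCRError("page extraction failed") from exc
```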

Dependencies

  • EPUB Generator Update: Upgraded epub-generator dependency to fix MathML property declaration bug in #344
    • Fixes Moskize91/epub-generator#22: OPF files incorrectly declared mathml property when LaTeX-to-MathML conversion failed, causing EPUBCheck validation failures
    • EPUB files now pass validation by only declaring MathML properties when actual MathML content exists
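The validation fix boils down to declaring the property conditionally. A minimal sketch, with an illustrative function name (the real logic lives in epub-generator):

```python
def opf_item_properties(xhtml: str) -> str:
    """Illustrative sketch: declare the OPF 'mathml' property only when
    the chapter document actually contains a MathML element."""
    props = []
    if "<math" in xhtml:
        props.append("mathml")
    return " ".join(props)
```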

Migration Notes

InterruptedError Changes

If you're catching InterruptedError exceptions, you can now access detailed information about the interruption:

from pdf_craft import transform_markdown, InterruptedError, InterruptedKind

try:
    transform_markdown(
        pdf_path="input.pdf",
        markdown_path="output.md",
    )
except InterruptedError as error:
    # New in v1.0.11: Access interruption details
    if error.kind == InterruptedKind.ABORT:
        print("User aborted the operation")
    elif error.kind == InterruptedKind.TOKEN_LIMIT_EXCEEDED:
        print(f"Token limit exceeded: {error.metering.input_tokens} input tokens used")

    # Access token usage statistics
    print(f"Total tokens: {error.metering.input_tokens + error.metering.output_tokens}")

Table Rendering

Tables in your PDF documents will now be converted to GFM format when possible, making them more readable in markdown viewers. Complex tables will automatically fall back to HTML to preserve their structure.

Full Changelog: v1.0.10...v1.0.11

v1.0.10

03 Feb 05:29
9b4876e

This release simplifies the table of contents (TOC) extraction API by replacing enum-based modes with a boolean flag, while adding LLM-powered chapter title analysis capabilities for improved TOC hierarchy detection.

What's Changed

Breaking Changes

  • Simplified TOC API: Replaced TocExtractionMode enum with a simpler toc_assumed boolean parameter in #341
    • Removed toc_mode parameter from transform_markdown() and transform_epub() functions
    • Removed TocExtractionMode from public API exports
    • Introduced toc_assumed boolean flag to control TOC detection behavior

Features

  • LLM-Powered Chapter Title Analysis: Added support for LLM-based analysis of chapter titles to enhance TOC extraction accuracy in #341
    • Automatically analyzes chapter title hierarchies when toc_llm is configured
    • Provides more accurate chapter level detection for complex book structures
    • Intelligently falls back to standard analysis when LLM is unavailable or encounters errors

Improvements

  • Enhanced Error Handling: Added robust error handling for LLM-based analysis with automatic recovery mechanisms in #341
    • Better error diagnostics for LLM analysis failures
    • Graceful degradation when LLM analysis fails, ensuring conversion continues successfully
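The fallback strategy can be sketched as follows; the function names and the trivial stand-in heuristic are illustrative assumptions, not pdf-craft's internals:

```python
def analyze_chapter_levels(titles, llm=None):
    """Sketch of graceful degradation: try LLM analysis first, fall back
    to standard analysis on any failure. Names are illustrative."""
    if llm is not None:
        try:
            return llm(titles)  # LLM-powered hierarchy analysis
        except Exception:
            pass                # degrade gracefully on any LLM failure
    # Standard analysis: a trivial flat hierarchy stands in here.
    return [0 for _ in titles]
```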

Migration Guide

If you were using toc_mode in previous versions, update your code as follows:

Previous API (v1.0.9 and earlier)

from pdf_craft import transform_markdown, transform_epub, TocExtractionMode

# For Markdown conversion
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_mode=TocExtractionMode.NO_TOC_PAGE,  # Old parameter
)

# For EPUB conversion
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.AUTO_DETECT,  # Old parameter
)

New API (v1.0.10)

from pdf_craft import transform_markdown, transform_epub

# For Markdown conversion (assumes no TOC pages by default)
transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    toc_assumed=False,  # New boolean parameter (default: False)
)

# For EPUB conversion (assumes TOC pages exist)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # New boolean parameter
)

Migration Mapping

Old toc_mode Value                New toc_assumed Value
TocExtractionMode.NO_TOC_PAGE     False
TocExtractionMode.AUTO_DETECT     True
TocExtractionMode.LLM_ENHANCED    True (with toc_llm configured)

LLM-Enhanced TOC Extraction

To use LLM-powered chapter title analysis:

from pdf_craft import transform_epub, BookMeta, LLM

# Configure LLM for TOC enhancement
toc_llm = LLM(
    key="your-api-key",
    url="https://api.openai.com/v1",
    model="gpt-4",
    token_encoding="cl100k_base",
)

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_assumed=True,  # Enable TOC detection
    toc_llm=toc_llm,   # Enable LLM-powered analysis
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)

Notes

  • The toc_assumed parameter defaults to False for Markdown conversion and True for EPUB conversion (maintaining backward-compatible behavior)
  • LLM-powered chapter title analysis is optional and automatically falls back to standard analysis if not configured or if errors occur
  • The new API is simpler and more intuitive, reducing the cognitive load of choosing between multiple enum values

Full Changelog: v1.0.9...v1.0.10

v1.0.9

02 Feb 08:32
d1498b5

This release introduces enhanced table of contents (TOC) extraction capabilities using LLM-powered analysis, enabling more accurate chapter structure detection and hierarchy recognition.

What's Changed

Features

  • LLM-Powered TOC Level Extraction: Implemented LLM-based analysis to automatically extract and recognize hierarchical levels in table of contents, improving chapter structure accuracy in #336

  • Enhanced TOC Page Processing: Modified the TOC detection algorithm to pass all identified TOC pages to the LLM for comprehensive analysis, rather than processing them individually in #338

    • Improves the accuracy of chapter hierarchy detection
    • Provides better context for LLM analysis by including all TOC pages

Refactoring

  • LLM Analyzer Refactoring: Refactored llm_analyser.py to improve code maintainability and extensibility in #339

Background

Previously, pdf-craft used statistical analysis to detect TOC pages and extract chapter structure. While effective for basic cases, this approach had limitations in accurately determining chapter hierarchies and handling complex TOC layouts. This release introduces LLM-powered analysis to better understand TOC structure and extract hierarchical information.

How It Works

The new TOC extraction process:

  1. Identify TOC Pages: Uses statistical analysis to detect which pages contain table of contents
  2. Collect All TOC Pages: Gathers all identified TOC pages for comprehensive analysis
  3. LLM Analysis: Passes all TOC pages to an LLM to extract chapter titles and their hierarchical levels
  4. Structure Generation: Uses the extracted hierarchy information to build accurate EPUB navigation structure

This approach combines the efficiency of statistical detection with the semantic understanding capabilities of LLMs, resulting in more accurate chapter organization in the final output.
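The four steps above can be sketched end to end. Everything here is illustrative (the detector stand-in, the `llm` callable, and the `(title, level)` entry shape are assumptions, not pdf-craft's API):

```python
def looks_like_toc(page: str) -> bool:
    # Stand-in for the statistical detector: dotted leader lines are a hint.
    return page.count("....") >= 2

def extract_toc(pages, llm):
    """Illustrative sketch of the four-step TOC pipeline."""
    # 1-2. Statistical detection collects every candidate TOC page.
    toc_pages = [p for p in pages if looks_like_toc(p)]
    # 3. All TOC pages go to the LLM together, for shared context.
    entries = llm("\n".join(toc_pages)) if toc_pages else []
    # 4. The extracted (title, level) pairs drive the navigation structure.
    return entries
```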

Usage

The TOC extraction improvements are automatically applied when using the appropriate toc_mode:

from pdf_craft import transform_epub, BookMeta, TocExtractionMode

# Use AUTO_DETECT for statistical analysis (default for EPUB)
transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.AUTO_DETECT,
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)

# Use LLM_ENHANCED for LLM-powered extraction (requires toc_llm configuration)
from pdf_craft import LLM

toc_llm = LLM(
    key="your-api-key",
    url="https://api.openai.com/v1",
    model="gpt-4",
    token_encoding="cl100k_base",
)

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    toc_mode=TocExtractionMode.LLM_ENHANCED,
    toc_llm=toc_llm,
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)

Notes

  • Important: When using TocExtractionMode.LLM_ENHANCED, the toc_llm parameter must be configured. The conversion will fail if toc_llm is not provided.
  • This feature is most beneficial for books with complex chapter hierarchies
  • The statistical TOC page detection remains as the first step, with LLM analysis enhancing the extraction quality

Full Changelog: v1.0.8...v1.0.9

v1.0.8

14 Jan 02:49
69b774d

This release brings enhanced error handling flexibility, improved OCR text quality, and important security fixes.

What's Changed

Features

  • Enhanced Error Handling: The ignore_pdf_errors and ignore_ocr_errors parameters now accept custom checker functions in addition to boolean flags, enabling more granular control over error suppression in #323

  • Improved OCR Text Quality: Implemented n-gram detection to automatically filter out repetitive character sequences symptomatic of neural text degeneration in #330

Security

  • Security Fix: Upgraded pypdf from ^6.4.1 to ^6.6.0 to address CVE-2026-22691 vulnerability in #329
    • Fixes issue where malicious PDFs could cause long-running processes when processing invalid startxref entries
    • Resolves #328

Example Usage

Custom Error Handling with Functions

from pdf_craft import transform_markdown, OCRError

def should_ignore_ocr_error(error: OCRError) -> bool:
    # Only ignore specific types of OCR errors
    return error.kind == "recognition_failed"

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    ignore_ocr_errors=should_ignore_ocr_error,  # Pass custom function
)

Traditional Boolean Error Handling (Still Supported)

from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    ignore_ocr_errors=True,  # Simple boolean flag
)

API Changes

The following parameters have been enhanced to accept both boolean values and callable functions:

  • ignore_pdf_errors: bool | Callable[[PDFError], bool]
  • ignore_ocr_errors: bool | Callable[[OCRError], bool]

This change is fully backward compatible - existing code using boolean values will continue to work without modifications.
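Internally, a bool-or-callable parameter like this is typically normalized to a single checker. A minimal sketch of that pattern (the helper name is illustrative, not part of pdf-craft's API):

```python
from typing import Callable, Union

def as_checker(
    value: Union[bool, Callable[[Exception], bool]]
) -> Callable[[Exception], bool]:
    """Sketch: normalize a bool-or-callable parameter to one callable."""
    if callable(value):
        return value
    # A plain flag ignores the error entirely.
    return lambda error: bool(value)
```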

Full Changelog: v1.0.7...v1.0.8

v1.0.7

30 Dec 10:14
2ea8147

This release adds support for including cover images in both Markdown and EPUB conversions, enhancing the output format options.

What's Changed

Features

  • Cover Image Support: Added includes_cover parameter to both transform_markdown and transform_epub functions, allowing you to include the PDF's cover page as an image in the output in #319
    • For Markdown conversion: The cover image is saved to the images folder and can be referenced in your document
    • For EPUB conversion: The cover image is properly embedded in the EPUB file structure
    • Default value is False for Markdown (to maintain backward compatibility) and True for EPUB

Example Usage

Markdown with Cover

from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    markdown_assets_path="images",
    includes_cover=True,  # Include cover image
)

EPUB with Cover

from pdf_craft import transform_epub, BookMeta

transform_epub(
    pdf_path="input.pdf",
    epub_path="output.epub",
    includes_cover=True,  # Include cover image (default)
    book_meta=BookMeta(
        title="Book Title",
        authors=["Author"],
    ),
)

Full Changelog: v1.0.6...v1.0.7

v1.0.6

22 Dec 08:30
03e1b51

This release brings significant improvements to PDF rendering control, text quality, and error handling capabilities.

What's Changed

Features

  • Flexible DPI Control: Added dpi parameter to control PDF page rendering resolution (default: 300 DPI), allowing you to balance between image quality and file size in #315

  • Automatic Image Size Optimization: Introduced max_page_image_file_size parameter that automatically adjusts DPI when generated images exceed specified size limits, preventing overly large output files in #315

  • Resilient OCR Processing: Added ignore_ocr_errors parameter to continue processing when OCR recognition fails on individual pages, instead of stopping the entire conversion in #314

  • Improved Text Quality: Automatically removes Unicode surrogate characters from OCR-extracted text and PDF metadata (title, authors, publisher, etc.), ensuring cleaner output and better compatibility with downstream tools in #316
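Surrogate removal is straightforward to sketch. Lone UTF-16 surrogate code points (U+D800 through U+DFFF) are invalid in well-formed Unicode text but can leak out of OCR; the function name here is illustrative:

```python
def strip_surrogates(text: str) -> str:
    """Remove lone UTF-16 surrogate code points (U+D800-U+DFFF),
    which sometimes appear in OCR output. Illustrative sketch."""
    return "".join(ch for ch in text if not "\ud800" <= ch <= "\udfff")
```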

Documentation

  • DeepWiki Integration: Added DeepWiki badge for auto-refreshing documentation by @YogeLiu in #285

Dependencies

  • Updated epub-generator to 0.1.6

Example Usage

from pdf_craft import transform_markdown

transform_markdown(
    pdf_path="input.pdf",
    markdown_path="output.md",
    dpi=300,  # Control rendering resolution
    max_page_image_file_size=5242880,  # 5MB limit per page
    ignore_ocr_errors=True,  # Continue on OCR failures
)

Full Changelog: v1.0.5...v1.0.6

v1.0.5

20 Dec 06:07
b9ccc67

Release v1.0.5

What's Changed

Bug Fixes

  • GPU memory overflow: Fix out-of-memory errors on RTX 3060 (12GB VRAM) by upgrading doc-page-extractor dependency to optimize model loading sequence (#309, fixes #305)
  • TOC detection: Improve table of contents detection accuracy by requiring page indexes to form a consecutive sequence within the first 17% of the document and adding a _TOC_SCORE_MIN_RATIO limit (#311, #313)
  • Content processing: Fix content override issue (#312)

Full Changelog: v1.0.4...v1.0.5

v1.0.4

19 Dec 07:58
eea9cf7

Release v1.0.4

What's New

🎯 Table of Contents Detection and Smart Removal

pdf-craft now automatically detects and removes table of contents pages from the final output, preventing duplicate TOC content in generated EPUB files. The system uses statistical analysis to identify TOC pages by matching chapter titles against page content, then intelligently excludes these pages while preserving the navigation structure.

Related: #268

Key features:

  • Automatic TOC page detection using Aho-Corasick substring matching
  • Hierarchical TOC level analysis for improved chapter organization
  • XML-based TOC storage for better performance and flexibility
  • New toc_assumed parameter to control TOC detection behavior (default: True for EPUB, False for Markdown)
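pdf-craft does the title matching with Aho-Corasick (via pyahocorasick) for efficiency; a naive O(n·m) stand-in conveys the scoring idea, with names that are illustrative rather than the library's:

```python
def toc_page_score(page_text: str, chapter_titles: list) -> float:
    """Naive stand-in for the Aho-Corasick matcher: fraction of known
    chapter titles that appear verbatim on the page. Illustrative sketch."""
    if not chapter_titles:
        return 0.0
    hits = sum(1 for title in chapter_titles if title in page_text)
    return hits / len(chapter_titles)
```

A page whose score clears a threshold is a TOC-page candidate; the real matcher scans all titles in a single pass over the page text.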

📝 Raw HTML Tag Support in Markdown

Full support for CommonMark-compliant raw HTML tags in Markdown output. DeepSeek OCR often generates HTML tags (like <sup> for superscripts) when processing scanned books - these are now properly preserved and rendered in both Markdown and EPUB formats.

Related: #283

Supported tags include:

  • Inline tags: <sup>, <sub>, <mark>, <u>, <kbd>
  • Block-level tags: <div>, <center>, <details>, <summary>
  • Automatic safety filtering and attribute validation

📊 Enhanced Table Rendering

Tables are now rendered in native HTML format for both Markdown and EPUB outputs, providing better structure and readability. Asset metadata now supports structured titles and captions for equations, images, and tables.

#306

📖 PDF Metadata Extraction

Automatically extracts book metadata (title, authors, publisher, ISBN, etc.) from PDF files and uses it to populate EPUB metadata. No need to manually specify book information when the PDF already contains it.

#284

📰 Multi-Column Layout Detection

Improved handling of multi-column layouts (common in academic papers and magazines) through histogram valley detection and coefficient-of-variation splitting. Layouts are now correctly grouped by column segments before processing.

#286
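Histogram-valley splitting can be sketched as below. The threshold and function name are illustrative assumptions; the input is a horizontal projection histogram counting text pixels per bin across the page width:

```python
def split_at_valley(histogram):
    """Illustrative sketch of histogram-valley column splitting.
    Returns the bin index of the gutter between columns, or None."""
    if len(histogram) < 3:
        return None
    # The emptiest interior bin is the candidate gutter.
    i = min(range(1, len(histogram) - 1), key=histogram.__getitem__)
    # Split only when the valley is clearly emptier than the densest bin
    # (an assumed 4x ratio stands in for the coefficient-of-variation test).
    if histogram[i] * 4 <= max(histogram):
        return i
    return None
```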

🐛 Bug Fixes

  • Fixed PIL crash on invalid bounding boxes: Added validation and normalization for layout bounding boxes to prevent crashes when cropping images with invalid coordinates (#295)

  • Fixed DeepSeek OCR center tag handling: Ignored alignment tags (<center>, <left>, <right>) generated by DeepSeek OCR that aren't needed in the output (#307)
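The bounding-box fix above amounts to clamping and reordering coordinates before cropping. A minimal sketch (function name and the None-on-empty convention are illustrative):

```python
def normalize_bbox(box, page_width, page_height):
    """Clamp a layout bounding box to the page and reorder its corners
    so image cropping cannot crash. Returns None for a degenerate box."""
    x0, y0, x1, y1 = box
    # Clamp each coordinate to the page, then sort so x0 <= x1, y0 <= y1.
    x0, x1 = sorted((min(max(x0, 0), page_width), min(max(x1, 0), page_width)))
    y0, y1 = sorted((min(max(y0, 0), page_height), min(max(y1, 0), page_height)))
    if x1 - x0 <= 0 or y1 - y0 <= 0:
        return None  # nothing left to crop
    return (x0, y0, x1, y1)
```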

🔧 Improvements

  • Refined layout joining logic: Improved paragraph merging across page boundaries with better handling of override assets and line continuation (#287, #288)

  • Updated dependencies:

    • Upgraded doc-page-extractor from 1.0.10 to 1.0.11 (#289)
    • Upgraded epub-generator from 0.1.2 to 0.1.5
    • Added pyahocorasick 2.2.0 for efficient substring matching
  • CI/CD enhancements: Added merge-build workflow for automated builds on main branch pushes (#289)

📚 Documentation

  • Updated README with new toc_assumed parameter documentation (#304)
  • Refreshed documentation images with hosted assets

🔄 API Changes

New Parameters

  • toc_assumed parameter in transform_markdown() and transform_epub():
    • When True: Attempts to locate and extract TOC from PDF to build document structure
    • When False: Generates TOC based on document headings only
    • Default: True for EPUB, False for Markdown

New Exports

  • PDFDocumentMetadata: Dataclass for PDF metadata extraction

🙏 Contributors

Thanks to everyone who contributed to this release!

📦 Installation

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install pdf-craft==1.0.4

For detailed installation instructions, see the Installation Guide.


Full Changelog: v1.0.3...v1.0.4

v1.0.3

11 Dec 09:05
0eb2615

Release v1.0.3

What's Changed

License Improvements

  • Removed PyMuPDF (fitz) Dependency: Replaced PyMuPDF (AGPL-3.0) with Poppler for PDF parsing and rendering, maintaining pdf-craft's MIT license compatibility
    • pdf-craft now uses Poppler via pdf2image (MIT) for all PDF operations
    • This change ensures the entire project remains under the permissive MIT license

New Features

  • Custom PDF Handler Support: Added pdf_handler parameter to predownload_models(), transform_markdown(), and transform_epub() functions, allowing users to customize PDF rendering implementation
  • Poppler Integration: Migrated to Poppler (via pdf2image) for PDF parsing and rendering, providing better compatibility and control
  • New Public APIs: Exported PDFHandler, PDFDocument, DefaultPDFHandler, and DefaultPDFDocument for advanced customization
  • RENDERED Event: Added OCREventKind.RENDERED event to track PDF page rendering progress

Breaking Changes

⚠️ Parameter Renamed: ignore_fitz_errors → ignore_pdf_errors

  • Update your code: transform_markdown(..., ignore_pdf_errors=True) instead of ignore_fitz_errors=True
  • Update your code: transform_epub(..., ignore_pdf_errors=True) instead of ignore_fitz_errors=True

⚠️ Exception Renamed: FitzError → PDFError

  • Update your exception handling code accordingly

Dependencies

Bug Fixes

  • Upgraded doc-page-extractor to fix bugs (#280)

Migration Guide

If you're upgrading from v1.0.2, please:

  1. Install Poppler following the Installation Guide
  2. Update parameter names in your code:
    # Before (v1.0.2)
    transform_markdown(..., ignore_fitz_errors=True)
    
    # After (v1.0.3)
    transform_markdown(..., ignore_pdf_errors=True)
  3. Update exception handling if you catch FitzError:
    # Before (v1.0.2)
    from pdf_craft import FitzError
    
    # After (v1.0.3)
    from pdf_craft import PDFError

Full Changelog

Full Changelog: v1.0.2...v1.0.3