
feat: Add PaddleOCRVL PDF parser with restoration and split table merging #82

Merged: AdemBoukhris457 merged 4 commits into main from feature/paddleocr_vl_parser on Nov 15, 2025

Conversation

@AdemBoukhris457
Owner

✨ New Features

  • PaddleOCRVL PDF Parser: Added new PaddleOCRVLPDFParser class for end-to-end document parsing

    • Uses PaddleOCRVL Vision-Language Model for comprehensive document understanding
    • Single-pass extraction of all content types (headers, text, tables, charts, footnotes, figure titles)
    • Automatic chart recognition and conversion to structured table format
    • Support for pipe-delimited chart content parsing
  • Integrated Document Restoration: Optional DocRes image restoration support

    • Configurable restoration tasks (appearance, dewarping, deshadowing, deblurring, binarization, end2end)
    • GPU acceleration support for restoration processing
    • Enhanced page images saved to enhanced_pages/ directory
  • Split Table Merging: Automatic detection and merging of tables split across pages

    • Two-phase detection algorithm (proximity detection + structural validation)
    • Configurable thresholds for position, gap, and alignment tolerance
    • Merged tables processed with PaddleOCRVL for complete extraction
    • Visual merge results saved to tables/ directory
  • Structured Output Generation:

    • Markdown output with all extracted content
    • HTML output with formatted tables and charts
    • Excel file (tables.xlsx) containing all tables and charts as structured data
    • HTML file (tables.html) with structured tables and charts
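
The two-phase split table detection described above (proximity detection followed by structural validation) could look roughly like the sketch below. This is an illustration only, not the code in `paddleocr_vl_parser.py`: the `TableSegment` type, the function names, and all threshold defaults are hypothetical, and positions are assumed to be normalized to the page size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableSegment:
    """Hypothetical description of one detected table region (not a Doctra type)."""
    page: int                  # 1-based page number
    top: float                 # top edge as a fraction of page height (0.0 = top of page)
    bottom: float              # bottom edge as a fraction of page height
    column_edges: List[float]  # x positions of column boundaries, as fractions of page width


def is_proximity_candidate(upper: TableSegment, lower: TableSegment,
                           bottom_threshold: float = 0.8,
                           top_threshold: float = 0.2,
                           max_gap: float = 0.3) -> bool:
    """Phase 1: upper table ends near the page bottom, lower starts near the next page's top."""
    if lower.page != upper.page + 1:
        return False
    # Whitespace below the upper table plus whitespace above the lower table.
    gap = (1.0 - upper.bottom) + lower.top
    return (upper.bottom >= bottom_threshold
            and lower.top <= top_threshold
            and gap <= max_gap)


def is_structurally_compatible(upper: TableSegment, lower: TableSegment,
                               tolerance: float = 0.02) -> bool:
    """Phase 2: same number of column boundaries, each aligned within the tolerance."""
    if len(upper.column_edges) != len(lower.column_edges):
        return False
    return all(abs(a - b) <= tolerance
               for a, b in zip(upper.column_edges, lower.column_edges))


def should_merge(upper: TableSegment, lower: TableSegment) -> bool:
    """A pair is merged only if it passes both phases."""
    return is_proximity_candidate(upper, lower) and is_structurally_compatible(upper, lower)
```

Running the cheap positional check first means the column comparison is only done for pairs that already sit on either side of a page break.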

🔧 Technical Improvements

  • Added platform-specific safetensors dependency handling (Linux/Windows)
  • Integrated PaddleOCRVL with existing Doctra architecture
  • Robust handling of both dictionary and string-formatted PaddleOCRVL output
  • HTML table parsing with BeautifulSoup fallback to regex-based parsing
  • Chart content parsing from pipe-delimited format to structured tables
  • Progress bars for both notebook and terminal environments
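
As a rough illustration of the chart content parsing mentioned above, the sketch below turns pipe-delimited text into a list of rows. The exact format PaddleOCRVL emits is not shown in this PR, so the input shape (one row per line, cells separated by `|`, optional Markdown-style separator rows) is an assumption, and `parse_pipe_delimited` is a hypothetical helper rather than a function from the parser.

```python
from typing import List


def parse_pipe_delimited(text: str) -> List[List[str]]:
    """Convert pipe-delimited text into rows of stripped cell strings."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Drop optional leading/trailing pipes, then split into cells.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip Markdown-style separator rows such as |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows
```

The first returned row can then be treated as the header when writing the chart to `tables.xlsx` or `tables.html`.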

📚 Documentation

  • Added comprehensive user guide for PaddleOCRVLPDFParser
  • Updated README with parser overview, features, and usage examples
  • Added API reference documentation
  • Updated core concepts guide with parser description
  • Added usage examples in basic-usage.md
  • Updated documentation navigation and index

📦 Dependencies

  • PaddleOCR with doc-parser support (paddleocr[doc-parser]>=3.2.0)
  • PaddlePaddle GPU (paddlepaddle-gpu>=3.2.1) for CUDA 12.6
  • Platform-specific safetensors wheels:
    • Linux: safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
    • Windows: safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl
  • Updated installation documentation with platform-specific instructions
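
To make the platform split concrete, here is a small sketch that picks the wheel file name for the current platform. The file names come from the list above; the `safetensors_wheel_for` helper and the mapping keyed on `sys.platform` prefixes are assumptions for illustration, and the actual `setup.py` may express this with platform markers instead.

```python
import sys
from typing import Optional

# Wheel file names as listed in this PR, keyed by sys.platform prefix.
SAFETENSORS_WHEELS = {
    "linux": "safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl",
    "win32": "safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl",
}


def safetensors_wheel_for(platform: Optional[str] = None) -> Optional[str]:
    """Return the matching wheel file name, or None for unlisted platforms."""
    platform = sys.platform if platform is None else platform
    for prefix, wheel in SAFETENSORS_WHEELS.items():
        if platform.startswith(prefix):
            return wheel
    return None
```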

🔄 Migration Notes

  • No breaking changes: New parser is additive
  • Existing parsers remain unchanged
  • PaddleOCR dependencies are optional (parser raises ImportError if not available)
  • Users can opt-in by installing PaddleOCR dependencies
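
The optional-dependency behavior described above is commonly implemented by deferring the import and re-raising with an install hint. The sketch below shows that general pattern; `require_optional` is a hypothetical helper and not necessarily how the parser does it.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise ImportError with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this parser but is not installed. "
            f"Install it with: {install_hint}"
        ) from exc
```

A parser constructor could call, for example, `require_optional("paddleocr", "pip install 'paddleocr[doc-parser]>=3.2.0'")`, so that users who never touch the new parser are unaffected by the missing package.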

📝 Files Changed

Python Code:

  • doctra/parsers/paddleocr_vl_parser.py - Main parser implementation
  • doctra/parsers/__init__.py - Export new parser
  • doctra/__init__.py - Package-level export

Documentation:

  • README.md - Added parser documentation and examples
  • docs/user-guide/parsers/paddleocr-vl-parser.md - Comprehensive user guide
  • docs/api/parsers.md - API reference
  • docs/index.md - Updated parser table
  • docs/user-guide/core-concepts.md - Added parser description
  • docs/examples/basic-usage.md - Added usage example
  • mkdocs.yml - Updated navigation

Installation:

  • requirements.txt - Added PaddleOCR and safetensors dependencies
  • setup.py - Updated with PaddleOCR dependencies
  • pyproject.toml - Updated with PaddleOCR dependencies
  • docs/getting-started/installation.md - Added installation instructions

✅ Checklist

  • Tests pass (parser implementation complete)
  • Code formatted and linted
  • Documentation updated
  • Type hints included
  • Docstrings follow project format
  • No breaking changes
  • Dependencies properly declared

…ging

- Implement PaddleOCRVLPDFParser class for end-to-end document parsing
- Integrate DocRes image restoration support
- Add split table detection and merging capabilities
- Support chart recognition and conversion to structured tables
- Generate Markdown, HTML, and Excel outputs
- Handle multiple content types (headers, text, tables, charts, footnotes)
- Export parser in doctra.parsers and doctra package __init__
- Add PaddleOCRVLPDFParser to core components section
- Include feature overview and key capabilities
- Add basic and advanced usage examples
- Document output structure and content types
- Update features section with PaddleOCRVL capabilities
- Add usage example in examples section
- Add paddlepaddle-gpu and paddleocr[doc-parser] to requirements.txt
- Add platform-specific safetensors wheels (Linux/Windows)
- Update setup.py with PaddleOCR dependencies and platform markers
- Update pyproject.toml with PaddleOCR dependencies
- Add installation instructions for PaddleOCR in installation.md
- Include platform-specific safetensors installation notes
- Create user guide for PaddleOCRVLPDFParser
- Add API reference documentation
- Update core concepts with parser description
- Add usage examples in basic-usage.md
- Update documentation index and navigation
- Include configuration options and parameter reference
- Document chart recognition and split table merging features
AdemBoukhris457 self-assigned this on Nov 15, 2025
AdemBoukhris457 added the documentation and enhancement labels on Nov 15, 2025
AdemBoukhris457 merged commit e28566b into main on Nov 15, 2025
1 check passed
AdemBoukhris457 deleted the feature/paddleocr_vl_parser branch on November 15, 2025 14:28
AdemBoukhris457 restored the feature/paddleocr_vl_parser branch on November 15, 2025 16:03
AdemBoukhris457 deleted the feature/paddleocr_vl_parser branch on November 15, 2025 16:03
AdemBoukhris457 added a commit that referenced this pull request Nov 15, 2025
Features:
- Add PaddleOCRVL PDF parser with restoration and split table merging (#82)
- Add split table merging support to EnhancedPDFParser (#80)
- Add split table merging support to ChartTablePDFParser (#81)