feat: Add PaddleOCRVL PDF parser with restoration and split table merging by AdemBoukhris457 · Pull Request #82 · AdemBoukhris457/Doctra

AdemBoukhris457 · 2025-11-15T14:28:45Z

✨ New Features

PaddleOCRVL PDF Parser: Added new PaddleOCRVLPDFParser class for end-to-end document parsing
- Uses PaddleOCRVL Vision-Language Model for comprehensive document understanding
- Single-pass extraction of all content types (headers, text, tables, charts, footnotes, figure titles)
- Automatic chart recognition and conversion to structured table format
- Support for pipe-delimited chart content parsing
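
A minimal sketch of how pipe-delimited chart content could be turned into table rows. The exact format PaddleOCRVL emits is not shown in this PR, so the layout assumed here (one row per line, cells separated by `|`, optional Markdown-style separator rows) and the function name `parse_pipe_table` are illustrative assumptions, not the parser's actual code.

```python
def parse_pipe_table(text: str) -> list[list[str]]:
    """Split pipe-delimited text into rows of stripped cell strings.

    Assumed format (hypothetical): one row per line, cells separated by '|'.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop a single optional leading and trailing pipe.
        if line.startswith("|"):
            line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("|")]
        # Skip Markdown-style separator rows such as ---|--- or :--|--:
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


chart_text = "Year | Sales\n2023 | 10\n2024 | 12"
header, *data = parse_pipe_table(chart_text)
```

The first row is treated as the header here; whether the real output always carries a header row is not stated in the PR.
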
Integrated Document Restoration: Optional DocRes image restoration support
- Configurable restoration tasks (appearance, dewarping, deshadowing, deblurring, binarization, end2end)
- GPU acceleration support for restoration processing
- Enhanced page images saved to enhanced_pages/ directory
Split Table Merging: Automatic detection and merging of tables split across pages
- Two-phase detection algorithm (proximity detection + structural validation)
- Configurable thresholds for position, gap, and alignment tolerance
- Merged tables processed with PaddleOCRVL for complete extraction
- Visual merge results saved to tables/ directory
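
The two-phase check can be sketched roughly as follows. Everything in this block is a hypothetical illustration: the `TableSegment` type, the normalized page coordinates, the column-count comparison, and the default threshold values are assumptions, not the thresholds or logic used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSegment:
    """A detected table region; coordinates are fractions of the page size (0.0 to 1.0)."""
    page: int     # zero-based page index
    x0: float     # left edge
    y0: float     # top edge
    x1: float     # right edge
    y1: float     # bottom edge
    n_cols: int   # number of detected columns


def is_split_candidate(upper, lower, bottom_threshold=0.8, top_threshold=0.2, max_gap=0.3):
    """Phase 1 (proximity): upper table ends low on page N, lower table starts high on page N+1."""
    if lower.page != upper.page + 1:
        return False
    ends_low = upper.y1 >= bottom_threshold
    starts_high = lower.y0 <= top_threshold
    # Gap = blank space below the upper table plus blank space above the lower one.
    gap = (1.0 - upper.y1) + lower.y0
    return ends_low and starts_high and gap <= max_gap


def is_structurally_compatible(upper, lower, align_tol=0.05):
    """Phase 2 (structure): same column count and horizontally aligned edges."""
    same_columns = upper.n_cols == lower.n_cols
    aligned = abs(upper.x0 - lower.x0) <= align_tol and abs(upper.x1 - lower.x1) <= align_tol
    return same_columns and aligned


def should_merge(upper, lower):
    """Merge only when both phases agree."""
    return is_split_candidate(upper, lower) and is_structurally_compatible(upper, lower)
```

Running the cheap proximity test first means the structural comparison is only evaluated for pairs that sit at a page boundary.
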
Structured Output Generation:
- Markdown output with all extracted content
- HTML output with formatted tables and charts
- Excel file (tables.xlsx) containing all tables and charts as structured data
- HTML file (tables.html) with structured tables and charts

🔧 Technical Improvements

Added platform-specific safetensors dependency handling (Linux/Windows)
Integrated PaddleOCRVL with existing Doctra architecture
Robust handling of both dictionary and string-formatted PaddleOCRVL output
HTML table parsing with BeautifulSoup, falling back to regex-based parsing
Chart content parsing from pipe-delimited format to structured tables
Progress bars for both notebook and terminal environments

📚 Documentation

Added comprehensive user guide for PaddleOCRVLPDFParser
Updated README with parser overview, features, and usage examples
Added API reference documentation
Updated core concepts guide with parser description
Added usage examples in basic-usage.md
Updated documentation navigation and index

📦 Dependencies

PaddleOCR with doc-parser support (paddleocr[doc-parser]>=3.2.0)
PaddlePaddle GPU (paddlepaddle-gpu>=3.2.1) for CUDA 12.6
Platform-specific safetensors wheels:
- Linux: safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
- Windows: safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl
Updated installation documentation with platform-specific instructions
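
Selecting the safetensors wheel by platform could look roughly like this. The wheel filenames are the ones listed above; the helper name `safetensors_wheel_for` and the matching on `sys.platform` prefixes are assumptions for illustration, not the logic in setup.py or pyproject.toml.

```python
import sys

# Wheel filenames as listed in this PR, keyed by sys.platform prefix.
SAFETENSORS_WHEELS = {
    "linux": "safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl",
    "win32": "safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl",
}


def safetensors_wheel_for(platform=None):
    """Return the wheel filename for the given platform, or None if unsupported."""
    platform = sys.platform if platform is None else platform
    for prefix, wheel in SAFETENSORS_WHEELS.items():
        if platform.startswith(prefix):
            return wheel
    return None
```

Returning `None` for other platforms (for example `darwin`) leaves the decision about what to do there to the caller, since the PR only lists Linux and Windows wheels.
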

🔄 Migration Notes

No breaking changes: New parser is additive
Existing parsers remain unchanged
PaddleOCR dependencies are optional (parser raises ImportError if not available)
Users can opt in by installing the PaddleOCR dependencies
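
The optional-dependency behaviour can be sketched as a small guard. The helper name `require_module` and the error message wording are hypothetical; the PR only states that the parser raises ImportError when PaddleOCR is not available.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Raise ImportError with an install hint if a top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for this parser. Install it with: {install_hint}"
        )


def check_paddleocr_available() -> bool:
    """Report whether the optional PaddleOCR dependency can be imported."""
    try:
        require_module("paddleocr", "pip install 'paddleocr[doc-parser]>=3.2.0'")
    except ImportError:
        return False
    return True
```

Checking with `find_spec` avoids importing the heavy dependency just to test for its presence; a parser constructor could call `require_module` and let the ImportError propagate.
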

📝 Files Changed

Python Code:

doctra/parsers/paddleocr_vl_parser.py - Main parser implementation
doctra/parsers/__init__.py - Export new parser
doctra/__init__.py - Package-level export

Documentation:

README.md - Added parser documentation and examples
docs/user-guide/parsers/paddleocr-vl-parser.md - Comprehensive user guide
docs/api/parsers.md - API reference
docs/index.md - Updated parser table
docs/user-guide/core-concepts.md - Added parser description
docs/examples/basic-usage.md - Added usage example
mkdocs.yml - Updated navigation

Installation:

requirements.txt - Added PaddleOCR and safetensors dependencies
setup.py - Updated with PaddleOCR dependencies
pyproject.toml - Updated with PaddleOCR dependencies
docs/getting-started/installation.md - Added installation instructions

✅ Checklist

…ging
- Implement PaddleOCRVLPDFParser class for end-to-end document parsing
- Integrate DocRes image restoration support
- Add split table detection and merging capabilities
- Support chart recognition and conversion to structured tables
- Generate Markdown, HTML, and Excel outputs
- Handle multiple content types (headers, text, tables, charts, footnotes)
- Export parser in doctra.parsers and doctra package __init__

- Add PaddleOCRVLPDFParser to core components section
- Include feature overview and key capabilities
- Add basic and advanced usage examples
- Document output structure and content types
- Update features section with PaddleOCRVL capabilities
- Add usage example in examples section

- Add paddlepaddle-gpu and paddleocr[doc-parser] to requirements.txt
- Add platform-specific safetensors wheels (Linux/Windows)
- Update setup.py with PaddleOCR dependencies and platform markers
- Update pyproject.toml with PaddleOCR dependencies
- Add installation instructions for PaddleOCR in installation.md
- Include platform-specific safetensors installation notes

- Create user guide for PaddleOCRVLPDFParser
- Add API reference documentation
- Update core concepts with parser description
- Add usage examples in basic-usage.md
- Update documentation index and navigation
- Include configuration options and parameter reference
- Document chart recognition and split table merging features

Features:
- Add PaddleOCRVL PDF parser with restoration and split table merging (#82)
- Add split table merging support to EnhancedPDFParser (#80)
- Add split table merging support to ChartTablePDFParser (#81)

AdemBoukhris457 added 4 commits November 15, 2025 15:25

AdemBoukhris457 self-assigned this Nov 15, 2025

AdemBoukhris457 added the documentation and enhancement labels Nov 15, 2025

AdemBoukhris457 merged commit e28566b into main Nov 15, 2025
1 check passed

AdemBoukhris457 deleted the feature/paddleocr_vl_parser branch November 15, 2025 14:28

AdemBoukhris457 restored the feature/paddleocr_vl_parser branch November 15, 2025 16:03

AdemBoukhris457 deleted the feature/paddleocr_vl_parser branch November 15, 2025 16:03

AdemBoukhris457 mentioned this pull request Nov 15, 2025

release: prepare v0.9.0 - Split Table Merging and PaddleOCRVL Parser #83

Merged

AdemBoukhris457 merged 4 commits into main from feature/paddleocr_vl_parser
