feat: Add PaddleOCRVL PDF parser with restoration and split table merging #82
Merged
AdemBoukhris457 merged 4 commits into main on Nov 15, 2025
…ging
- Implement PaddleOCRVLPDFParser class for end-to-end document parsing
- Integrate DocRes image restoration support
- Add split table detection and merging capabilities
- Support chart recognition and conversion to structured tables
- Generate Markdown, HTML, and Excel outputs
- Handle multiple content types (headers, text, tables, charts, footnotes)
- Export parser in doctra.parsers and doctra package __init__
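
To make the "split table detection and merging" item more concrete, here is a minimal sketch of one common heuristic: treat a table at the top of page N+1 as a continuation of the last table on page N when the column counts match, and drop a repeated header row. This is an illustration only, not the code in doctra/parsers/paddleocr_vl_parser.py; the `Table` dataclass, `should_merge`, and `merge_split_tables` are hypothetical names, and the actual detection rules in this PR may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Table:
    """Hypothetical container for one detected table fragment."""
    page_start: int          # first page the table appears on (1-based)
    page_end: int            # last page the table appears on
    rows: List[List[str]]    # cell text; the first row may be a header


def should_merge(prev: Table, nxt: Table) -> bool:
    """Assumed heuristic: consecutive pages and identical column count."""
    if not prev.rows or not nxt.rows:
        return False
    consecutive = nxt.page_start == prev.page_end + 1
    same_width = len(prev.rows[0]) == len(nxt.rows[0])
    return consecutive and same_width


def merge_split_tables(tables: List[Table]) -> List[Table]:
    """Merge fragments that look like one table split across pages."""
    merged: List[Table] = []
    # sorted() is stable, so tables on the same page keep their input order.
    for table in sorted(tables, key=lambda t: t.page_start):
        if merged and should_merge(merged[-1], table):
            prev = merged[-1]
            rows = table.rows
            # Drop a header row repeated at the top of the continuation.
            if rows[0] == prev.rows[0]:
                rows = rows[1:]
            merged[-1] = Table(prev.page_start, table.page_end, prev.rows + rows)
        else:
            merged.append(Table(table.page_start, table.page_end, list(table.rows)))
    return merged
```

A production heuristic would likely also look at where each table sits on the page and how its column widths line up. This sketch only compares page numbers and column counts.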

- Add PaddleOCRVLPDFParser to core components section
- Include feature overview and key capabilities
- Add basic and advanced usage examples
- Document output structure and content types
- Update features section with PaddleOCRVL capabilities
- Add usage example in examples section

- Add paddlepaddle-gpu and paddleocr[doc-parser] to requirements.txt
- Add platform-specific safetensors wheels (Linux/Windows)
- Update setup.py with PaddleOCR dependencies and platform markers
- Update pyproject.toml with PaddleOCR dependencies
- Add installation instructions for PaddleOCR in installation.md
- Include platform-specific safetensors installation notes
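
The platform-specific wheel choice can be pictured with a small lookup keyed on Python's `sys.platform`. This is a sketch for illustration, not the logic in this PR's setup.py: the helper name `pick_safetensors_wheel` is hypothetical, and only the two wheel filenames come from the Dependencies section of this PR. Where the wheels are hosted and how the packaging files reference them is not shown here.

```python
import sys

# Wheel filenames as listed in this PR's Dependencies section.
SAFETENSORS_WHEELS = {
    "linux": "safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl",
    "win32": "safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl",
}


def pick_safetensors_wheel(platform: str = sys.platform) -> str:
    """Return the listed wheel filename for a sys.platform value.

    Hypothetical helper; it does not check CPU architecture, although both
    listed wheels target 64-bit x86 only.
    """
    try:
        return SAFETENSORS_WHEELS[platform]
    except KeyError:
        raise RuntimeError(
            f"No prebuilt safetensors wheel listed for platform {platform!r}"
        ) from None
```

Calling `pick_safetensors_wheel()` with no argument uses the running interpreter's platform. Any platform other than Linux or Windows raises, since the PR lists no wheel for it.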

- Create user guide for PaddleOCRVLPDFParser
- Add API reference documentation
- Update core concepts with parser description
- Add usage examples in basic-usage.md
- Update documentation index and navigation
- Include configuration options and parameter reference
- Document chart recognition and split table merging features
✨ New Features
- PaddleOCRVL PDF Parser: Added new PaddleOCRVLPDFParser class for end-to-end document parsing
- Integrated Document Restoration: Optional DocRes image restoration support
  - enhanced_pages/ directory
- Split Table Merging: Automatic detection and merging of tables split across pages
  - tables/ directory
- Structured Output Generation:
  - Excel file (tables.xlsx) containing all tables and charts as structured data
  - HTML file (tables.html) with structured tables and charts
🔧 Technical Improvements
📚 Documentation
📦 Dependencies
- paddleocr[doc-parser]>=3.2.0
- paddlepaddle-gpu>=3.2.1 for CUDA 12.6
- safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
- safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl
🔄 Migration Notes
📝 Files Changed
Python Code:
- doctra/parsers/paddleocr_vl_parser.py - Main parser implementation
- doctra/parsers/__init__.py - Export new parser
- doctra/__init__.py - Package-level export

Documentation:
- README.md - Added parser documentation and examples
- docs/user-guide/parsers/paddleocr-vl-parser.md - Comprehensive user guide
- docs/api/parsers.md - API reference
- docs/index.md - Updated parser table
- docs/user-guide/core-concepts.md - Added parser description
- docs/examples/basic-usage.md - Added usage example
- mkdocs.yml - Updated navigation

Installation:
- requirements.txt - Added PaddleOCR and safetensors dependencies
- setup.py - Updated with PaddleOCR dependencies
- pyproject.toml - Updated with PaddleOCR dependencies
- docs/getting-started/installation.md - Added installation instructions

✅ Checklist