feat: Add split table merging support to EnhancedPDFParser by AdemBoukhris457 · Pull Request #80 · AdemBoukhris457/Doctra

AdemBoukhris457 · 2025-11-15T09:10:53Z

Summary

This PR adds split table merging functionality to EnhancedPDFParser, enabling automatic detection and merging of tables that span across multiple pages. This feature was previously only available in StructuredPDFParser.

Changes

Code Changes

doctra/parsers/enhanced_pdf_parser.py:
- Added merge_split_tables parameter and related configuration options to __init__
- Implemented split table detection logic in _process_parsing_logic
- Added logic to skip individual table segments that are part of merged tables
- Implemented processing and saving of merged table images
- Added notification message when split tables are detected
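
The segment-skipping step can be illustrated with a minimal sketch. This is not the code from enhanced_pdf_parser.py: the function name standalone_tables and the (page, index) keys are hypothetical, and the sketch only assumes that each detected table segment can be identified by a hashable key.

```python
def standalone_tables(tables, merged_groups):
    """Return only the table segments that are not part of any merged table.

    tables: list of hashable keys, here hypothetical (page, index) tuples.
    merged_groups: list of lists of keys, one inner list per merged table.
    """
    # Collect every segment key that already belongs to a merged table.
    merged_keys = {key for group in merged_groups for key in group}
    # Keep the original order and drop segments covered by a merge.
    return [key for key in tables if key not in merged_keys]


# Four detected segments; the last table on page 0 continues on page 1.
detected = [(0, 0), (0, 1), (1, 0), (2, 0)]
merged = [[(0, 1), (1, 0)]]
remaining = standalone_tables(detected, merged)
```

Under these assumptions, the two segments that form the merged table are saved once as a composite, and only the remaining segments are processed individually.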

Documentation Changes

docs/user-guide/parsers/enhanced-parser.md:
- Added split table merging to key features
- Added comprehensive section documenting the feature
- Included configuration examples and parameter details
- Added guidance on when to use the feature

docs/api/parsers.md:
- Updated EnhancedPDFParser quick reference with split table merging parameters
- Added note that feature is available for both parsers

Features

- Automatic detection of tables split across page boundaries using a two-phase approach:
  - Phase 1: Proximity detection (spatial heuristics)
  - Phase 2: Structural validation (LSD-based column alignment)
- Merged tables are saved as single composite images
- Full VLM support for merged table extraction
- Configurable thresholds and confidence scores
- Comprehensive error handling
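
The two-phase idea can be sketched as follows. This is a simplified illustration, not Doctra's implementation: TableBox, is_split_candidate, columns_align, and all threshold values are hypothetical. Phase 2 here compares column x-positions that are assumed to be already known, whereas the PR describes LSD-based line detection to find them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableBox:
    """Hypothetical table detection: page index plus a box in page coordinates."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    page_height: float


def horizontal_overlap(a: TableBox, b: TableBox) -> float:
    """Shared width divided by the wider box's width, between 0.0 and 1.0."""
    shared = min(a.x1, b.x1) - max(a.x0, b.x0)
    widest = max(a.x1 - a.x0, b.x1 - b.x0)
    return max(shared, 0.0) / widest if widest > 0 else 0.0


def is_split_candidate(upper: TableBox, lower: TableBox,
                       edge_zone: float = 0.25,
                       min_overlap: float = 0.8) -> bool:
    """Phase 1 stand-in: spatial heuristics on consecutive pages."""
    if lower.page != upper.page + 1:
        return False
    # The upper segment must end in the bottom zone of its page.
    ends_near_bottom = upper.y1 >= (1 - edge_zone) * upper.page_height
    # The lower segment must start in the top zone of the next page.
    starts_near_top = lower.y0 <= edge_zone * lower.page_height
    return (ends_near_bottom and starts_near_top
            and horizontal_overlap(upper, lower) >= min_overlap)


def columns_align(cols_a, cols_b, tolerance: float = 10.0) -> bool:
    """Phase 2 stand-in: same column count at roughly the same x-positions."""
    if len(cols_a) != len(cols_b):
        return False
    return all(abs(a - b) <= tolerance
               for a, b in zip(sorted(cols_a), sorted(cols_b)))


upper = TableBox(page=0, x0=50, y0=600, x1=550, y1=780, page_height=800)
lower = TableBox(page=1, x0=52, y0=40, x1=548, y1=300, page_height=800)
should_merge = (is_split_candidate(upper, lower)
                and columns_align([50, 200, 400], [53, 198, 405]))
```

Running the cheap spatial check first means the more expensive structural check is only needed for pairs that already look like a continuation.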

Configuration

The feature can be enabled by passing merge_split_tables=True when creating an EnhancedPDFParser instance:

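# Import path assumed from the module location doctra/parsers/enhanced_pdf_parser.py
from doctra.parsers.enhanced_pdf_parser import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True,
    merge_split_tables=True,  # detect and merge tables split across pages
)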

- Add merge_split_tables parameter and related configuration options
- Implement split table detection and merging logic in _process_parsing_logic
- Skip individual table segments that are part of merged tables
- Process and save merged table images with VLM support
- Add notification message when split tables are detected

- Add split table merging to key features list
- Document configuration options and parameters
- Explain how the two-phase detection algorithm works
- Describe output format and file locations
- Add guidance on when to use split table merging
- Link to comprehensive split table merging guide

Features:
- Add PaddleOCRVL PDF parser with restoration and split table merging (#82)
- Add split table merging support to EnhancedPDFParser (#80)
- Add split table merging support to ChartTablePDFParser (#81)

AdemBoukhris457 added 3 commits November 15, 2025 10:07

docs: update API reference with split table merging for EnhancedPDFParser

c26dab4

- Add split table merging parameters to EnhancedPDFParser quick reference
- Include all configuration options in code example
- Add note that split table merging is available for both parsers

AdemBoukhris457 self-assigned this Nov 15, 2025

AdemBoukhris457 added the documentation and enhancement labels Nov 15, 2025

AdemBoukhris457 merged commit 87fad48 into main Nov 15, 2025
1 check passed

AdemBoukhris457 deleted the feature/enhanced_parser_split_table_merging branch November 15, 2025 09:11

AdemBoukhris457 mentioned this pull request Nov 15, 2025

release: prepare v0.9.0 - Split Table Merging and PaddleOCRVL Parser #83

Merged
