feat: Add split table merging support to EnhancedPDFParser #80

Merged
AdemBoukhris457 merged 3 commits into main from feature/enhanced_parser_split_table_merging on Nov 15, 2025

@AdemBoukhris457 (Owner)

Summary

This PR adds split table merging to EnhancedPDFParser, enabling automatic detection and merging of tables that span multiple pages. This feature was previously available only in StructuredPDFParser.

Changes

Code Changes

  • doctra/parsers/enhanced_pdf_parser.py:
    • Added merge_split_tables parameter and related configuration options to __init__
    • Implemented split table detection logic in _process_parsing_logic
    • Added logic to skip individual table segments that are part of merged tables
    • Implemented processing and saving of merged table images
    • Added notification message when split tables are detected
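
The segment-skipping step listed above can be illustrated with a minimal sketch. It is not the code from `_process_parsing_logic`: the `(page, index)` key, the function names, and the output labels are all hypothetical and chosen only to show the idea that a segment already covered by a merged table is not emitted a second time.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical key for one table segment: (page number, table index on that page).
SegmentKey = Tuple[int, int]


def collect_merged_keys(merged_groups: Iterable[List[SegmentKey]]) -> Set[SegmentKey]:
    """Flatten all merge groups into a set of segment keys for fast lookup."""
    return {key for group in merged_groups for key in group}


def plan_outputs(segments: List[SegmentKey],
                 merged_groups: List[List[SegmentKey]]) -> List[str]:
    """Return output labels: standalone tables first, then one label per merged table.

    The label format is illustrative only, not the parser's real file naming.
    """
    merged = collect_merged_keys(merged_groups)
    outputs: List[str] = []
    for page, index in segments:
        if (page, index) in merged:
            continue  # covered by a merged table, so skip the individual segment
        outputs.append(f"table_p{page}_{index}")
    for group in merged_groups:
        first_page, last_page = group[0][0], group[-1][0]
        outputs.append(f"merged_table_p{first_page}-p{last_page}")
    return outputs
```

Building the lookup set once keeps the skip check constant-time per segment, which matters little for a single document but keeps the loop simple to reason about.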

Documentation Changes

  • docs/user-guide/parsers/enhanced-parser.md:

    • Added split table merging to key features
    • Added comprehensive section documenting the feature
    • Included configuration examples and parameter details
    • Added guidance on when to use the feature
  • docs/api/parsers.md:

    • Updated EnhancedPDFParser quick reference with split table merging parameters
    • Added note that feature is available for both parsers

Features

  • Automatic detection of tables split across page boundaries using a two-phase approach:
    • Phase 1: Proximity detection (spatial heuristics)
    • Phase 2: Structural validation (LSD-based column alignment)
  • Merged tables are saved as single composite images
  • Full VLM support for merged table extraction
  • Configurable thresholds and confidence scores
  • Comprehensive error handling
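
As a rough illustration of the two-phase approach, the sketch below pairs a spatial proximity check with a column-alignment score. It makes several assumptions that are not taken from this PR: the `TableSegment` container, the normalized coordinates, every threshold value, and the idea that column positions have already been extracted (the PR describes an LSD-based step for that, which is not reproduced here).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableSegment:
    """Hypothetical container for one detected table region (normalized 0..1 coordinates)."""
    page: int   # 1-based page number
    x0: float   # left edge
    x1: float   # right edge
    y0: float   # top edge
    y1: float   # bottom edge
    column_xs: List[float] = field(default_factory=list)  # x positions of column rules


def is_proximate(top: TableSegment, bottom: TableSegment,
                 edge_margin: float = 0.2, min_overlap: float = 0.8) -> bool:
    """Phase 1: `top` ends near the foot of its page, `bottom` starts near the
    head of the next page, and the two regions overlap horizontally."""
    if bottom.page != top.page + 1:
        return False
    if top.y1 < 1.0 - edge_margin or bottom.y0 > edge_margin:
        return False
    overlap = min(top.x1, bottom.x1) - max(top.x0, bottom.x0)
    narrower = min(top.x1 - top.x0, bottom.x1 - bottom.x0)
    return narrower > 0 and overlap / narrower >= min_overlap


def column_alignment(top: TableSegment, bottom: TableSegment,
                     tolerance: float = 0.02) -> float:
    """Phase 2: fraction of column rules in `top` that have a match in `bottom`."""
    if not top.column_xs or not bottom.column_xs:
        return 0.0
    matched = sum(
        any(abs(x - y) <= tolerance for y in bottom.column_xs)
        for x in top.column_xs
    )
    return matched / max(len(top.column_xs), len(bottom.column_xs))


def should_merge(top: TableSegment, bottom: TableSegment,
                 min_confidence: float = 0.65) -> bool:
    """Merge only when both phases agree; the threshold is an illustrative default."""
    return is_proximate(top, bottom) and column_alignment(top, bottom) >= min_confidence
```

Running the cheap proximity check first means the structural comparison only runs on candidate pairs that already look plausible.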

Configuration

The feature can be enabled by passing merge_split_tables=True when creating an EnhancedPDFParser instance:

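# Import path assumed from the file location doctra/parsers/enhanced_pdf_parser.py
from doctra.parsers.enhanced_pdf_parser import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True,
    merge_split_tables=True,  # detect and merge tables split across pages
)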

- Add merge_split_tables parameter and related configuration options
- Implement split table detection and merging logic in _process_parsing_logic
- Skip individual table segments that are part of merged tables
- Process and save merged table images with VLM support
- Add notification message when split tables are detected
- Add split table merging to key features list
- Document configuration options and parameters
- Explain how the two-phase detection algorithm works
- Describe output format and file locations
- Add guidance on when to use split table merging
- Link to comprehensive split table merging guide
…rser

- Add split table merging parameters to EnhancedPDFParser quick reference
- Include all configuration options in code example
- Add note that split table merging is available for both parsers
@AdemBoukhris457 AdemBoukhris457 self-assigned this Nov 15, 2025
@AdemBoukhris457 AdemBoukhris457 added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Nov 15, 2025
@AdemBoukhris457 AdemBoukhris457 merged commit 87fad48 into main Nov 15, 2025
1 check passed
@AdemBoukhris457 AdemBoukhris457 deleted the feature/enhanced_parser_split_table_merging branch November 15, 2025 09:11
AdemBoukhris457 added a commit that referenced this pull request Nov 15, 2025
Features:
- Add PaddleOCRVL PDF parser with restoration and split table merging (#82)
- Add split table merging support to EnhancedPDFParser (#80)
- Add split table merging support to ChartTablePDFParser (#81)