Table cell parsing issue: Docling misaligns/mis-parses cells in PDF tables

I am using Docling to parse tables in PDF documents. For the table shown below, it does not detect the cell boundaries correctly: the text of several rows ends up in a single cell.

These are my pipeline options:

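    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Only table structure recognition is enabled; everything else is switched off
    pipeline_options = PdfPipelineOptions(
        do_ocr=False,
        do_table_structure=True,
        do_picture_classification=False,
        do_picture_description=False,
        do_code_enrichment=False,
        do_formula_enrichment=False,
    )

    # Table-specific settings
    pipeline_options.table_structure_options.do_cell_matching = False  # set with borderless tables in mind
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # use accurate mode

    # Use the faster V2 backend
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV2DocumentBackend,
            )
        }
    )
    result = converter.convert(pdf_path)  # pdf_path: path to the input PDF
    doc = result.document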

Here is the table snippet:

<img width="813" height="396" alt="Image" src="https://github.com/user-attachments/assets/db5513ba-e437-4f7d-9475-573a7d6b2cd7" />

These are the parsed table cells:
[TableCell(bbox=BoundingBox(l=71.999, t=216.486, r=119.452, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=0, end_col_offset_idx=1, text='FORM NUMBER', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=198.009, t=216.486, r=245.451, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=1, end_col_offset_idx=2, text='BENEFIT NUMBER', column_header=True, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=252.006, t=229.20399999999995, r=327.403, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=2, end_col_offset_idx=3, text='DESCRIPTION', column_header=True, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=71.965, t=254.64, r=185.788, b=417.715, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=0, end_col_offset_idx=1, text='C11911DBG C11912DBG C11914DBG (Rev C11920(Rev 8/16)DBG C11923DBG C11932DBG C11933(Rev 8/16)DBG C11935DBG (Rev C36048DBG C36159(Rev 8/16)DBG C36161(Rev 8/16)DBG C36182DBG C36286DBG', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=197.998, t=254.64, r=215.187, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=1, end_col_offset_idx=2, text='B-4 B-5 B-7', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=251.962, t=254.64, r=475.937, b=392.279, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=2, end_col_offset_idx=3, text='Bereavement and Trauma Counseling Benefit Carjacking Benefit (Dollar Amount) Coma Benefit Emergency Evacuation with Family Travel Home Alteration and Vehicle Modification Rehabilitation Benefit Repatriation of Remains Seat Belt and Air Bag Benefit Security Evacuation Benefit Out of Country Medical Expense Benefit Attendor Benefit', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=163.784, t=280.076, r=188.929, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=2, end_row_offset_idx=3, start_col_offset_idx=0, end_col_offset_idx=1, text='7/12)', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=197.987, t=292.794, r=221.281, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=1, end_col_offset_idx=2, text='B-13', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=460.041, t=292.794, r=494.419, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=2, end_col_offset_idx=3, text='Benefit', column_header=False, row_header=False, row_section=False, fillable=False), 

TableCell(bbox=BoundingBox(l=197.987, t=305.512, r=221.281, b=315.971, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=4, end_row_offset_idx=5, start_col_offset_idx=1, end_col_offset_idx=2, text='B-16', column_header=False, row_header=False, row_section=False, fillable=False), 
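To make the dump above easier to read, here is a minimal sketch that lays the cells out as a row/column occupancy grid. It does not call Docling; the `(row, col, text)` tuples are copied by hand from `start_row_offset_idx`, `start_col_offset_idx` and `text` above (long texts abbreviated), and `occupancy` is a hypothetical helper name, not a Docling function.

```python
from collections import defaultdict

# (start_row_offset_idx, start_col_offset_idx, text) copied from the dump above.
# Long cell texts are abbreviated here; only the indices matter for this check.
cells = [
    (0, 0, "FORM NUMBER"), (0, 1, "BENEFIT NUMBER"), (0, 2, "DESCRIPTION"),
    (1, 0, "C11911DBG ..."), (1, 1, "B-4 B-5 B-7"), (1, 2, "Bereavement ..."),
    (2, 0, "7/12)"),
    (3, 1, "B-13"), (3, 2, "Benefit"),
    (4, 1, "B-16"),
]


def occupancy(cells, num_cols=3):
    """Map each row index to a list of booleans, one per column (True = cell present)."""
    filled = defaultdict(set)
    for row, col, _text in cells:
        filled[row].add(col)
    return {row: [col in filled[row] for col in range(num_cols)] for row in sorted(filled)}


for row, cols in occupancy(cells).items():
    # "x" marks a parsed cell, "." marks a grid position with no cell
    print(row, "".join("x" if present else "." for present in cols))
```

Rows 0 and 1 are fully populated, while rows 2 to 4 each contain only one or two stray fragments, which is consistent with most of the body text having been absorbed into row 1.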

Docling version: 2.54
Python: 3.11
OS: Windows

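Note: it looks like the text of several separate rows is being concatenated into a single cell (row 1 of each column) instead of being split into one cell per row, while fragments such as '7/12)' and 'Benefit' are emitted as separate cells in later rows. The first header cell, 'FORM NUMBER', is also reported with column_header=False, unlike the other two headers. This makes structured data extraction unreliable for downstream processing.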
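One way to check for this automatically is to look for cells that sit in the same column but in different rows and still overlap vertically. The sketch below is only an illustration under that assumption: it uses a small stand-in `Cell` dataclass rather than Docling's `TableCell`, with `t`/`b` values copied from the output above (TOPLEFT origin, so top is smaller than bottom), and `same_column_overlaps` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Stand-in for a parsed table cell; not Docling's TableCell class."""
    row: int
    col: int
    top: float
    bottom: float
    text: str


# Body cells copied from the dump above (long texts abbreviated).
cells = [
    Cell(1, 0, 254.64, 417.715, "C11911DBG ..."),
    Cell(2, 0, 280.076, 290.534, "7/12)"),
    Cell(1, 1, 254.64, 290.534, "B-4 B-5 B-7"),
    Cell(3, 1, 292.794, 303.253, "B-13"),
    Cell(4, 1, 305.512, 315.971, "B-16"),
    Cell(1, 2, 254.64, 392.279, "Bereavement ..."),
    Cell(3, 2, 292.794, 303.253, "Benefit"),
]


def same_column_overlaps(cells):
    """Return pairs of cells in the same column, different rows, whose vertical extents intersect."""
    pairs = []
    for i, a in enumerate(cells):
        for b in cells[i + 1:]:
            same_col_other_row = a.col == b.col and a.row != b.row
            intersects = a.top < b.bottom and b.top < a.bottom
            if same_col_other_row and intersects:
                pairs.append((a, b))
    return pairs


for a, b in same_column_overlaps(cells):
    print(f"col {a.col}: row {a.row} {a.text!r} overlaps row {b.row} {b.text!r}")
```

On the values above this flags two pairs: the '7/12)' fragment lies inside the tall row-1 cell of column 0, and the 'Benefit' fragment lies inside the tall row-1 cell of column 2. In a correctly segmented grid without row spans, no such pair should exist.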
