Table cell parsing issue: Docling misaligns/mis-parses cells in PDF tables #167

@anmolgandhi007

I am using Docling to parse tables in PDF documents, but it does not correctly detect and split the table cells.

These are my pipeline options:

    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(
        do_ocr=False,
        do_table_structure=True,
        do_picture_classification=False,
        do_picture_description=False,
        do_code_enrichment=False,
        do_formula_enrichment=False,
    )

    # Add table-specific optimizations
    pipeline_options.table_structure_options.do_cell_matching = False  # Critical for borderless tables
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # Use accurate mode

    # Use the faster V2 backend (10x faster PDF loading)
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV2DocumentBackend,  # Latest backend
            )
        }
    )
    result = converter.convert(pdf_path)
    doc = result.document

Here is the table snippet:

[Image]

This is the parsed output (the table cells):
[TableCell(bbox=BoundingBox(l=71.999, t=216.486, r=119.452, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=0, end_col_offset_idx=1, text='FORM NUMBER', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=198.009, t=216.486, r=245.451, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=1, end_col_offset_idx=2, text='BENEFIT NUMBER', column_header=True, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=252.006, t=229.20399999999995, r=327.403, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=2, end_col_offset_idx=3, text='DESCRIPTION', column_header=True, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=71.965, t=254.64, r=185.788, b=417.715, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=0, end_col_offset_idx=1, text='C11911DBG C11912DBG C11914DBG (Rev C11920(Rev 8/16)DBG C11923DBG C11932DBG C11933(Rev 8/16)DBG C11935DBG (Rev C36048DBG C36159(Rev 8/16)DBG C36161(Rev 8/16)DBG C36182DBG C36286DBG', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=197.998, t=254.64, r=215.187, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=1, end_col_offset_idx=2, text='B-4 B-5 B-7', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=251.962, t=254.64, r=475.937, b=392.279, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=2, end_col_offset_idx=3, text='Bereavement and Trauma Counseling Benefit Carjacking Benefit (Dollar Amount) Coma Benefit Emergency Evacuation with Family Travel Home Alteration and Vehicle Modification Rehabilitation Benefit Repatriation of Remains Seat Belt and Air Bag Benefit Security Evacuation Benefit Out of Country Medical Expense Benefit Attendor Benefit', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=163.784, t=280.076, r=188.929, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=2, end_row_offset_idx=3, start_col_offset_idx=0, end_col_offset_idx=1, text='7/12)', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=197.987, t=292.794, r=221.281, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=1, end_col_offset_idx=2, text='B-13', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=460.041, t=292.794, r=494.419, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=2, end_col_offset_idx=3, text='Benefit', column_header=False, row_header=False, row_section=False, fillable=False),

TableCell(bbox=BoundingBox(l=197.987, t=305.512, r=221.281, b=315.971, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=4, end_row_offset_idx=5, start_col_offset_idx=1, end_col_offset_idx=2, text='B-16', column_header=False, row_header=False, row_section=False, fillable=False),
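The bounding boxes in the dump above already hint at how many text lines ended up in each cell. A minimal sketch, assuming a single-line cell is about 10.458 units tall (the `'7/12)'` cell) and consecutive lines are about 12.718 units apart (the `'B-13'` and `'B-16'` cells); both values are read off this particular dump, and `estimate_text_lines` is a hypothetical helper, not part of Docling:

```python
# Hypothetical helper, not a Docling API. Both constants are taken from
# the cell dump in this issue and will differ for other documents.
LINE_HEIGHT = 10.458  # single-line cell height, e.g. '7/12)': 290.534 - 280.076
LINE_PITCH = 12.718   # spacing between lines, e.g. 'B-13' to 'B-16': 305.512 - 292.794


def estimate_text_lines(top, bottom, line_height=LINE_HEIGHT, line_pitch=LINE_PITCH):
    """Estimate how many text lines fit in a cell from its bbox height."""
    height = bottom - top
    return max(1, round((height - line_height) / line_pitch) + 1)


# (label, t, b) values copied from the TableCell dump (TOPLEFT origin).
cells = [
    ("form numbers, row 1 col 0", 254.64, 417.715),
    ("benefit numbers, row 1 col 1", 254.64, 290.534),
    ("descriptions, row 1 col 2", 254.64, 392.279),
    ("'7/12)', row 2 col 0", 280.076, 290.534),
]

for label, top, bottom in cells:
    print(f"{label}: ~{estimate_text_lines(top, bottom)} line(s)")
```

Under these assumptions the three cells in row 1 come out at roughly 13, 3 and 11 lines, so each of them appears to hold the text of many visual rows rather than one.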

Docling version - 2.54
Python - 3.11
OS - Windows

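Note - It looks like the text of several rows is being concatenated into a single cell instead of being split into separate cells (for example, all form numbers end up in the cell at row 1, column 0). This makes structured data extraction unreliable for downstream processing.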
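One way to make the misalignment visible is to lay the reported row and column offsets out on a grid and list the positions that have no cell. A rough sketch, assuming every cell has `row_span=1` and `col_span=1` as in the dump above (which may itself be cut off); `empty_positions` is a hypothetical helper and the tuples are copied by hand from the output, with long texts shortened:

```python
# Hypothetical helper, not a Docling API. Spans are ignored because every
# cell in the dump has row_span=1 and col_span=1.
def empty_positions(cells):
    """Return (row, col) grid positions that no parsed cell occupies."""
    occupied = {(row, col) for row, col, _ in cells}
    n_rows = max(row for row, _ in occupied) + 1
    n_cols = max(col for _, col in occupied) + 1
    return [
        (row, col)
        for row in range(n_rows)
        for col in range(n_cols)
        if (row, col) not in occupied
    ]


# (start_row_offset_idx, start_col_offset_idx, text) from the dump.
parsed_cells = [
    (0, 0, "FORM NUMBER"),
    (0, 1, "BENEFIT NUMBER"),
    (0, 2, "DESCRIPTION"),
    (1, 0, "C11911DBG C11912DBG ..."),
    (1, 1, "B-4 B-5 B-7"),
    (1, 2, "Bereavement and Trauma Counseling Benefit ..."),
    (2, 0, "7/12)"),
    (3, 1, "B-13"),
    (3, 2, "Benefit"),
    (4, 1, "B-16"),
]

print(empty_positions(parsed_cells))
```

For the cells shown here, rows 2 to 4 are only partially filled, while row 1 carries most of the table's text.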
