Description
I am using Docling to parse tables in PDF documents, but it does not detect and parse the table cells correctly.
These are my pipeline options -
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
    do_picture_classification=False,
    do_picture_description=False,
    do_code_enrichment=False,
    do_formula_enrichment=False,
)
# Add table-specific optimizations
pipeline_options.table_structure_options.do_cell_matching = False  # Critical for borderless tables
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # Use accurate mode
# Use the faster V2 backend (10x faster PDF loading)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # Latest backend
        )
    }
)
# pdf_path is the path to the input PDF (defined elsewhere)
result = converter.convert(pdf_path)
doc = result.document
Here is the table snippet -
This is the parsed output for the table cells -
[TableCell(bbox=BoundingBox(l=71.999, t=216.486, r=119.452, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=0, end_col_offset_idx=1, text='FORM NUMBER', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=198.009, t=216.486, r=245.451, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=1, end_col_offset_idx=2, text='BENEFIT NUMBER', column_header=True, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=252.006, t=229.20399999999995, r=327.403, b=239.66200000000003, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=0, end_row_offset_idx=1, start_col_offset_idx=2, end_col_offset_idx=3, text='DESCRIPTION', column_header=True, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=71.965, t=254.64, r=185.788, b=417.715, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=0, end_col_offset_idx=1, text='C11911DBG C11912DBG C11914DBG (Rev C11920(Rev 8/16)DBG C11923DBG C11932DBG C11933(Rev 8/16)DBG C11935DBG (Rev C36048DBG C36159(Rev 8/16)DBG C36161(Rev 8/16)DBG C36182DBG C36286DBG', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=197.998, t=254.64, r=215.187, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=1, end_col_offset_idx=2, text='B-4 B-5 B-7', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=251.962, t=254.64, r=475.937, b=392.279, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=1, end_row_offset_idx=2, start_col_offset_idx=2, end_col_offset_idx=3, text='Bereavement and Trauma Counseling Benefit Carjacking Benefit (Dollar Amount) Coma Benefit Emergency Evacuation with Family Travel Home Alteration and Vehicle Modification Rehabilitation Benefit Repatriation of Remains Seat Belt and Air Bag Benefit Security Evacuation Benefit Out of Country Medical Expense Benefit Attendor Benefit', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=163.784, t=280.076, r=188.929, b=290.534, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=2, end_row_offset_idx=3, start_col_offset_idx=0, end_col_offset_idx=1, text='7/12)', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=197.987, t=292.794, r=221.281, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=1, end_col_offset_idx=2, text='B-13', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=460.041, t=292.794, r=494.419, b=303.253, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=3, end_row_offset_idx=4, start_col_offset_idx=2, end_col_offset_idx=3, text='Benefit', column_header=False, row_header=False, row_section=False, fillable=False),
TableCell(bbox=BoundingBox(l=197.987, t=305.512, r=221.281, b=315.971, coord_origin=<CoordOrigin.TOPLEFT: 'TOPLEFT'>), row_span=1, col_span=1, start_row_offset_idx=4, end_row_offset_idx=5, start_col_offset_idx=1, end_col_offset_idx=2, text='B-16', column_header=False, row_header=False, row_section=False, fillable=False),
Docling version - 2.54
Python - 3.11
OS - Windows
Note - It looks like the text of several rows is being concatenated into a single cell instead of being split into separate cells. This makes structured data extraction unreliable for downstream processing.
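A rough way to flag the affected cells from the output above is to compare each cell's bounding-box height with the height of a single text line. The sketch below is only a diagnostic, not a fix. It does not call Docling: `Cell` is a hypothetical stand-in that copies the `t`/`b` coordinates and row/column offsets from a few of the parsed cells, and `LINE_HEIGHT` and `LINE_PITCH` are assumptions read off the dump (the height of the 'DESCRIPTION' cell, and the vertical distance between the 'B-13' and 'B-16' cells).

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a parsed table cell (top-left origin)."""
    text: str
    t: float  # top of the bounding box
    b: float  # bottom of the bounding box
    row: int  # start_row_offset_idx
    col: int  # start_col_offset_idx


# Assumed values, taken from the parsed output in this report.
LINE_HEIGHT = 239.662 - 229.204  # height of the one-line 'DESCRIPTION' cell
LINE_PITCH = 305.512 - 292.794   # distance between the 'B-13' and 'B-16' rows


def estimated_lines(cell: Cell) -> int:
    """Estimate how many text lines fit inside the cell's bounding box."""
    extra = max(cell.b - cell.t - LINE_HEIGHT, 0.0)
    return 1 + round(extra / LINE_PITCH)


# A subset of the cells from the dump, with their text shortened.
cells = [
    Cell("FORM NUMBER", 216.486, 239.662, row=0, col=0),
    Cell("C11911DBG C11912DBG ...", 254.64, 417.715, row=1, col=0),
    Cell("B-4 B-5 B-7", 254.64, 290.534, row=1, col=1),
    Cell("Bereavement and Trauma ...", 254.64, 392.279, row=1, col=2),
    Cell("7/12)", 280.076, 290.534, row=2, col=0),
]

# Cells whose box is taller than one text line: (row, col, estimated lines).
suspect = [
    (c.row, c.col, estimated_lines(c))
    for c in cells
    if estimated_lines(c) > 1
]
print(suspect)
```

With these numbers, the three cells in row 1 are estimated at 13, 3 and 11 lines, which matches the number of entries concatenated in each cell's text. The 'FORM NUMBER' header is also flagged because it spans two lines, so a flagged cell is a hint to inspect, not proof of a parsing error.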