Description
Bug
For a reason I cannot determine, the attached file causes parent/child relationships to be mapped incorrectly in the Docling Document object, which makes the markdown export (and every other export) incorrect. In the Document JSON, the table is represented correctly in the "tables" array, but following the refs to the second cell of the header (texts/3) shows that it has a child (texts/4), which is the header of the section that follows the table. Everything after that point is consolidated into that cell on export.
{
  "self_ref": "#/texts/2",
  "parent": {
    "$ref": "#/groups/1"
  },
  "children": [],
  "content_layer": "body",
  "label": "section_header",
  "prov": [],
  "orig": "aaaaa",
  "text": "aaaaa",
  "level": 1
},
{
  "self_ref": "#/texts/3",
  "parent": {
    "$ref": "#/groups/2"
  },
  "children": [
    {
      "$ref": "#/texts/4"
    }
  ],
  "content_layer": "body",
  "label": "section_header",
  "prov": [],
  "orig": "aaaaaaaaa aaaaaa",
  "text": "aaaaaaaaa aaaaaa",
  "level": 1
},
{
  "self_ref": "#/texts/4",
  "parent": {
    "$ref": "#/texts/3"
  },
  "children": [
    {
      "$ref": "#/groups/3"
    },
    {
      "$ref": "#/texts/8"
    },
    {
      "$ref": "#/tables/1"
    },
    {
      "$ref": "#/texts/11"
    }
  ],
  "content_layer": "body",
  "label": "section_header",
  "prov": [],
  "orig": "aaaaaaaa",
  "text": "aaaaaaaa",
  "level": 1
},
...
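To make the bad nesting easier to spot without reading the whole JSON by hand, here is a minimal sketch that scans the "texts" array for section headers whose parent is another section header. It works on a plain dict shaped like the fragment above; the function name `nested_section_headers` and the trimmed `sample` data are my own, and the check is only a heuristic, since I do not know whether header-under-header nesting is ever legitimate in Docling.

```python
def nested_section_headers(doc):
    """Return (child_ref, parent_ref) pairs where a section header's
    parent is itself a section header in the "texts" array."""
    # Index text items by their own ref so list positions do not matter.
    by_ref = {t["self_ref"]: t for t in doc.get("texts", [])}
    hits = []
    for item in by_ref.values():
        if item.get("label") != "section_header":
            continue
        parent_ref = (item.get("parent") or {}).get("$ref", "")
        parent = by_ref.get(parent_ref)  # None for groups, body, tables
        if parent is not None and parent.get("label") == "section_header":
            hits.append((item["self_ref"], parent_ref))
    return hits


# Hypothetical sample, trimmed from the JSON fragment in this report.
sample = {
    "texts": [
        {"self_ref": "#/texts/2", "parent": {"$ref": "#/groups/1"},
         "label": "section_header", "level": 1},
        {"self_ref": "#/texts/3", "parent": {"$ref": "#/groups/2"},
         "label": "section_header", "level": 1},
        {"self_ref": "#/texts/4", "parent": {"$ref": "#/texts/3"},
         "label": "section_header", "level": 1},
    ]
}

print(nested_section_headers(sample))
```

On the real file, the same function could presumably be fed the dict form of the converted document (I believe `result.document.export_to_dict()` provides this, but treat that as an assumption). For the fragment above it flags texts/4 as a child of texts/3, which is the relationship that pulls the following section into the table header cell.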
Steps to reproduce
from docling.document_converter import DocumentConverter

filename = "docling-table-bug.docx"
converter = DocumentConverter()
result = converter.convert(filename)
print(result.document.export_to_markdown())
Observe that the generated markdown is:
## aaaaaaaaaa
[aaaa aaaaaaaa aaa aaaaaaaa aaaaaaaaa aaaaaaaaaa aa aaaa aaaaaaaa aa aaaaaaaa aaaa aaaaaaaa. Aaa aaa aaaaaa aaaa aaaa aaa aaaaaa aaaaa aa aaaaaaaaaaa. aa aaa aaaa aaaaaaaaa aaaa aaa aaa aaaaaaa aa aaaa aaaa.]
| ## aaaaa | ## aaaaaaaaa aaaaaa ## aaaaaaaa aaaaa aa *aaaaaaaaaa aaa aaaaaaaa aaaaaaaa* (aaa aaaaaaaaaa) aaa aaa aaaaa aaaaaa aaaaa aaa aaaaa aaaaaaaaaa aa aaaa aaaaaaa aa aaa aaaaa aaaaa. [aaaaaaa aaaaaaaaaaa aaa aaaaaaaaaaaaa aaa aaaaa aaaa aaaaaa aaaa aaaa aaaaaaaaaaa aaa aaaaaaaa aaaaaaaa, aaaaaa aaa aaaaa aa aaa aaaaaa. aaaa aaaaaaaaaaa aaa aaaaaaaaaaaaa aa aaaaaaaaaaaa aaaaa.] | ## aaaa [/](https://file+.vscode-resource.vscode-cdn.net/) Aaaaaaa | ## aaaaaaaaaa | |---------------------|-----------------------------| | aaa | aaaaaa aaa aaaaaaaaaaa aaaa | | aaa | aaaaaa aaaaa aaaaaaaa | |
|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| aaaaaaaa aaaaaaaaaaaaa aaaaaaaaaa aa | 67547856 |
| aaaaaaaa aaaaaaaaaaaaa aa | 68975669 |
| aaaaaaaa aaaaaaaaa aaa aaaaaaaaaa aaaaaaaaa | 687956453 |
| aaaaaaaaa aaaaaaaaaaa aaaaaaaaa | 896756454687 |
The export puts the entire last half of the document into the header cell of the table.
Docling version
docling --version
2025-11-20 15:50:56,527 - INFO - Loading plugin 'docling_defaults'
2025-11-20 15:50:56,529 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
Docling version: 2.62.0
Docling Core version: 2.51.1
Docling IBM Models version: 3.10.2
Docling Parse version: 4.7.0
Python: cpython-312 (3.12.9)
Platform: macOS-15.7.2-arm64-arm-64bit
Python version
python --version
Python 3.12.9