fix: recognize when a table is actually its title in a xlsx document #2589

glypt · 2025-11-05T18:28:53Z

Many real xlsx documents have a 1x1 cell with the title of the table, separated from the table by an empty row. This PR detects that cell and adds the string inside it as a caption to the table. On complex documents, the current parsing increases the number of tables by a huge amount, since each such title cell is parsed as a table of its own.
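A minimal sketch of the idea described above, not the code in this PR: `Region` and `pair_titles_with_tables` are hypothetical names, and the column-overlap condition is an assumption made for the illustration. A single-cell region that sits above a multi-cell region is turned into that region's caption instead of being emitted as its own table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """Hypothetical bounding box of a contiguous block of cells (0-based, inclusive)."""

    min_row: int
    max_row: int
    min_col: int
    max_col: int
    text: Optional[str] = None  # only meaningful for single-cell regions

    @property
    def is_single_cell(self) -> bool:
        return self.min_row == self.max_row and self.min_col == self.max_col


def pair_titles_with_tables(
    regions: list[Region],
) -> list[tuple[Region, Optional[str]]]:
    """Return (region, caption) pairs; a 1x1 region above a table becomes its caption."""
    singles = [r for r in regions if r.is_single_cell]
    tables = [r for r in regions if not r.is_single_cell]
    captions: dict[int, str] = {}  # index into `tables` -> caption text
    used_titles: set[int] = set()

    for title in singles:
        # Assumption: a candidate table starts strictly below the title,
        # covers the title's column, and has no caption yet.
        below = [
            (t.min_row, i)
            for i, t in enumerate(tables)
            if t.min_row > title.max_row
            and t.min_col <= title.min_col <= t.max_col
            and i not in captions
        ]
        if below:
            _, idx = min(below)  # nearest table underneath
            captions[idx] = title.text or ""
            used_titles.add(id(title))

    result = [(t, captions.get(i)) for i, t in enumerate(tables)]
    # Single cells with no table underneath are left as caption-less regions.
    result += [(s, None) for s in singles if id(s) not in used_titles]
    return result
```

With a title in the first row and a table starting two rows below, the function returns one table carrying the title as its caption rather than two separate tables.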

This also adds an extra test with the special situation: xlsx_05_table_with_title.xlsx.

This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soon.

mergify · 2025-11-05T18:29:28Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
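For contributors who want to check a title locally before pushing, the rule above can be sketched in Python. This assumes the `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper name, not part of Mergify or Docling.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the PR title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

The title of this PR passes because it starts with `fix:`; a title such as `Fix title parsing` would not, since the type must be lowercase and followed by a colon.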

github-actions · 2025-11-05T19:02:39Z

✅ DCO Check Passed

Thanks @glypt, all your commits are properly signed off. 🎉

dosubot · 2025-11-06T06:42:56Z

Related Documentation

Checked 3 published document(s) in 1 knowledge base(s). No updates required.


codecov · 2025-11-06T10:06:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


ceberam · 2025-11-07T09:48:45Z

Thanks @glypt for your contributions to Docling

> Many real xlsx documents have a 1x1 cell with the title of the table, separated from the table by an empty row.

Do you have any evidence that this pattern is common in real xlsx documents, or that it is a best practice from the Open XML standard?
I'm just trying to understand whether the rule you are proposing is specific to some use cases or generic enough to apply by default to any xlsx converted by Docling. I am also wondering about the benefit of having a caption attached to the table compared to the confusion that it could potentially introduce.

You also mention a distance of an empty row but your rule would apply to any number of rows and columns between the text cell and the first table underneath. What happens if there is a text cell just two rows below the table? Wouldn't it be a candidate for a caption?

> On complex documents, the current parsing increases the number of tables by a huge amount.

Could you share an example of such a document and explain why applying this heuristic for caption detection would significantly reduce the number of tables?

> This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soon.

Thanks for spotting this bug!

glypt · 2025-11-07T12:13:57Z

> Thanks @glypt for your contributions to Docling
>
> > Many real xlsx documents have a 1x1 cell with the title of the table, separated from the table by an empty row.
>
> Do you have any evidence that this pattern is common in real xlsx documents, or that it is a best practice from the Open XML standard? I'm just trying to understand whether the rule you are proposing is specific to some use cases or generic enough to apply by default to any xlsx converted by Docling. I am also wondering about the benefit of having a caption attached to the table compared to the confusion that it could potentially introduce.

The only evidence I have comes from the confidential dataset I work on. In general, I think a 1x1 cell can't be a table; for me a table is at minimum 2x1 or 1x2. I don't have a strong opinion on adding it as a caption versus simply dropping that cell, so this is up to you. Having it as a caption just adds more context when quoting documents later on.

> You also mention a distance of an empty row but your rule would apply to any number of rows and columns between the text cell and the first table underneath. What happens if there is a text cell just two rows below the table? Wouldn't it be a candidate for a caption?

I don't have an upper bound on the distance between the title and the table. I guess it's very unlikely that a 1x1 cell sits many rows above a table, but I can add an upper bound on the number of rows if necessary. However, a title below a table is not considered, thanks to this check: https://github.com/docling-project/docling/pull/2589/files#diff-c69c6a7b1b56f6322d3f35c63f88728ac7028fd333e6cbeb8d293883fdea58d3R439
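To make the two conditions discussed here concrete, a minimal sketch of a distance check follows. It is an illustration only, not the check linked above: `is_caption_candidate` and `max_empty_rows` are hypothetical names, and the default of one empty row is an assumption taken from the PR description.

```python
def is_caption_candidate(
    title_row: int, table_first_row: int, max_empty_rows: int = 1
) -> bool:
    """Decide whether a 1x1 cell may serve as the caption of a table.

    Rows are 0-based. The cell must sit above the table's first row, with at
    most `max_empty_rows` empty rows in between (hypothetical upper bound).
    """
    if title_row >= table_first_row:
        # A cell at or below the table's first row is never a caption.
        return False
    empty_rows_between = table_first_row - title_row - 1
    return empty_rows_between <= max_empty_rows
```

With the default bound, a title in row 0 and a table starting in row 2 qualifies, while a table starting in row 5 does not unless `max_empty_rows` is raised to 4 or more.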

> > On complex documents, the current parsing increases the number of tables by a huge amount.
>
> Could you share an example of such a document and explain why applying this heuristic for caption detection would significantly reduce the number of tables?

The documents I'm working with are confidential, but I can recreate a toy example with random numbers. I will do that in the next few days :)

> > This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soon.
>
> Thanks for spotting this bug!

Signed-off-by: glypt <[email protected]>

glypt · 2025-11-07T12:24:01Z

I'm pushing the formatting fix now, and I'm waiting for your feedback on the rest :)

glypt · 2025-11-07T12:38:00Z

test.xlsx

Here is one example document. I modified all the data, and the three tables here are copy-pasted (they are different in the original document).

glypt force-pushed the fix_title_parsing_xls branch 2 times, most recently from dd0eb51 to e60ef92 Compare November 6, 2025 06:42

glypt marked this pull request as ready for review November 6, 2025 06:42

ceberam added the xlsx issue related to xlsx backend label Nov 6, 2025

PeterStaar-IBM requested review from ceberam and maxmnemonic November 7, 2025 04:29

fix: recognize when a table is actually its title in a xlsx document

947963a

Signed-off-by: glypt <[email protected]>

glypt force-pushed the fix_title_parsing_xls branch from e60ef92 to 947963a Compare November 7, 2025 12:24

glypt mentioned this pull request Nov 7, 2025

fix: item numbering in document docling-project/docling-core#416

Open
