Contributor

@glypt glypt commented Nov 5, 2025

Many real xlsx documents have a 1x1 cell holding the table's title, separated from the table by an empty row. This PR detects that cell and adds its string as a caption to the table. On complex documents, the current parsing increases the number of tables by a huge amount.

This also adds a test covering this situation: xlsx_05_table_with_title.xlsx.

This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soonish.
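To make the idea concrete, here is a minimal sketch of the heuristic as described above. It is not the code from this PR and does not use Docling's API: the `Region` class and `attach_titles` function are hypothetical names, and the rule "attach to the nearest table that starts below the cell" is an assumption based on the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    """Bounding box of contiguous non-empty cells (0-based, inclusive). Hypothetical type."""
    min_row: int
    max_row: int
    min_col: int
    max_col: int
    text: str = ""                 # only meaningful for single-cell regions
    caption: Optional[str] = None  # filled in when a title is attached

    def is_single_cell(self) -> bool:
        return self.min_row == self.max_row and self.min_col == self.max_col


def attach_titles(regions: list[Region]) -> list[Region]:
    """Attach 1x1 cells as captions to the nearest table below them.

    Assumption: a 1x1 region is never a table on its own. Single cells
    with no table underneath are kept unchanged in the result.
    """
    tables = [r for r in regions if not r.is_single_cell()]
    singles = [r for r in regions if r.is_single_cell()]
    result = list(tables)
    for title in singles:
        # Only tables that start strictly below the cell are candidates.
        below = [t for t in tables if t.min_row > title.max_row]
        target = min(below, key=lambda t: t.min_row, default=None)
        if target is not None and target.caption is None:
            target.caption = title.text
        else:
            result.append(title)  # no match: leave the cell as it was
    return result
```

With a title in row 0 and a table spanning rows 2 to 5, the function returns a single region carrying the caption instead of two separate tables.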


mergify bot commented Nov 5, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
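For contributors who want to check a title locally before pushing, the pattern above can be tried with Python's `re` module. This is only a sketch: it assumes the rule uses Python-compatible regex semantics, and the function name and sample titles are made up for illustration.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


# Hypothetical titles, not the actual title of this PR.
print(is_conventional("fix(xlsx): treat single-cell titles as captions"))
print(is_conventional("Fix title parsing"))
```

The type prefix is case-sensitive and must be followed by a colon, optionally preceded by a scope in parentheses and a `!` marker.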

Contributor

github-actions bot commented Nov 5, 2025

DCO Check Passed

Thanks @glypt, all your commits are properly signed off. 🎉

@glypt glypt force-pushed the fix_title_parsing_xls branch 2 times, most recently from dd0eb51 to e60ef92 Compare November 6, 2025 06:42
@glypt glypt marked this pull request as ready for review November 6, 2025 06:42

dosubot bot commented Nov 6, 2025

Related Documentation

Checked 3 published document(s) in 1 knowledge base(s). No updates required.

@ceberam ceberam added the xlsx label (issue related to xlsx backend) Nov 6, 2025

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Contributor

ceberam commented Nov 7, 2025

Thanks @glypt for your contributions to Docling

> Many real xlsx documents have a 1x1 cell holding the table's title, separated from the table by an empty row.

Do you have any evidence that this pattern is common in real xlsx documents, or that it is a best practice from the Open XML standard?
I'm just trying to understand whether the rule you are proposing is specific to some use cases or generic enough to apply by default to any xlsx converted by Docling. I am also wondering about the benefit of having a caption attached to the table compared to the confusion that it could potentially introduce.

You also mention a distance of an empty row but your rule would apply to any number of rows and columns between the text cell and the first table underneath. What happens if there is a text cell just two rows below the table? Wouldn't it be a candidate for a caption?

> On complex documents, the current parsing increases the number of tables by a huge amount.

Could you share an example of such a document and explain why applying this heuristic for caption detection would significantly reduce the number of tables?

> This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soonish.

Thanks for spotting this bug!

Contributor Author

glypt commented Nov 7, 2025

> Thanks @glypt for your contributions to Docling
>
> > Many real xlsx documents have a 1x1 cell holding the table's title, separated from the table by an empty row.
>
> Do you have any evidence that this pattern is common in real xlsx documents, or that it is a best practice from the Open XML standard? I'm just trying to understand whether the rule you are proposing is specific to some use cases or generic enough to apply by default to any xlsx converted by Docling. I am also wondering about the benefit of having a caption attached to the table compared to the confusion that it could potentially introduce.

The only evidence I have comes from the confidential dataset I work on. In general, I don't think a 1x1 cell can be a table; for me a table is at minimum 2x1 or 1x2. I don't have a strong opinion on whether to add it as a caption or simply drop that cell; this is up to you. Having it as a caption just adds more context when quoting documents later on.

> You also mention a distance of an empty row but your rule would apply to any number of rows and columns between the text cell and the first table underneath. What happens if there is a text cell just two rows below the table? Wouldn't it be a candidate for a caption?

I don't have an upper bound on the distance between the title and the table. It seems very unlikely that a 1x1 cell would sit many rows above a table, but I can definitely add an upper bound on the number of rows if necessary. However, a title below a table is not considered, thanks to this check: https://github.com/docling-project/docling/pull/2589/files#diff-c69c6a7b1b56f6322d3f35c63f88728ac7028fd333e6cbeb8d293883fdea58d3R439
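As an illustration of the two conditions discussed here (title strictly above the table, plus an optional cap on the gap), a check could look like the sketch below. The function name `is_title_candidate` and the `max_gap` parameter are hypothetical and are not taken from the PR diff.

```python
def is_title_candidate(title_row: int, table_first_row: int, max_gap: int = 1) -> bool:
    """Decide whether a 1x1 cell may serve as the title of a table.

    Rows are 0-based. `max_gap` is the assumed maximum number of empty
    rows allowed between the cell and the first row of the table.
    """
    if title_row >= table_first_row:
        # A cell level with or below the table's first row is never a title.
        return False
    empty_rows_between = table_first_row - title_row - 1
    return empty_rows_between <= max_gap


print(is_title_candidate(0, 2))   # one empty row between title and table
print(is_title_candidate(7, 2))   # cell sits below the start of the table
```

Raising `max_gap` loosens the rule, and passing a very large value reproduces the unbounded behaviour described in the reply.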

> > On complex documents, the current parsing increases the number of tables by a huge amount.
>
> Could you share an example of such a document and explain why applying this heuristic for caption detection would significantly reduce the number of tables?

The documents I'm working with are confidential, but I can recreate a toy example with random numbers. I will do that in the next few days :)

> > This revealed a potential bug in docling-core: the item index is not incremented here. I have the corresponding fix in docling-core and will open a PR soonish.
>
> Thanks for spotting this bug!

Contributor Author

glypt commented Nov 7, 2025

I'm pushing the formatting fix now and waiting for your feedback on the rest :)

@glypt glypt force-pushed the fix_title_parsing_xls branch from e60ef92 to 947963a Compare November 7, 2025 12:24
Contributor Author

glypt commented Nov 7, 2025

test.xlsx

Here is one example document. I modified all the data, and the three tables here are copy-pasted (even though they differ in the original document).
