-
Notifications
You must be signed in to change notification settings - Fork 3.1k
fix: recognize when a table is actually its title in a xlsx document #2589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
✅ DCO Check Passed Thanks @glypt, all your commits are properly signed off. 🎉 |
dd0eb51 to
e60ef92
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Thanks @glypt for your contributions to Docling
Do you have any evidence that this pattern is common in real xlsx documents or if this is a best practice from the Open XML standard? You also mention a distance of an empty row but your rule would apply to any number of rows and columns between the text cell and the first table underneath. What happens if there is a text cell just two rows below the table? Wouldn't it be a candidate for a caption?
Could you share an example of such document and explain why applying this heuristic for caption detection would significantly reduce the number of tables?
Thanks for spotting this bug! |
I guess the evidence I have is with the confidential dataset I work on. I think in general a 1x1 cell couldn't be a table, for me a table is mininum 2x1 or 1x2. I don't have a strong opinion on adding it as a caption or simply dropping that cell, this is up to you. Having it as a caption is just adding more context when quoting documents later on.
I don't have a upper boundary for the distance between the title and the table, I can add one if necessary but it's very unlikely that a 1x1 cell is so many lines above a table I guess, I can definitely add a upper boundary to the number of rows if necessary. However a title below a table is not considered, thanks to this check https://github.com/docling-project/docling/pull/2589/files#diff-c69c6a7b1b56f6322d3f35c63f88728ac7028fd333e6cbeb8d293883fdea58d3R439
The documents I'm treating are confidential, but I can recreate a toy example with random numbers, will do that in the next days :)
|
Signed-off-by: glypt <[email protected]>
|
I'm pushing the fix for the formatting now, I'm waiting for your feedback for the rest :) |
e60ef92 to
947963a
Compare
|
Here is one example of document, I modified all the data, and the three tables here are copy pasted (even though different on the original document) |
Many real xlsx documents have a 1x1 cell with a title of the table separated with an empty row from the table. This PR detects it and add the string inside as a caption to the table. On complex documents, the current parsing increase the number of tables by a huge amount.
This also adds an extra test with the special situation:
xlsx_05_table_with_title.xlsx.This revealed a potential bug in docling-core, the item index is not incremented here, I have the corresponding fix in docling-core and I will open a PR soonish.