Table extraction: how to keep the columns empty? #201

aborruso · 2024-11-02T08:18:24Z

Hi,
I have this sample PDF.

If I try to extract content from it via the CLI or via Python code, the column with no values on page 2 is removed, making it impossible to merge the two tables.
Is there no way to output the empty columns as well?

Thank you

maxmnemonic · 2024-11-02T09:20:06Z

Collaborator

Hello @aborruso, I see. Yes, indeed, we have a post-processing step in our table model that removes fully empty columns and rows, because it is not always obvious whether a column is intentionally empty (it depends on the table styling). In your case it is clearly intentional.
Currently there is no direct way to control this from Docling, but we can add a parameter in a future version.

Thanks for suggesting this!
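
To make the behavior described in this reply concrete, here is a minimal sketch of what a "drop fully empty rows and columns" post-processing step does to a table grid. This is not Docling's actual code: the function name `drop_empty_rows_and_columns` and the list-of-lists table representation are assumptions made purely for illustration.

```python
def drop_empty_rows_and_columns(grid):
    """Remove rows and columns whose cells are all empty or whitespace.

    Illustrative only: `grid` is a hypothetical list of rows, each a list
    of cell strings. It is not a Docling data structure.
    """
    def is_empty(cell):
        return cell is None or str(cell).strip() == ""

    # Drop rows in which every cell is empty.
    rows = [row for row in grid if not all(is_empty(c) for c in row)]
    if not rows:
        return []

    # Keep only the column indices that hold at least one non-empty cell.
    width = max(len(row) for row in rows)
    keep = [
        j for j in range(width)
        if any(j < len(row) and not is_empty(row[j]) for row in rows)
    ]
    return [[row[j] if j < len(row) else "" for j in keep] for row in rows]


# Continuation of a table on a second page, where the middle column
# happens to be blank for every row on that page.
page2 = [
    ["3", "", "30"],
    ["4", "", "40"],
]
print(drop_empty_rows_and_columns(page2))  # [['3', '30'], ['4', '40']]
```

The result has two columns instead of three, which is why the page 2 table no longer lines up with the page 1 table.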

4 replies

aborruso Nov 2, 2024
Author

Currently there is no direct way to control it from Docling straight, but we can add a parameter for the future version.

Should I open a feature request? This seems very important to me.

maxmnemonic Nov 2, 2024
Collaborator

Consider this a request, I will make an issue :)

aborruso Nov 2, 2024
Author

Thank you very much!

ColeDrain Feb 14, 2025

Hello, I seem to have hit this same issue too, lol. Is there any update or workaround for this, please?

maxmnemonic · 2024-11-02T09:34:18Z

Collaborator

Created an issue, we will look into it very soon: #204

2 replies

aborruso Dec 9, 2024
Author

Hi @maxmnemonic, when do you plan to implement this? It is just a wish; I don't want to pressure you. Thank you very much.

0sax Feb 15, 2025

Hi @maxmnemonic, I'd like to ask the same. Thanks for taking this on!

aborruso · 2024-11-02T11:13:19Z

Author

Hi @maxmnemonic, I'm adding a note here related to the same PDF: the last rows on page 1 are not extracted well, and an extra row is produced with incorrectly arranged cells (see below).

Is it useful to create a new issue about this?

Thank you

5 replies

maxmnemonic Nov 2, 2024
Collaborator

@aborruso I see, can you please try this option:
TableFormerMode.ACCURATE
as described here: https://ds4sd.github.io/docling/usage/#control-pdf-table-extraction-options

Let me know if it helps.
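
For readers following along in Python, the option mentioned above can be set roughly as follows. This is a sketch based on the Docling usage documentation linked in this reply; the import paths are assumed to match Docling 2.x and may differ in other versions. The wrapper function `build_accurate_converter` and the file name `sample.pdf` are placeholders introduced here, not part of Docling.

```python
def build_accurate_converter():
    """Return a DocumentConverter configured for the accurate TableFormer mode."""
    # Imported inside the function so the snippet can be loaded even where
    # Docling is not installed; import paths assume Docling 2.x.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Enable table structure recognition and select the more accurate model.
    pipeline_options = PdfPipelineOptions(do_table_structure=True)
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (requires Docling; "sample.pdf" is a placeholder path):
# converter = build_accurate_converter()
# result = converter.convert("sample.pdf")
# print(result.document.export_to_markdown())
```

Note that this setting only selects the table model. It does not change the post-processing step that removes fully empty columns and rows.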

aborruso Nov 2, 2024
Author

TableFormerMode.ACCURATE
as described here: ds4sd.github.io/docling/usage#control-pdf-table-extraction-options

I will try it, thank you. Can you make this option available in the CLI as well?

dolfim-ibm Nov 2, 2024
Maintainer

We have a PR since this morning 😉
#203

@maxmnemonic I tried already and it didn't help.

aborruso Nov 4, 2024
Author

@maxmnemonic @dolfim-ibm I have tried with the 2.4 release and it works using accurate. GREAT!!

What I am still missing is the option to keep empty columns (and rows), which is the topic of this discussion.
I know there is an issue on this and I am happy about it.

Thank you very much

ColeDrain Feb 17, 2025

Hello @dolfim-ibm, just checking in, is there any update on keeping empty columns/rows?

AkshuChahar · 2025-06-07T14:28:46Z

Hi @maxmnemonic @dolfim-ibm ,

Just wanted to check if there are any updates on how to retain empty columns during text extraction from PDF files. I am facing issues similar to the ones mentioned here. Your reply on this will be greatly appreciated.

5 replies

dolfim-ibm Jun 8, 2025
Maintainer

There have been quite a few updates to the table handling, e.g. a new training/fine-tuning of the TableFormer model. I would suggest you try again with the accurate model.

aborruso Jun 8, 2025
Author

there have been quite some updates in the tables, e.g. also a new training/fine-tuning of the tableformer model. I would suggest you try again with the accurate model.

I have tried with output.pdf:

docling --no-ocr --table-mode accurate output.pdf

And it still deletes the empty column.

AkshuChahar Jun 8, 2025

@dolfim-ibm It is the same for me. If it's not too much trouble, could you share an example where the empty columns are retained? Thanks for all the support.

maxmnemonic Jun 10, 2025
Collaborator

Right, @AkshuChahar, @aborruso, indeed the model has this property: it deletes fully empty columns or rows in a post-processing step, as we assume they might be a prediction anomaly.
I need to see if we can expose a parameter that would allow controlling this behavior.

AkshuChahar Jun 10, 2025

Thanks for the response, @maxmnemonic. We really need this for one of our projects. Please let us know if there are any workarounds or if you are planning to push this update in the near future.
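
No workaround is given in this thread. As a stopgap in user code, one option is to re-insert the dropped column after extraction, provided you already know the full table width and the position of the missing column (for example, by inspecting the page 1 table). The sketch below assumes exactly that; `restore_columns` is a hypothetical helper, not a Docling function, and it cannot detect by itself which column was removed.

```python
def restore_columns(rows, expected_width, missing_positions, fill=""):
    """Re-insert empty cells at known 0-based column positions.

    Hypothetical helper: `rows` is a list of rows (lists of cell strings)
    exported from the extracted table, `expected_width` is the column count
    of the full table, and `missing_positions` are the indices that were
    dropped. These must be supplied by the caller.
    """
    missing = set(missing_positions)
    if any(not 0 <= p < expected_width for p in missing):
        raise ValueError("missing_positions must lie within the expected width")

    restored = []
    for row in rows:
        if len(row) + len(missing) != expected_width:
            raise ValueError(f"row {row!r} does not match the expected width")
        cells = iter(row)
        # Walk the full-width layout, consuming an extracted cell wherever
        # the column was not dropped and a filler wherever it was.
        restored.append(
            [fill if j in missing else next(cells) for j in range(expected_width)]
        )
    return restored


# Page 2 rows as extracted (middle column removed), restored to three columns.
page2 = [["3", "30"], ["4", "40"]]
print(restore_columns(page2, expected_width=3, missing_positions=[1]))
# [['3', '', '30'], ['4', '', '40']]
```

After this step the page 2 rows have the same width as the page 1 rows and can be concatenated. The approach is fragile if different pages lose different columns, so the positions should be checked per table.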
