How to extract text from multi-column documents in human reading order #1410
Unanswered
KartikeyaGorantla asked this question in Q&A
Hi,
First of all, amazing work on Docling - I’ve only recently discovered it and I’m really impressed with its capabilities!
I’m currently trying to extract text from PDFs with complex layouts - think newspapers or magazines - which often contain multiple columns, images, and tables. While the OCR and layout detection seem to be working well, I'm struggling to extract the text in human reading order, i.e., reading one column fully before moving to the next.
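To make the target order concrete, here is a minimal sketch of column-first ordering over layout boxes. It is not Docling's API: the `Box` type, the `column_reading_order` helper, and the `min_gap` parameter are hypothetical names for illustration, and it assumes each text block comes with a bounding box whose y coordinate grows downward.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical text block: bounding box with y growing downward."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def column_reading_order(boxes, min_gap=0.0):
    """Group boxes into columns left to right, then read each column top to bottom.

    A new column starts when a box begins to the right of everything
    collected so far in the current column (plus an optional gap).
    """
    columns = []
    col_right = None
    for box in sorted(boxes, key=lambda b: b.x0):
        if col_right is None or box.x0 > col_right + min_gap:
            columns.append([box])          # no horizontal overlap: new column
            col_right = box.x1
        else:
            columns[-1].append(box)        # overlaps the current column
            col_right = max(col_right, box.x1)

    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b.y0))
    return ordered


# Two columns, two blocks each, given in row-major (left-to-right, top-down) order.
page = [
    Box("A1", 0, 0, 100, 50),
    Box("B1", 120, 0, 220, 50),
    Box("A2", 0, 60, 100, 110),
    Box("B2", 120, 60, 220, 110),
]
reading_order = [b.text for b in column_reading_order(page)]
```

With this input, a plain top-to-bottom sort interleaves the columns, while the sketch finishes the left column before starting the right one. A full-width headline or table would merge every column under this simple rule, so real pages need something more careful.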
From past experience, using Tesseract with --psm 1 (automatic page segmentation with OSD) helps solve this issue, as it handles multi-column layouts fairly well. However, I haven’t been able to figure out how to pass this configuration (like setting --psm 1) when using Docling.
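For reference, this is roughly the standalone Tesseract call I mean, sketched as a small Python wrapper around the `tesseract` command-line tool. The helper names `tesseract_command` and `ocr_page` and the file name `page.png` are made up for illustration; the sketch assumes the `tesseract` binary is on `PATH` and that the orientation/script data needed by mode 1 is installed. It shows the plain CLI only, not a Docling setting.

```python
import shutil
import subprocess


def tesseract_command(image_path, psm=1, lang="eng"):
    """Build the argument list: tesseract <image> stdout --psm <n> -l <lang>."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "-l", lang]


def ocr_page(image_path, psm=1, lang="eng"):
    """Run Tesseract on one page image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, psm=psm, lang=lang),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


# Building the command does not require Tesseract to be installed.
cmd = tesseract_command("page.png")
```

Calling `ocr_page("page.png")` would then return the page text with automatic page segmentation and OSD applied.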
Is there a recommended way to do this in Docling, or a workaround that ensures the reading order reflects how a person would naturally read the document?
Any guidance would be much appreciated!
Thanks again for building this great tool.