How to extract text from multi-column documents in human reading order #1410

KartikeyaGorantla · 2025-04-17T09:48:21Z

Hi,

First of all, amazing work on Docling - I’ve only recently discovered it and I’m really impressed with its capabilities!

I’m currently trying to extract text from PDFs with complex layouts - think newspapers or magazines - which often contain multiple columns, images, and tables. While the OCR and layout detection seem to be working well, I'm struggling to extract the text in human reading order, i.e., reading one column fully before moving to the next.
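To make "human reading order" concrete, here is a minimal sketch of the ordering I'm after, written in plain Python and independent of Docling. It assumes each text block comes with a bounding box in top-left-origin coordinates (y grows downward); the `Box` class and `column_reading_order` function are hypothetical names for illustration, not part of any library. It is deliberately naive: a full-width headline or table that spans several columns would merge them into one.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A text block with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def column_reading_order(boxes):
    """Return boxes column by column, each column read top to bottom."""
    columns = []  # each entry: {"right": float, "boxes": [Box, ...]}
    # Sweep left to right; a box starting past the current column's
    # right edge opens a new column.
    for box in sorted(boxes, key=lambda b: b.x0):
        if not columns or box.x0 > columns[-1]["right"]:
            columns.append({"right": box.x1, "boxes": [box]})
        else:
            columns[-1]["boxes"].append(box)
            columns[-1]["right"] = max(columns[-1]["right"], box.x1)

    ordered = []
    for col in columns:
        # Within a column, read from top to bottom.
        ordered.extend(sorted(col["boxes"], key=lambda b: b.y0))
    return ordered


# Two columns, two blocks each, supplied in row-major (wrong) order.
page = [
    Box("L1", 0, 0, 100, 50),
    Box("R1", 120, 0, 220, 50),
    Box("L2", 0, 60, 100, 110),
    Box("R2", 120, 60, 220, 110),
]
reading_order = [b.text for b in column_reading_order(page)]
```

With the sample page above, the left column is emitted in full before the right one, rather than alternating between columns row by row.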

From past experience, using Tesseract with --psm 1 (automatic page segmentation with OSD) helps solve this issue, as it handles multi-column layouts fairly well. However, I haven’t been able to figure out how to pass this configuration (like setting --psm 1) when using Docling.
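For context, this is roughly what that looks like when calling the Tesseract command-line tool directly, outside Docling. It is only a sketch: it assumes the `tesseract` binary is installed and on the PATH, and the helper names and the image path are hypothetical.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=1):
    """Build the Tesseract CLI call; 'stdout' sends the text to standard output."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_page(image_path, psm=1):
    """Run Tesseract on one page image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


# Example (hypothetical file name): text = ocr_page("newspaper_page.png")
```

What I can't find is the equivalent of that `--psm` argument on the Docling side.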

Is there a recommended way to do this in Docling? Or a workaround for ensuring the reading order reflects how a person would naturally read the document?

Any guidance would be much appreciated!

Thanks again for building this great tool.

Replies: 0 comments
