How to extract pdf page text line by line? #3552
Replies: 2 comments
-
This is not a bug, but there is a way to get correct results. Please continue in the Discussions tab.
-
We are in the process of providing a new line extraction method. As you correctly write, the current approach does not solve the problem when text has been stored in some crazy sequence. The new method will work a lot better and, for instance, will also extract text correctly for multi-column pages.
The new line extraction method is already available in our new package pymupdf4llm, as follows:
import pymupdf
from pymupdf4llm.helpers.get_text_lines import get_text_lines
doc = pymupdf.open("UMNwriteup.pdf")
page = doc[0]
text = get_text_lines(page)
print(text)
This will produce the following output:
So in this example, the intended page layout is undetectable: 2 columns? 2 column table? The There also is a parameter
In addition to
-
I am trying to extract PDF text line by line.
I have tried the following options on this file:
UMNwriteup.pdf
Option 1
page.get_text('text').split("\n")
but that breaks some lines into chunks (because the spacing between words in one sentence is too large, so a newline character is inserted).
Option 2
page.get_text('blocks')
That is closer to what I'm looking for, but some chunks (multi-line sentences) are intelligently grouped together.
Option 3
This results in output similar to option 2.
So how do I extract text line by line, without any chunking / blocks behind the scenes?
If I could prevent newline characters from being inserted between two words that are separated by blank space (even though they are at the same bbox height), that would solve this for me.
Hi @JorjMcKie Thanks for any help.
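The idea in the question (treat words at the same bbox height as one line, however wide the gap between them) can be sketched roughly as below. This is only an illustration, not the method used by pymupdf4llm. It assumes word tuples in the shape returned by page.get_text("words"), that is (x0, y0, x1, y1, text, block_no, line_no, word_no). The function name group_words_into_lines and the y_tolerance parameter are made up for this example.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Join words that sit at (nearly) the same height into one text line.

    `words` is assumed to be a list of tuples shaped like the items of
    page.get_text("words"): (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `y_tolerance` (hypothetical parameter) is the largest difference between
    vertical midpoints, in points, for two words to count as the same line.
    """
    lines = []  # each entry: (reference_y_midpoint, [word tuples])
    # Visit words top to bottom, then left to right.
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        y_mid = (w[1] + w[3]) / 2
        for ref_y, members in lines:
            if abs(ref_y - y_mid) <= y_tolerance:
                members.append(w)  # same height: extend the existing line
                break
        else:
            lines.append((y_mid, [w]))  # no match: start a new line

    result = []
    for _, members in sorted(lines, key=lambda item: item[0]):
        members.sort(key=lambda w: w[0])  # left to right within the line
        result.append(" ".join(w[4] for w in members))
    return result


# Synthetic sample data: "Hello" and "world" are far apart horizontally
# but share the same height, so they end up on one line.
sample_words = [
    (72, 100, 110, 112, "Hello", 0, 0, 0),
    (300, 100.5, 340, 112.5, "world", 1, 0, 0),
    (72, 120, 100, 132, "Next", 0, 1, 0),
    (105, 120, 130, 132, "line", 0, 1, 1),
]
print(group_words_into_lines(sample_words))  # → ['Hello world', 'Next line']

# With a real page object this would be used as:
#   words = page.get_text("words")
#   lines = group_words_into_lines(words)
```

Note that this sketch joins everything at the same height, so on a multi-column page it would merge text from neighbouring columns into one line. The tolerance value would likely need tuning per document.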