-
Hello everyone, in the following code:

    import fitz

    pdf_path = "path/to/some.pdf"
    doc = fitz.open(pdf_path)
    for page in doc:
        blocks = page.get_text("blocks")
        line_of_words = page.get_text("words")

'blocks' is a set of paragraphs, while 'line_of_words' is a set of more fine-grained split lines. My question is: which algorithm is used to aggregate line_of_words into blocks? Is there any relevant literature, code, or pseudocode for reference? Thank you.
-
This is a typical "Discussions" post, so I will first transfer it there.
-
Have a look at the documentation of class TextPage. Every text extraction and text search in PyMuPDF is based on a TextPage.
A TextPage is created by the underlying C library MuPDF. It works for all document types, not just PDF. MuPDF also contains all the heuristic algorithms that separate a page's text into blocks, lines and text spans.
As this is open source, you are free to study the respective C code.
The text extraction variants inside PyMuPDF each target specific needs in terms of ease of use, desired detail level and performance, but as said, they are all based on a TextPage.
So "blocks" iterates over the blocks inside a TextPage and delivers a list with the concatenated lines of each block. And "words" iterates over the full TextPage hierarchy block -> line -> character to identify strings that contain no spaces. This code is also open source and written in C, but it is contained in PyMuPDF's repository.
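To illustrate the relationship described above, here is a minimal sketch. It does not reimplement MuPDF's block-detection heuristics; it only relies on the fact that each tuple returned by page.get_text("words") has the form (x0, y0, x1, y1, word, block_no, line_no, word_no), so the block and line that MuPDF assigned to a word can be read back from the tuple. The helper names group_words_by_block and demo are hypothetical and used for illustration only.

```python
from collections import defaultdict


def group_words_by_block(words):
    """Regroup "words" tuples into one text string per block.

    Each item is expected to look like the tuples from page.get_text("words"):
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    Hypothetical helper: it reuses the block/line numbers MuPDF already
    assigned and does not detect blocks itself.
    """
    # Collect the words of every (block, line) pair.
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))

    # Join words into lines, then lines into blocks, in numeric order.
    blocks = defaultdict(list)
    for block_no, line_no in sorted(lines):
        ordered = sorted(lines[(block_no, line_no)])
        blocks[block_no].append(" ".join(text for _, text in ordered))

    return {block_no: "\n".join(parts) for block_no, parts in blocks.items()}


def demo(pdf_path):
    """Print the regrouped text of every page (requires PyMuPDF)."""
    import fitz  # imported here so the helper above works without PyMuPDF

    doc = fitz.open(pdf_path)
    for page in doc:
        words = page.get_text("words")
        for block_no, text in group_words_by_block(words).items():
            print(page.number, block_no, repr(text))
```

Calling demo("path/to/some.pdf") should print text grouped much like the output of get_text("blocks"), although whitespace may differ because words are rejoined here with single spaces.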