Question about block merging algorithm #2213

shihanmax · 2023-02-06T09:36:10Z

shihanmax
Feb 6, 2023

Hello everyone, in the following code,

import fitz

pdf_path = "path/to/some.pdf"
doc = fitz.open(pdf_path)

for page in doc:
    blocks = page.get_text("blocks")
    line_of_words = page.get_text("words")

'blocks' is a set of paragraphs, while 'line_of_words' is a set of more finely split lines. My question is: which algorithm is used to aggregate line_of_words into blocks? Is there any relevant literature, code, or pseudocode for reference?

Thank you.

Answered by JorjMcKie

Feb 6, 2023


JorjMcKie · 2023-02-06T10:04:46Z

JorjMcKie
Feb 6, 2023
Maintainer

This is a typical "Discussions" post, so I will first transfer it there.

0 replies

JorjMcKie · 2023-02-06T10:15:27Z

JorjMcKie
Feb 6, 2023
Maintainer

Have a look at the documentation of class TextPage. Every text extraction and text search in PyMuPDF is based on a TextPage.

A TextPage is created by the underlying C library MuPDF. It works for all document types, not just PDF. MuPDF also contains all the heuristic algorithms that separate a page's text into blocks, lines and text spans.
As this is open source, you are free to study the respective C code.

The text extraction variants in PyMuPDF each target specific needs in terms of ease of use, desired level of detail and performance but, as said, they are all based on a TextPage.

So "blocks" iterates over the blocks inside a TextPage and delivers a list with concatenated lines of each block.
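As a rough illustration of that concatenation step, here is a minimal Python sketch. It works on a hand-written structure shaped like the output of page.get_text("dict") (blocks containing lines containing spans). The sample data and the helper name block_texts are hypothetical, and the actual C implementation may differ in detail.

```python
def block_texts(page_dict):
    """Return one string per text block, joining the block's lines with newlines."""
    result = []
    for block in page_dict["blocks"]:
        # In the "dict" output, type 0 marks a text block and type 1 an image block.
        if block.get("type", 0) != 0:
            continue
        lines = []
        for line in block["lines"]:
            # A line's text is the concatenation of its spans.
            lines.append("".join(span["text"] for span in line["spans"]))
        result.append("\n".join(lines))
    return result


# Hypothetical sample shaped like page.get_text("dict"), reduced to the keys used here.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [{"text": "Hello "}, {"text": "world"}]},
                {"spans": [{"text": "second line"}]},
            ],
        },
        {"type": 1},  # image block, skipped
        {"type": 0, "lines": [{"spans": [{"text": "Another block"}]}]},
    ]
}

texts = block_texts(sample_page)
```

With a real document, the same function could be fed page.get_text("dict") directly; the grouping into blocks and lines has already been decided by MuPDF at that point.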

And "words" iterates over the full TextPage hierarchy (block -> line -> character) to identify strings that contain no spaces.
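The word-splitting idea can be approximated in a few lines of Python. This is a simplified sketch, not the C code itself: it assumes each character of a line is given as a tuple (x0, y0, x1, y1, c), splits at whitespace, and joins the character rectangles of each word. The helper names and the sample coordinates are hypothetical.

```python
def merge_chars(chars):
    """Join characters into one word tuple whose bbox covers all of them."""
    x0 = min(c[0] for c in chars)
    y0 = min(c[1] for c in chars)
    x1 = max(c[2] for c in chars)
    y1 = max(c[3] for c in chars)
    return (x0, y0, x1, y1, "".join(c[4] for c in chars))


def words_from_chars(chars):
    """Split one line's characters into words at whitespace characters."""
    words = []
    current = []
    for ch in chars:
        if ch[4].isspace():
            # Whitespace ends the current word (if any) and is itself dropped.
            if current:
                words.append(merge_chars(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(merge_chars(current))
    return words


# Hypothetical line "ab c": each character is 5 units wide and 10 units high.
line_chars = [
    (0, 0, 5, 10, "a"),
    (5, 0, 10, 10, "b"),
    (10, 0, 15, 10, " "),
    (15, 0, 20, 10, "c"),
]

line_words = words_from_chars(line_chars)
```

Per-character data of this kind is available in PyMuPDF through page.get_text("rawdict"), where it is stored as dictionaries rather than the plain tuples assumed here.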

This code is also open source and written in C, but it is contained in PyMuPDF's repository.

2 replies

shihanmax Feb 6, 2023
Author

Thanks, I found some definitions at https://github.com/pymupdf/PyMuPDF/blob/master/fitz/helper-stext.i

JorjMcKie Feb 6, 2023
Maintainer

Specifically for blocks and words, also have a look at extractBLOCKS() and extractWORDS() in file fitz.i.
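To see how the two outputs relate without reading the C code, note that each item returned by page.get_text("words") is a tuple (x0, y0, x1, y1, word, block_no, line_no, word_no), so words can be mapped back to their block and line. The sketch below does that regrouping on hypothetical sample tuples; the function name group_words_by_block is made up for illustration.

```python
from collections import defaultdict


def group_words_by_block(words):
    """Map block_no -> list of line strings, rebuilt from word tuples."""
    per_block = defaultdict(lambda: defaultdict(list))
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        per_block[block_no][line_no].append((word_no, text))

    grouped = {}
    for block_no in sorted(per_block):
        lines = per_block[block_no]
        grouped[block_no] = [
            # Order words inside a line by word_no, then join with single spaces.
            " ".join(text for _, text in sorted(lines[line_no]))
            for line_no in sorted(lines)
        ]
    return grouped


# Hypothetical tuples in the format returned by page.get_text("words").
sample_words = [
    (12, 0, 30, 10, "world", 0, 0, 1),
    (0, 0, 10, 10, "Hello", 0, 0, 0),
    (0, 12, 20, 22, "again", 0, 1, 0),
    (0, 40, 15, 50, "Next", 1, 0, 0),
]

by_block = group_words_by_block(sample_words)
```

This only reuses the block and line numbers that MuPDF has already assigned; it does not reproduce the layout heuristics that decide where a block starts or ends.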
