Set space threshold to decide text blocks #1358

Exorcismus · 2021-11-01T00:30:30Z

Exorcismus
Nov 1, 2021

Is there a way to set a spacing threshold that the library uses to decide whether text belongs to the same text block?

JorjMcKie · 2021-11-01T09:00:20Z

JorjMcKie
Nov 1, 2021
Maintainer

No, this is decided inside MuPDF code, with no way to influence it.

2 replies

dhirajsuvarna Oct 11, 2022

Hey @JorjMcKie,

I am trying to use "blocks" to group the text in the PDF, but somehow the result does not match visual intuition.

Can you shed some light on where I can find how this grouping of text into lines and then into blocks is done? Isn't it based on the distance between two words?

JorjMcKie Oct 11, 2022
Maintainer

I think I mentioned above that the logic is contained in MuPDF, and executed when the TextPage object is built.
This is a multi-factor algorithm that takes into account the text position, font, font size, inter-line distance, word distance and more.
There is no effort to reorder text by its position on the page before this algorithm is invoked, and this may be important for your case. So if the following has the reading sequence "text1 text2 text3 text4 text5" but is added to the page like this

text1 text3 text5
text2 text4

then it will be extracted in that "physical" sequence, although nothing seems wrong when looking at the page in a PDF viewer. Correspondingly, MuPDF's algorithm will probably detect 2 lines or even blocks.
You can extend this example down to every single character!
So it is fair to say that MuPDF's block-line-span detection works for pages made in a canonical way, but not for the more exotic cases.
Simpler examples of unexpected results are line distances exceeding the threshold: such lines will land in different blocks.
Or words contained in different table columns, etc.
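
To make the "physical sequence" point concrete, here is a minimal sketch that re-sorts extracted words into reading order after the fact. It assumes word tuples shaped like the items of page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The function name and the line_tol tolerance are illustrative assumptions, not part of PyMuPDF, and the sketch only handles simple single-column text.

```python
def reading_order(words, line_tol=3.0):
    """Sort word tuples top-to-bottom, then left-to-right.

    words: tuples shaped like page.get_text("words") items:
           (x0, y0, x1, y1, text, block_no, line_no, word_no)
    line_tol: assumed tolerance (in points) within which two word
              bottoms (y1) are treated as lying on the same line
    """
    lines = []  # each entry: [reference_y1, [words on that line]]
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(w)  # same visual line as the previous word
        else:
            lines.append([w[3], [w]])  # start a new visual line
    ordered = []
    for _, line_words in lines:
        ordered.extend(sorted(line_words, key=lambda w: w[0]))  # left to right
    return ordered


# The example from above: five words on one visual line, stored out of order.
physical = [
    (10, 100, 40, 112, "text1", 0, 0, 0),
    (90, 100, 120, 112, "text3", 0, 0, 1),
    (170, 100, 200, 112, "text5", 0, 0, 2),
    (50, 100, 80, 112, "text2", 1, 0, 0),
    (130, 100, 160, 112, "text4", 1, 0, 1),
]
print(" ".join(w[4] for w in reading_order(physical)))
```

This only changes the order of the words. It does not change which block or line MuPDF assigned them to.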

jemc · 2022-12-29T03:27:19Z

jemc
Dec 29, 2022

Is there any known way in PyMuPDF to split the MuPDF-detected blocks into smaller blocks based on some heuristic?

I have a PDF of a dictionary that uses a 2-column page layout with very little space between columns, such that MuPDF and PyMuPDF are treating them as a single block.

I'd like to have some way to split the block vertically based on a heuristic that can find the column break better than the out-of-the-box MuPDF algorithm seems to be able to. Is there a known way of doing this split in PyMuPDF, to split the block found by MuPDF?

2 replies

JorjMcKie Dec 29, 2022
Maintainer

It all depends on the concrete situation, you may have to try things out and see ...
For example, if you look at the list page.get_text("words") and count the frequency of the x0 values (first item of the tuples in that list), you should see high counts for the left-most and the middle x-coordinate. Here is a snippet:

from pprint import pprint

import fitz  # PyMuPDF

doc = fitz.open("two-columns0.pdf")
page = doc[0]  # a page with 2-column layout
words = page.get_text("words")  # read the words
x_freq = {}  # key: left coord of word, value: how often
for w in words:
    x_freq[w[0]] = x_freq.get(w[0], 0) + 1

# put dict items in list, sorted by left coord frequency
frequencies = sorted(x_freq.items(), key=lambda i: i[1], reverse=True)
pprint(frequencies[:10])  # look at the 10 most popular word starts:
[(318.0, 41),  # 41 words start here
 (48.0, 27),  # 27 words start here
 (333.0, 5),
 (337.343994140625, 4),
 (106.01400756835938, 2),
 (71.35200500488281, 2),
 (228.1079559326172, 2),
 (370.0199890136719, 2),
 (463.4039611816406, 2),
 (374.68798828125, 2)]
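
The counting above can be turned into a small helper that picks out the likely column starts automatically. This is only a sketch under assumptions: the function name, the min_share cut-off and the rounding of x0 to whole points are illustrative choices, not something PyMuPDF provides, and they will need tuning per document.

```python
from collections import Counter


def column_starts(words, min_share=0.15, ndigits=0):
    """Return the x0 values at which an unusually large share of words start.

    words: tuples shaped like page.get_text("words") items (x0 is item 0)
    min_share: assumed minimum fraction of all words that must start at
               the same (rounded) x0 for it to count as a column start
    ndigits: rounding applied to x0 so near-identical coordinates are pooled
    """
    counts = Counter(round(w[0], ndigits) for w in words)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(x for x, n in counts.items() if n / total >= min_share)
```

With words = page.get_text("words"), calling column_starts(words) gives candidate left edges that can replace the hard-coded coordinates when many pages have to be processed.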

The page looks like this:

JorjMcKie Dec 29, 2022
Maintainer

So in this case the data extracted above tell us: (1) we have a 2-column page, (2) we know the left column coordinates.
We can therefore make two sub-rectangles of the page.rect and extract text separately from each of these:

prect = page.rect
lcol = fitz.Rect(48, 0, 318, prect.height)  # rect for the left column
rcol = fitz.Rect(318, 0, prect.width, prect.height)  # rect for the right column
# we are now positioned to extract text column by column:
ltext = page.get_text(clip=lcol, sort=True)  # text in left column, sorted
rtext = page.get_text(clip=rcol, sort=True)  # text in right column, sorted

all_text = ltext + rtext  # combined page text with improved reading sequence

Note that the above clip parameter is supported for all .get_text() variants "text", "words", "blocks", "dict", and "rawdict". So the above approach works in all these cases.
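
For pages with more than two columns, the clip rectangles can be built from a list of column left edges. The helper below is a hypothetical sketch (column_rects is not a PyMuPDF function). It returns plain (x0, y0, x1, y1) tuples, on the assumption that clip accepts any rect-like sequence; if in doubt, wrap each tuple in fitz.Rect.

```python
def column_rects(starts, page_width, page_height):
    """Turn column left edges into (x0, y0, x1, y1) clip rectangles.

    Each column runs from its own left edge to the next column's left
    edge; the last column runs to the right page border.
    """
    edges = sorted(starts) + [page_width]
    return [(x0, 0, x1, page_height) for x0, x1 in zip(edges, edges[1:])]


# Illustrative use with a page object (not executed here):
# rects = column_rects([48, 318], page.rect.width, page.rect.height)
# all_text = "".join(page.get_text(clip=r, sort=True) for r in rects)
```

For the two-column page above this yields the same two rectangles as lcol and rcol.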

Please also note that the above approach can help with analyzing table structures:
(Py)MuPDF contains no logic to detect the presence or location of tables on a page. This is a very complex task, and we know of no failsafe way to do this. Existing approaches generally involve Artificial Intelligence and/or Machine Learning.
But once the bounding box of a table is known or provided, the above and related techniques are helpful for detecting columns and rows within the table. Combined with text font information and vector graphics extraction, header lines and column and row delimiters can be identified.

mirix · 2023-06-06T09:08:16Z

mirix
Jun 6, 2023

In my experience, whenever a PDF is generated by print-to-PDF from HTML, each line is detected as a separate block. I have tried different options. In my opinion, this is not anecdotal, as print-to-PDF is often the most convenient way to save a dynamic web page.
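
Since the block detection itself cannot be configured, one possible workaround for such one-line-per-block pages is to merge blocks again after extraction, using your own spacing threshold. The following is a rough sketch, not a PyMuPDF feature: merge_blocks and max_gap are made-up names, the input is assumed to look like page.get_text("blocks") items, i.e. (x0, y0, x1, y1, text, block_no, block_type), and it only makes sense for single-column text.

```python
def merge_blocks(blocks, max_gap=4.0):
    """Merge vertically adjacent text blocks separated by at most max_gap points.

    blocks: tuples shaped like page.get_text("blocks") items:
            (x0, y0, x1, y1, text, block_no, block_type)
    max_gap: assumed threshold; a block whose top is no more than this
             far below the previous block's bottom is appended to it
    Returns (x0, y0, x1, y1, text) tuples for the merged blocks.
    """
    merged = []
    text_blocks = (b for b in blocks if b[6] == 0)  # block_type 0 = text
    for b in sorted(text_blocks, key=lambda b: (b[1], b[0])):
        x0, y0, x1, y1, text = b[:5]
        if merged and y0 - merged[-1][3] <= max_gap:
            px0, py0, px1, py1, ptext = merged[-1]
            merged[-1] = [min(px0, x0), py0, max(px1, x1), max(py1, y1),
                          ptext + " " + text.strip()]
        else:
            merged.append([x0, y0, x1, y1, text.strip()])
    return [tuple(m) for m in merged]
```

A reasonable starting value for max_gap is somewhat less than the distance between paragraphs on the page, which has to be found by trial for each kind of document.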

0 replies

mirix · 2023-06-06T11:02:17Z

mirix
Jun 6, 2023

I have added an exception in my PDF processing pipeline:

from subprocess import PIPE, Popen

import fitz  # PyMuPDF

# pdfs: an iterable of PDF file paths, defined elsewhere in the pipeline
corpus = []
for pdf in pdfs:
    fpdf = fitz.open(pdf)
    producer = fpdf.metadata.get('producer') or ''  # producer may be missing or empty
    if 'cairo' in producer:
        # cairo-produced PDF: let pdftotext identify the paragraphs
        inpdf = Popen(['pdftotext', '-fixed', '100', pdf, '-'], stdout=PIPE).communicate()[0]
        doc = [i.replace('\n', ' ').replace('\x0c', '').replace('  ', ' ').strip() for i in inpdf.decode('utf-8').split('\n\n')]
        for paragraph in doc:
            # count alphabetic words longer than 2 characters
            words = sum(w.isalpha() and len(w) > 2 for w in paragraph.split())
            if words > 3:
                corpus.append(paragraph)
    else:
        # all other PDFs: use PyMuPDF blocks as paragraphs
        for page in fpdf:
            blocks = page.get_text('blocks')
            for b in blocks:
                paragraph = str(b[4]).replace('\n', ' ').replace('  ', ' ').replace('  ', ' ').strip()
                words = sum(w.isalpha() and len(w) > 2 for w in paragraph.split())
                if words > 3:
                    corpus.append(paragraph)

So most documents (at least 3/4) are processed with PyMuPDF in block mode, as this typically works best. However, this does not work when the producer is cairo (1/4 of the documents). In such cases I use pdftotext from xpdf because, among the numerous tools I have tried, it is the only one that correctly identifies the paragraphs (and only if the -fixed flag is used).

0 replies
