Identifying paragraphs in PDF files #3133

CaiSamuelsFSA · 2024-02-05T17:37:51Z

Description of the bug

I am trying to extract paragraphs of text from PDF files, and as far as I'm aware, blocks seem to be the only way to identify paragraphs in PyMuPDF. The problem is that blocks often cut off paragraphs and separate them into multiple blocks.

How to reproduce the bug

Paragraphs are being cut off when extracting text in blocks.

PyMuPDF version

1.23.21

Operating system

Windows

Python version

3.11

JorjMcKie · 2024-02-05T21:38:51Z

Maintainer

This is not an issue but a Discussions item - transferring ...

0 replies

JorjMcKie · 2024-02-05T21:58:53Z

Maintainer

Sorry to say, but the MuPDF-generated hierarchy block -> line -> span -> char is and will always remain an approximation.
There are and always will be examples where the underlying algorithms fail. Anyone can construct such a PDF.
If you find that the blocks do not reflect what your human eye interprets as a paragraph, you have no choice other than going down to deeper levels of detail, like line, span or even character positions.
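
As a minimal sketch of going one level below blocks, the function below flattens a dictionary shaped like the output of PyMuPDF's `page.get_text("dict")` into one record per line. The name `iter_lines` and the record fields are illustrative choices, not part of PyMuPDF; the bold test relies on bit 4 (value 16) of a span's `flags`.

```python
BOLD_FLAG = 16  # bit 4 of a span's "flags" value marks bold text


def iter_lines(page_dict: dict):
    """Yield one record per non-empty text line (hypothetical helper)."""
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # skip image blocks, they have no lines
            continue
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            text = "".join(span["text"] for span in spans)
            if not text.strip():
                continue  # ignore lines that contain only whitespace
            visible = [span for span in spans if span["text"].strip()]
            yield {
                "block": block_no,            # index of the enclosing block
                "bbox": tuple(line["bbox"]),  # (x0, y0, x1, y1) of the line
                "text": text,
                # a line counts as bold only if every visible span is bold
                "bold": all(span["flags"] & BOLD_FLAG for span in visible),
            }
```

With PyMuPDF installed, the input would typically come from `page.get_text("dict")`; the records can then be regrouped by whatever rule fits the document at hand.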

8 replies

JorjMcKie Feb 6, 2024
Maintainer

No general rule. Just ask yourself where your intuition leads you:

Why wouldn't you join line 5.3 with 5.3.1? Probably because of the different font weights (bold / non-bold).
Why would you join line 5.3.1 with the two subsequent lines? More difficult to answer: maybe the missing dot at the end of the first line? Or the (approximately!!) same indentation of the non-numeric text "The Authority ..." and then "outlined ..." in line 2?

But all that is close to semantic interpretation and is in no way implied by facts.
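
The intuitions in the two questions above could be written down roughly as follows. This is only a sketch: `should_join`, the record fields `x0` (left edge of a line) and `body_x0` (left edge of the text after any leading numbering such as "5.3.1"), and the tolerance of 3 points are all assumptions made for illustration.

```python
def should_join(prev: dict, cur: dict, indent_tol: float = 3.0) -> bool:
    """Guess whether line `cur` continues the paragraph of line `prev`."""
    if prev["bold"] != cur["bold"]:
        return False  # different font weights, e.g. heading vs. body text
    if prev["text"].rstrip().endswith("."):
        return False  # a trailing dot hints that the paragraph has ended
    # Join only if the new line starts at (approximately) the same
    # indentation as the non-numeric text of the previous line.
    return abs(prev["body_x0"] - cur["x0"]) <= indent_tol


heading = {"text": "5.3 Scope", "bold": True, "x0": 72.0, "body_x0": 100.0}
first = {"text": "5.3.1 The Authority shall", "bold": False, "x0": 72.0, "body_x0": 108.0}
second = {"text": "outlined in the policy.", "bold": False, "x0": 108.4, "body_x0": 108.4}
third = {"text": "5.3.2 Another clause", "bold": False, "x0": 72.0, "body_x0": 108.0}

print(should_join(heading, first))  # bold vs. non-bold
print(should_join(first, second))   # no trailing dot, matching indentation
print(should_join(second, third))   # previous line ends with a dot
```

Each test encodes a guess about one kind of document, so the order and the tolerance would need tuning per input.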

MuPDF's reason for making separate blocks at this point surely was that the text "5.3.1 The Authority ..." and "outlined ..." are enclosed in separate text objects in the page's appearance source.
A text object is defined as source lines enclosed in the pair of keywords "BT" and "ET" ("begin text" / "end text").

We can only guess at the PDF creator's motivation for creating separate text objects: probably it was to avoid the extra effort of computing the relative indentation difference between the "5.3.1" line and the subsequent one. It was simply easier to start a new object "outlined ..." with an absolute position.
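
To see how a page's text is split into "BT" ... "ET" objects, a naive scan of the content stream can help. The sketch below assumes the stream is already decompressed and that "BT" / "ET" never occur as standalone words inside a string literal; `text_objects` is a hypothetical helper, not a PyMuPDF function.

```python
def text_objects(stream: bytes) -> list[bytes]:
    """Return the body of every BT ... ET pair found in a content stream."""
    objects: list[bytes] = []
    current = None  # tokens of the text object being collected, if any
    for token in stream.split():  # operators are whitespace-delimited
        if token == b"BT":
            current = []
        elif token == b"ET" and current is not None:
            objects.append(b" ".join(current))
            current = None
        elif current is not None:
            current.append(token)
    return objects


stream = (
    b"BT /F1 11 Tf 72 700 Td (5.3.1 The Authority) Tj ET\n"
    b"BT /F1 11 Tf 108 686 Td (outlined) Tj ET\n"
)
for body in text_objects(stream):
    print(body)
```

In PyMuPDF the bytes could be obtained with `page.read_contents()`. Two objects with separate `Td` positions, as in this made-up stream, are the kind of layout described above.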

CaiSamuelsFSA Feb 6, 2024
Author

Thank you for the explanation, that makes a lot of sense.

Just thinking, would it be possible to identify paragraphs based on the amount of space between the lines? So if the space between 2 text lines is larger than a certain amount, we can assume it is a new paragraph?

If so, how could this be implemented?

JorjMcKie Feb 6, 2024
Maintainer

You can do that. But please be aware that any assumptions made in this case may prove invalid with another document.
I have often been successful by looking at inter-line distances within a block (i.e. the difference between their y1 values), then cross-checking with line distances between different blocks.
If a certain threshold (1.5, for example) is exceeded, assume a new paragraph, and so on.
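
A minimal sketch of that idea follows. It interprets the threshold of 1.5 as a factor applied to the median distance between consecutive y1 values, which is an assumption; the reply does not say what the value is relative to. It also assumes a single-column page whose lines are already sorted top to bottom, with records carrying `bbox` and `text` fields (hypothetical names).

```python
from statistics import median


def group_paragraphs(lines: list[dict], factor: float = 1.5) -> list[str]:
    """Split lines into paragraphs wherever the y1 distance is unusually large."""
    if not lines:
        return []
    # distance between the bottom edges (y1) of consecutive lines
    gaps = [b["bbox"][3] - a["bbox"][3] for a, b in zip(lines, lines[1:])]
    positive = [gap for gap in gaps if gap > 0]
    typical = median(positive) if positive else 0.0
    paragraphs = [[lines[0]["text"]]]
    for gap, line in zip(gaps, lines[1:]):
        if typical and gap > factor * typical:
            paragraphs.append([line["text"]])    # large gap: new paragraph
        else:
            paragraphs[-1].append(line["text"])  # normal gap: same paragraph
    return [" ".join(part.strip() for part in para) for para in paragraphs]


lines = [
    {"bbox": (72, 88, 500, 100), "text": "First line"},
    {"bbox": (72, 100, 500, 112), "text": "second line"},
    {"bbox": (72, 112, 500, 124), "text": "third line."},
    {"bbox": (72, 138, 500, 150), "text": "New paragraph"},
    {"bbox": (72, 150, 500, 162), "text": "continues."},
]
print(group_paragraphs(lines))
```

Because the lines are taken from the whole page rather than per block, distances across block boundaries are compared against the same typical value.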

CaiSamuelsFSA Feb 7, 2024
Author

Thank you for confirming, it seems to be working pretty well so far.

One thing I'm struggling with is how to identify if a paragraph continues on the next page. What I've come up with so far is checking if the text on the previous page ends with a period (.), and if it does, we can assume that the paragraph has ended. However, there could be cases where the period only signifies the end of the sentence, and the paragraph still continues on the next page.

Is there a better way of identifying if the paragraph continues on the next page?

JorjMcKie Feb 7, 2024
Maintainer

I cannot think of anything that is stable across multiple PDF examples.
As always in this area of semantic analysis, the only way to get better and more stable results is to categorize the input files:
first determine the type of a given file, and then activate the best-fitting logic ...
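
One way to organize such per-category logic is a small lookup table of rules. The sketch below is purely illustrative: the categories "report" and "slides", the function names, and the rules themselves are assumptions, and the period check is the heuristic from the question, with the weakness already noted there.

```python
from typing import Callable

# A rule receives the last line of one page and the first line of the next.
ContinuationRule = Callable[[str, str], bool]


def ends_without_period(last_line: str, first_line: str) -> bool:
    """Assume the paragraph continues unless the page ends with a period."""
    return not last_line.rstrip().endswith(".")


def never_continues(last_line: str, first_line: str) -> bool:
    """For page-oriented documents, treat every page as self-contained."""
    return False


# Hypothetical categories mapped to the rule that fits them best.
RULES: dict[str, ContinuationRule] = {
    "report": ends_without_period,
    "slides": never_continues,
}


def paragraph_continues(category: str, last_line: str, first_line: str) -> bool:
    """Pick the rule for the document category, with a default fallback."""
    rule = RULES.get(category, ends_without_period)
    return rule(last_line, first_line)


print(paragraph_continues("report", "the Authority shall", "ensure that"))
print(paragraph_continues("slides", "the Authority shall", "ensure that"))
```

How a file is assigned to a category is left open here, since that step depends entirely on the documents being processed.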
