Identifying paragraphs in PDF files #3133
Unanswered
CaiSamuelsFSA
asked this question in
Q&A
Replies: 2 comments 8 replies
-
This is not an issue but a Discussions item - transferring ...
0 replies
-
Sorry to say that, but the MuPDF-generated hierarchy
8 replies
-
Description of the bug
I am trying to extract paragraphs of text from PDF files. As far as I'm aware, blocks are the only way to identify paragraphs in PyMuPDF. The problem is that block boundaries often fall in the middle of a paragraph, so a single paragraph ends up split across multiple blocks.
How to reproduce the bug
Paragraphs are being cut off when extracting text in blocks.
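To make the symptom concrete, here is a minimal sketch of one possible way to rejoin split blocks. It is not a PyMuPDF feature and not a confirmed fix. It assumes the input tuples have the layout returned by `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)`. The `merge_blocks` helper, the `max_gap` threshold, and the punctuation rule are all hypothetical and would need tuning per document.

```python
# Block tuples are assumed to follow the layout of PyMuPDF's
# page.get_text("blocks"): (x0, y0, x1, y1, text, block_no, block_type),
# where block_type 0 means a text block.

# Characters that are taken to mark the end of a paragraph (an assumption).
SENTENCE_END = (".", "!", "?", ":", '"', "\u201d")


def merge_blocks(blocks, max_gap=12.0):
    """Join consecutive text blocks that look like one split paragraph.

    max_gap (in points) is a hypothetical tuning value, not a PyMuPDF setting.
    """
    paragraphs = []
    prev_bottom = None
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # skip image blocks
            continue
        text = " ".join(text.split())  # collapse line breaks inside the block
        if not text:
            continue
        # Treat the block as a continuation if it sits close below the
        # previous text block and the previous text does not end a sentence.
        continues = (
            bool(paragraphs)
            and prev_bottom is not None
            and abs(y0 - prev_bottom) <= max_gap
            and not paragraphs[-1].endswith(SENTENCE_END)
        )
        if continues:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        prev_bottom = y1
    return paragraphs
```

With PyMuPDF installed, the input would come from `page.get_text("blocks")` for each page of the opened document. The heuristic only looks at vertical distance and trailing punctuation, so it will not handle multi-column layouts, hyphenated line ends, or paragraphs that continue on the next page.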
PyMuPDF version
1.23.21
Operating system
Windows
Python version
3.11