Set space threshold to decide text blocks #1358
-
Is there a way to control when the library decides whether text belongs to the same text block, based on some kind of spacing threshold?
-
No, this is decided inside MuPDF code, with no way to influence it.
-
Is there any known way in PyMuPDF to split the MuPDF-detected blocks into smaller blocks based on some heuristic? I have a PDF of a dictionary that uses a 2-column page layout with very little space between columns, such that MuPDF and PyMuPDF are treating them as a single block. I'd like to have some way to split the block vertically based on a heuristic that can find the column break better than the out-of-the-box MuPDF algorithm seems to be able to. Is there a known way of doing this split in PyMuPDF, to split the block found by MuPDF?
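One possible heuristic, sketched below, is to work on word bounding boxes instead of blocks and look for the widest vertical strip that no word covers. This is not a PyMuPDF feature. The function names and the `min_gap` threshold are hypothetical, and the sketch assumes word tuples that start with `(x0, y0, x1, y1, text)`, which is the layout `page.get_text("words")` returns in PyMuPDF.

```python
def find_column_split(words, min_gap=3.0):
    """Return the x coordinate in the middle of the widest horizontal gap
    that no word covers, or None if no gap is at least min_gap wide.

    words: iterable of tuples starting with (x0, y0, x1, y1, text).
    min_gap: hypothetical threshold in points; tune it per document.
    """
    intervals = sorted((w[0], w[2]) for w in words)
    if not intervals:
        return None
    best_gap, best_x = 0.0, None
    covered_to = intervals[0][1]  # right edge of the x-range covered so far
    for x0, x1 in intervals[1:]:
        gap = x0 - covered_to
        if gap > best_gap:
            best_gap, best_x = gap, covered_to + gap / 2
        covered_to = max(covered_to, x1)
    return best_x if best_gap >= min_gap else None


def split_words_into_columns(words, min_gap=3.0):
    """Split words into [left, right] at the detected gap,
    or return [all_words] when no usable gap is found."""
    words = list(words)
    split_x = find_column_split(words, min_gap)
    if split_x is None:
        return [words]
    left = [w for w in words if w[2] <= split_x]
    right = [w for w in words if w[0] >= split_x]
    return [left, right]


# Two lines of text with a 5-point gutter between x=95 and x=100.
sample_words = [
    (10, 10, 40, 20, "alpha"), (45, 10, 95, 20, "beta"),
    (100, 10, 140, 20, "gamma"), (145, 10, 190, 20, "delta"),
    (10, 25, 60, 35, "one"), (65, 25, 95, 35, "two"),
    (100, 25, 120, 35, "three"), (125, 25, 190, 35, "four"),
]
columns = split_words_into_columns(sample_words)
column_texts = [" ".join(w[4] for w in col) for col in columns]
```

With PyMuPDF, the input could come from `page.get_text("words")`, filtered to the words of the block in question. The approach only works if the gutter is wider than `min_gap` and no word or heading spans both columns.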
-
In my experience, whenever a PDF is generated by print-to-PDF from HTML, each line is detected as a separate block. I have tried different options. In my opinion, this is not a rare case, as print-to-PDF is often the most convenient way to save a dynamic web page.
-
I have added an exception to my PDF processing pipeline.
Most documents (at least 3/4) are processed with PyMuPDF in block mode, as this typically works best. However, this does not work when the producer is cairo (1/4 of the documents). In those cases I use pdftotext from xpdf since, among the numerous tools I have tried, it is the only one that correctly identifies the paragraphs (and only if the -fixed flag is used).
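A minimal sketch of such a producer-based switch might look like the following. The function names and the returned labels are hypothetical, the substring match on "cairo" is an assumption about how the producer string appears, and the actual pdftotext invocation is left out.

```python
def choose_extractor(producer):
    """Pick an extraction backend from the PDF's producer metadata string.

    The labels "pdftotext" and "pymupdf-blocks" are hypothetical names
    for the two branches of the pipeline described above.
    """
    if producer and "cairo" in producer.lower():
        return "pdftotext"
    return "pymupdf-blocks"


def read_producer(path):
    """Read the producer field with PyMuPDF (imported lazily so the
    selection logic above can be used without it installed)."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return (doc.metadata or {}).get("producer") or ""


def extract_block_texts(path):
    """Return the text of every text block, page by page, using
    PyMuPDF's block mode."""
    import fitz  # PyMuPDF

    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 means a text block.
            for block in page.get_text("blocks"):
                if block[6] == 0:
                    texts.append(block[4])
    return texts
```

A caller would run `choose_extractor(read_producer(path))` and then either call `extract_block_texts(path)` or hand the file to pdftotext.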