Description
Describe the bug (mandatory)
Converting a PDF to HTML throws away part of the document structure.
Other output types (e.g. dict) produce the following structure:
- page
  - text block
    - line
      - span
    - line
  - text block
But HTML has the following:
- page
  - line
    - span
  - line
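To make the comparison concrete, here is a small sketch that walks the hierarchy the dict output exposes. `outline` is a hypothetical helper name of mine, and it assumes the documented layout of `page.get_text("dict")`: a `blocks` list in which each text block holds `lines`, and each line holds `spans`. It takes an already-extracted dict, so the sample below is a hand-written stand-in rather than real output.

```python
def outline(page_dict):
    """Return an indented outline of a page dict (page > text block > line > span)."""
    rows = ["page"]
    for block in page_dict.get("blocks", []):
        # Assumption: type 0 marks a text block; other types (images) are skipped.
        if block.get("type", 0) != 0:
            continue
        rows.append("  text block")
        for line in block.get("lines", []):
            rows.append("    line")
            for span in line.get("spans", []):
                rows.append("      span: " + span.get("text", ""))
    return rows


# Hand-written stand-in for page.get_text("dict"); the real dict has more keys.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Lorem ipsum"}]},
                              {"spans": [{"text": "aliqua."}]}]},
        {"type": 0, "lines": [{"spans": [{"text": "Duis aute"}]}]},
    ]
}
print("\n".join(outline(sample)))
```

With a real document, the same helper would be called as `outline(page.get_text("dict"))`.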
To Reproduce (mandatory)
This happens with any PDF, but I'm using lorem-two-para.pdf, which looks like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Calling things like get_text('dict') and get_text('blocks') shows that PyMuPDF correctly interprets the document as having two text blocks/paragraphs:
[
(56.76000213623047, 70.90341186523438, 524.3014526367188, 96.03419494628906,
'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna \naliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n',
0, 0),
(56.76000213623047, 121.30341339111328, 503.4024658203125, 146.43418884277344,
'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint \noccaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n',
1, 0)
]
But get_text('html') looks like this (note that I've removed the styling data for brevity):
<div id="page0">
<p>
<span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
</p>
<p>
<span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
</p>
<p>
<span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
</p>
<p>
<span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
</p>
</div>
The output flattens the lines (represented as p tags), and there's no way of knowing which lines belong to the same text block. This is a problem for me, as I need to post-process the HTML output in a way that requires the block information.
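As a stopgap, the block grouping can probably be rebuilt from the dict output instead. The following is a minimal sketch, not a replacement for the real HTML output: `blocks_to_html` and the `page0block0`-style ids are my own naming, it assumes the documented `blocks` / `lines` / `spans` layout of `page.get_text("dict")`, and it drops all styling and images.

```python
from html import escape


def blocks_to_html(page_dict, page_no=0):
    """Build HTML with one <div> per text block and one <p> per line."""
    out = [f'<div id="page{page_no}">']
    block_no = 0
    for block in page_dict.get("blocks", []):
        # Assumption: only type 0 (text) blocks carry "lines"; images are skipped.
        if block.get("type", 0) != 0:
            continue
        out.append(f'<div id="page{page_no}block{block_no}">')
        for line in block.get("lines", []):
            spans = "".join(
                f"<span>{escape(span.get('text', ''))}</span>"
                for span in line.get("spans", [])
            )
            out.append(f"<p>{spans}</p>")
        out.append("</div>")
        block_no += 1
    out.append("</div>")
    return "\n".join(out)


# Hand-written stand-in for page.get_text("dict") on the two-paragraph PDF.
demo = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Lorem ipsum dolor "}]},
                              {"spans": [{"text": "aliqua. Ut enim "}]}]},
        {"type": 0, "lines": [{"spans": [{"text": "Duis aute irure "}]}]},
    ]
}
html_out = blocks_to_html(demo)
print(html_out)
```

With a real page this would be called as `blocks_to_html(page.get_text("dict"), page.number)`. It loses the positioning and font CSS that get_text('html') emits, which is why I'd still prefer the grouping to come from the HTML output itself.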
Expected behavior (optional)
Ideally the HTML output would (at least by way of an optional argument) include the text block level of the structure as an element, e.g.
<div id="page0">
<div id="page0block0">
<p>
<span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
</p>
<p>
<span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
</p>
</div>
<div id="page0block1">
<p>
<span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
</p>
<p>
<span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
</p>
</div>
</div>
Your configuration (mandatory)
3.9.4 (default, Apr 10 2021, 15:31:19)
[GCC 8.3.0]
linux
PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.9 on linux (64-bit).