HTML output throws away text block information #1228

@ejohb

Describe the bug (mandatory)

Converting a PDF to HTML throws away part of the document structure.

Other output types (e.g. dict) produce the following structure:

  • page
    • text block
      • line
        • span

But HTML has the following:

  • page
    • line
      • span

To Reproduce (mandatory)

This happens on any PDF, but I'm using lorem-two-para.pdf that looks like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Calling things like get_text('dict') and get_text('blocks') shows that PyMuPDF correctly interprets the document as having two text blocks/paragraphs:

[
    (56.76000213623047, 70.90341186523438, 524.3014526367188, 96.03419494628906,
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna \naliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n',
     0, 0),
    (56.76000213623047, 121.30341339111328, 503.4024658203125, 146.43418884277344,
     'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint \noccaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n',
     1, 0)
]
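For anyone reading the tuples above: each one follows the documented `(x0, y0, x1, y1, text, block_no, block_type)` layout, with the lines of a block separated by `\n` inside the text field. A minimal sketch of unpacking them so that each line stays attached to its block number; the `TextBlock` and `lines_by_block` names are hypothetical helpers, not part of PyMuPDF, and the sample data is an abbreviated stand-in for the real output:

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical named view of one tuple from get_text('blocks')."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
    block_no: int
    block_type: int  # 0 = text block, 1 = image block


def lines_by_block(blocks):
    """Map each text block number to the list of lines it contains."""
    grouped = {}
    for raw in blocks:
        block = TextBlock(*raw)
        if block.block_type != 0:
            continue  # skip image blocks
        # Lines inside a block are separated by newline characters.
        grouped[block.block_no] = block.text.splitlines()
    return grouped


# Abbreviated stand-in for page.get_text("blocks") on the sample PDF.
sample_blocks = [
    (56.76, 70.90, 524.30, 96.03,
     "Lorem ipsum ... magna \naliqua. ... consequat. \n", 0, 0),
    (56.76, 121.30, 503.40, 146.43,
     "Duis aute ... sint \noccaecat ... laborum.\n", 1, 0),
]

grouped = lines_by_block(sample_blocks)
for block_no, lines in grouped.items():
    print(block_no, len(lines))
```

This shows that the line-to-block relationship is available in the 'blocks' output, which is exactly the information the HTML output drops.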

But get_text('html') looks like this (note that I've removed the styling data for brevity):

<div id="page0">
    <p>
        <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
    </p>
    <p>
        <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
    </p>
    <p>
        <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
    </p>
    <p>
        <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
    </p>
</div>

The output flattens the lines (represented as p tags), so there is no way of knowing which lines belong to the same text block. This is a problem for me, as I need to post-process the HTML output in a way that requires the block information.

Expected behavior (optional)

Ideally the HTML output would (at least by way of an optional argument) include the text block level of the structure as an element, e.g.

<div id="page0">
    <div id="page0block0">
        <p>
            <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
        </p>
        <p>
            <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
        </p>
    </div>
    <div id="page0block1">
        <p>
            <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
        </p>
        <p>
            <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
        </p>
    </div>
</div>
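In the meantime, a possible workaround is to build this nested markup from the dict output instead of the HTML output. The sketch below is only an illustration: `dict_to_block_html` is a hypothetical helper, not a PyMuPDF function, it assumes the page / blocks / lines / spans layout that get_text('dict') returns, it drops all styling, and the `pageNblockM` ids simply use each block's position in the list. The sample dict is hand-built and abbreviated rather than real extraction output.

```python
from html import escape


def dict_to_block_html(page_dict, page_no=0):
    """Render a get_text('dict')-shaped mapping as block-aware HTML.

    Hypothetical helper: keeps the block level as a <div>, one <p> per
    line and one <span> per span, without any styling information.
    """
    parts = [f'<div id="page{page_no}">']
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type") != 0:
            continue  # only text blocks carry lines and spans
        parts.append(f'    <div id="page{page_no}block{block_no}">')
        for line in block.get("lines", []):
            parts.append("        <p>")
            for span in line.get("spans", []):
                text = escape(span.get("text", ""))
                parts.append(f"            <span>{text}</span>")
            parts.append("        </p>")
        parts.append("    </div>")
    parts.append("</div>")
    return "\n".join(parts)


# Hand-built, abbreviated stand-in for page.get_text("dict").
sample_dict = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Lorem ipsum ... magna "}]},
            {"spans": [{"text": "aliqua. ... consequat. "}]},
        ]},
        {"type": 0, "lines": [
            {"spans": [{"text": "Duis aute ... sint "}]},
            {"spans": [{"text": "occaecat ... laborum."}]},
        ]},
    ]
}

block_html = dict_to_block_html(sample_dict)
print(block_html)
```

With a real document, the argument would be the result of page.get_text("dict"); the styling that get_text('html') emits would have to be rebuilt from the span fields separately.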

Your configuration (mandatory)

3.9.4 (default, Apr 10 2021, 15:31:19)
[GCC 8.3.0]
 linux

PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.9 on linux (64-bit).
