HTML output throws away text block information

## Describe the bug (mandatory)
Converting a PDF to HTML throws away part of the document structure.

Other output types (e.g. `dict`) produce the following structure:

- page
  - text block
    - line
      - span

But HTML has the following:

- page
    - line
      - span
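
For reference, here is a minimal sketch of walking the page/block/line/span hierarchy as it appears in the `dict` output. The helper name `summarize_structure` and the hand-written `sample_page` are illustrative only; `sample_page` mimics just the keys of `page.get_text("dict")` that the sketch reads (`blocks`, `type`, `lines`, `spans`, `text`), so it runs without PyMuPDF installed.

```python
def summarize_structure(page_dict):
    """Return (block_index, line_count, span_count) for each text block.

    `page_dict` is assumed to have the shape of page.get_text("dict"):
    a "blocks" list whose text blocks (type 0) hold "lines", and each
    line holds "spans".
    """
    summary = []
    for index, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # type 1 is an image block: no lines
            continue
        lines = block.get("lines", [])
        span_count = sum(len(line.get("spans", [])) for line in lines)
        summary.append((index, len(lines), span_count))
    return summary


# Hypothetical stand-in for page.get_text("dict") on a two-paragraph page.
sample_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Lorem ipsum dolor sit amet, "}]},
            {"spans": [{"text": "aliqua. Ut enim ad minim veniam. "}]},
        ]},
        {"type": 0, "lines": [
            {"spans": [{"text": "Duis aute irure dolor. Excepteur sint "}]},
            {"spans": [{"text": "occaecat cupidatat non proident."}]},
        ]},
        {"type": 1},  # image block, skipped by the helper
    ]
}

print(summarize_structure(sample_page))
```

The block level that this loop iterates over is the one that has no counterpart in the HTML output.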

## To Reproduce (mandatory)

This happens with any PDF; I'm using [lorem-two-para.pdf](https://github.com/pymupdf/PyMuPDF/files/7038515/lorem-two-para.pdf), which looks like this:

> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. 
> 
> Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Calling things like `get_text('dict')` and `get_text('blocks')` shows that PyMuPDF correctly interprets the document as having two text blocks/paragraphs:

```python
[
    (56.76000213623047, 70.90341186523438, 524.3014526367188, 96.03419494628906,
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna \naliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n',
     0, 0),
    (56.76000213623047, 121.30341339111328, 503.4024658203125, 146.43418884277344,
     'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint \noccaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n',
     1, 0)
]
```

But `get_text('html')` looks like this (note that I've removed the styling data for brevity):

```html
<div id="page0">
    <p>
        <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
    </p>
    <p>
        <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
    </p>
    <p>
        <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
    </p>
    <p>
        <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
    </p>
</div>
```

The output flattens the lines (represented as `p` tags), so there's no way of knowing which lines belong to the same text block. This is a problem for me, as I need to post-process the HTML output in a way that requires the block information.
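
One possible workaround, sketched under assumptions, is to build simplified HTML from the `dict` output instead, since that output still carries the block level. The function name `page_dict_to_html`, the `page{N}block{M}` id scheme, and the `demo_page` data are all hypothetical. The sketch drops styling entirely and numbers only text blocks, so it is not a replacement for `get_text('html')`.

```python
from html import escape


def page_dict_to_html(page_dict, page_number=0):
    """Render a get_text("dict")-shaped mapping as block-grouped HTML.

    Each text block becomes a <div>, each line a <p>, each span a <span>.
    Fonts, positions and images are ignored in this sketch.
    """
    parts = ['<div id="page%d">' % page_number]
    block_number = 0
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # skip image blocks
            continue
        parts.append('    <div id="page%dblock%d">' % (page_number, block_number))
        for line in block.get("lines", []):
            parts.append("        <p>")
            for span in line.get("spans", []):
                text = escape(span.get("text", ""))  # escape &, <, > and quotes
                parts.append("            <span>%s</span>" % text)
            parts.append("        </p>")
        parts.append("    </div>")
        block_number += 1
    parts.append("</div>")
    return "\n".join(parts)


# Hypothetical input; with PyMuPDF this would come from page.get_text("dict").
demo_page = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Lorem ipsum dolor sit amet, "}]},
            {"spans": [{"text": "aliqua. Ut enim ad minim veniam. "}]},
        ]},
        {"type": 0, "lines": [
            {"spans": [{"text": "Duis aute irure dolor & more."}]},
        ]},
    ]
}

demo_html = page_dict_to_html(demo_page)
print(demo_html)
```

This loses the styling that `get_text('html')` emits, which is why block grouping in the HTML output itself would still be preferable.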

## Expected behavior (optional)

Ideally the HTML output would (at least by way of an optional argument) include the text block level of the structure as an element, e.g.

```html
<div id="page0">
    <div id="page0block0">
        <p>
            <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
        </p>
        <p>
            <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
        </p>
    </div>
    <div id="page0block1">
        <p>
            <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
        </p>
        <p>
            <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
        </p>
    </div>
</div>
```

## Your configuration (mandatory)

```console
3.9.4 (default, Apr 10 2021, 15:31:19)
[GCC 8.3.0]
 linux

PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.9 on linux (64-bit).
```
