Serialization of inline groups adds extra spaces before and after group items #371

@ceberam

Description

This issue reports two gaps in the serialization of inline groups:

  • Inline groups within rich table cells are serialized to HTML with a new paragraph for each group item. To reproduce, serialize the test file docx_rich_cells.docx to HTML.
  • The serializers add a blank space before and after every item of the group.
    Some applications tolerate double blank spaces, but the extra spaces can alter the presentation of the text and may introduce new tokens. This matters in particular for inline groups created to represent formatted text.

For instance, consider the text:

Docling supports italic, strikethrough, underline, subscripts like H20, and bold.

with some applied formatting as in the following DoclingDocument, created programmatically:

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import Formatting
from docling_core.types.doc.labels import DocItemLabel

doc = DoclingDocument(name="Test")
inline = doc.add_inline_group(parent=None)
doc.add_text(label=DocItemLabel.TEXT, text="Docling supports ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="italic", parent=inline, formatting=Formatting(italic=True))
doc.add_text(label=DocItemLabel.TEXT, text=", ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="strikethrough", parent=inline, formatting=Formatting(strikethrough=True))
doc.add_text(label=DocItemLabel.TEXT, text=", ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="underline", parent=inline, formatting=Formatting(underline=True))
doc.add_text(label=DocItemLabel.TEXT, text=", subscripts like H", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="2", parent=inline, formatting=Formatting(script="sub"))
doc.add_text(label=DocItemLabel.TEXT, text="0, and ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="bold", parent=inline, formatting=Formatting(bold=True))
doc.add_text(label=DocItemLabel.TEXT, text=".", parent=inline)

# Print both serializations to inspect the spacing around each group item
print(doc.export_to_markdown())
print(doc.export_to_html())

The serialization to Markdown shows the extra spaces around each item:

Docling supports  *italic* ,  ~~strikethrough~~ ,  underline , subscripts like H 2 0, and  **bold** .
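
The Markdown output above is character-for-character what one gets by joining the already-formatted group items with a single space. The sketch below illustrates this with plain Python and no Docling imports. The `fragments` list and the `join_items` helper are hypothetical, written only for this illustration, and the single-space join is an assumption about the cause rather than a confirmed reading of the serializer code.

```python
# Rendered Markdown fragment for each item of the inline group, in order.
# The list is hand-written for illustration; it is not produced by Docling.
fragments = [
    "Docling supports ",
    "*italic*",
    ", ",
    "~~strikethrough~~",
    ", ",
    "underline",
    ", subscripts like H",
    "2",
    "0, and ",
    "**bold**",
    ".",
]


def join_items(parts: list[str], delimiter: str) -> str:
    """Hypothetical helper: concatenate group items with a delimiter."""
    return delimiter.join(parts)


# Joining with a space reproduces the output reported in this issue.
observed = join_items(fragments, " ")
# Joining with an empty string keeps the whitespace the items already carry.
expected = join_items(fragments, "")

print(observed)
print(expected)
# Whitespace tokenization splits "H20," into three tokens in the observed text.
print(len(observed.split()), len(expected.split()))
```

With the empty delimiter the text reads as intended, and "H20," stays a single whitespace-delimited token instead of being split into "H", "2", and "0,".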

The serialization to HTML shows the same extra spaces:

<html>
<body>
<div class='page'>
<span class='inline-group'>Docling supports  <em>italic</em> ,  <del>strikethrough</del> ,  <u>underline</u> , subscripts like H <sub>2</sub> 0, and  <strong>bold</strong> .</span>
</div>
</body>
</html>
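
A regression check for this behavior could scan serialized output for the spacing artifacts shown above. The following is a minimal sketch in plain Python. The `spacing_artifacts` helper and its regular expression are hypothetical, and the `expected` string is an assumption about what the fixed HTML might look like, not output taken from Docling.

```python
import re

# Two or more consecutive spaces, or a single space directly before "," or ".".
_ARTIFACT = re.compile(r" {2,}| [,.]")


def spacing_artifacts(text: str) -> list[str]:
    """Hypothetical helper: return every suspicious spacing match in text."""
    return _ARTIFACT.findall(text)


# The inline-group span as reported in this issue.
observed = (
    "<span class='inline-group'>Docling supports  <em>italic</em> ,  "
    "<del>strikethrough</del> ,  <u>underline</u> , subscripts like H "
    "<sub>2</sub> 0, and  <strong>bold</strong> .</span>"
)
# Assumed shape of the span once items are joined without extra spaces.
expected = (
    "<span class='inline-group'>Docling supports <em>italic</em>, "
    "<del>strikethrough</del>, <u>underline</u>, subscripts like H"
    "<sub>2</sub>0, and <strong>bold</strong>.</span>"
)

print(spacing_artifacts(observed))
print(spacing_artifacts(expected))
```

This pattern does not catch the single spaces inserted around the subscript in "H <sub>2</sub> 0", so a real test would probably compare against the full expected string as well.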

Metadata

Labels: bug (Something isn't working)