Status: Open
Labels: bug (Something isn't working)
Description
This issue shows gaps in the serialization of inline groups:
- Inline groups within rich table cells get serialized with a new paragraph (in HTML) for each group item. For instance, check the serialization of the test docx file docx_rich_cells.docx to HTML.
- The serializations of inline groups add a blank space before and after every item of the group.
Some applications may tolerate the extra blank spaces, but they alter the presentation of the text and can introduce new tokens. This is the case for inline groups created to represent formatted text.
For instance, consider the text:
Docling supports italic, strikethrough, underline, subscripts like H20, and bold.
with some applied formatting as in the following DoclingDocument, created programmatically:
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import Formatting
from docling_core.types.doc.labels import DocItemLabel
doc = DoclingDocument(name="Test")
inline = doc.add_inline_group(parent=None)
doc.add_text(label=DocItemLabel.TEXT, text="Docling supports ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="italic", parent=inline, formatting=Formatting(italic=True))
doc.add_text(label=DocItemLabel.TEXT, text=", ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="strikethrough", parent=inline, formatting=Formatting(strikethrough=True))
doc.add_text(label=DocItemLabel.TEXT, text=", ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="underline", parent=inline, formatting=Formatting(underline=True))
doc.add_text(label=DocItemLabel.TEXT, text=", subscripts like H", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="2", parent=inline, formatting=Formatting(script="sub"))
doc.add_text(label=DocItemLabel.TEXT, text="0, and ", parent=inline)
doc.add_text(label=DocItemLabel.TEXT, text="bold", parent=inline, formatting=Formatting(bold=True))
doc.add_text(label=DocItemLabel.TEXT, text=".", parent=inline)
doc.export_to_markdown()
doc.export_to_html()
The serialization to Markdown shows these extra spaces:
Docling supports *italic* , ~~strikethrough~~ , underline , subscripts like H 2 0, and **bold** .
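The pattern in the output above can be modelled without docling at all. The following is a minimal sketch, not the actual serializer code: it assumes each item is stripped and the items are then joined with a single space, which happens to reproduce the reported string exactly. The helper names `join_with_spaces` and `join_verbatim` are hypothetical and only serve the illustration.

```python
# Per-item Markdown fragments, in the order they were added to the inline group.
# Underline and subscript carry no Markdown markup in the reported output.
fragments = [
    "Docling supports ",
    "*italic*",
    ", ",
    "~~strikethrough~~",
    ", ",
    "underline",
    ", subscripts like H",
    "2",
    "0, and ",
    "**bold**",
    ".",
]


def join_with_spaces(parts):
    """Assumed model of the reported behavior: strip each item, join with a space."""
    return " ".join(part.strip() for part in parts)


def join_verbatim(parts):
    """Concatenate items as-is, keeping the whitespace the author put in them."""
    return "".join(parts)


reported = join_with_spaces(fragments)
expected = join_verbatim(fragments)

print(reported)
print(expected)
# Whitespace tokenization yields more tokens for the space-joined variant,
# e.g. "H20," becomes "H", "2", "0,".
print(len(reported.split()), len(expected.split()))
```

Under this assumption, the fix would be to let the text items carry their own whitespace instead of inserting a separator between them.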
Also to HTML:
<html>
<body>
<div class='page'>
<span class='inline-group'>Docling supports <em>italic</em> , <del>strikethrough</del> , <u>underline</u> , subscripts like H <sub>2</sub> 0, and <strong>bold</strong> .</span>
</div>
</body>
</html>
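A possible regression check for the HTML case is to compare the visible text of the serialized inline group against the intended sentence. The sketch below uses only the standard library's `html.parser`; the `TextExtractor` class and `visible_text` helper are hypothetical names, and the span is copied from the output reported above rather than produced by calling docling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # text pieces in document order

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text a reader would see, with whitespace left untouched."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Inline-group span as reported in this issue.
reported_span = (
    "<span class='inline-group'>Docling supports <em>italic</em> , "
    "<del>strikethrough</del> , <u>underline</u> , subscripts like H "
    "<sub>2</sub> 0, and <strong>bold</strong> .</span>"
)
expected_text = (
    "Docling supports italic, strikethrough, underline, "
    "subscripts like H20, and bold."
)

print(visible_text(reported_span))
# The comparison fails while the extra spaces are present.
print(visible_text(reported_span) == expected_text)
```

Once the serializer stops padding the items, the same comparison applied to the real `export_to_html()` output for this inline group should hold.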