Commit 834db4b

fix(markdown): fix escaping in case of nesting (#180)
Signed-off-by: Panos Vagenas <[email protected]>
1 parent 419252c commit 834db4b

18 files changed, 34 insertions(+), 24 deletions(-)

docling_core/types/doc/document.py

Lines changed: 13 additions & 3 deletions
@@ -2487,7 +2487,6 @@ def _ingest_text(text: str, do_escape_html=True, do_escape_underscores=True):
                         is_inline_scope=is_inline_scope,
                         visited=visited,
                     )
-                    # NOTE: assumes unordered (flag & marker currently in ListItem)
                     indent_str = list_level * indent * " "
                     is_ol = item.label == GroupLabel.ORDERED_LIST
                     text = "\n".join(
@@ -2501,7 +2500,12 @@ def _ingest_text(text: str, do_escape_html=True, do_escape_underscores=True):
                             for i, c in enumerate(comps)
                         ]
                     )
-                    _ingest_text(text=text)
+                    _ingest_text(
+                        text=text,
+                        # special chars have already been escaped as needed
+                        do_escape_html=False,
+                        do_escape_underscores=False,
+                    )
                 elif item.label == GroupLabel.INLINE:
                     comps = self._get_markdown_components(
                         node=item,
@@ -2520,7 +2524,13 @@ def _ingest_text(text: str, do_escape_html=True, do_escape_underscores=True):
                         is_inline_scope=True,
                         visited=visited,
                     )
-                    _ingest_text(" ".join(comps))
+                    text = " ".join(comps)
+                    _ingest_text(
+                        text=text,
+                        # special chars have already been escaped as needed
+                        do_escape_html=False,
+                        do_escape_underscores=False,
+                    )
                 else:
                     continue
 
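The effect of the two new keyword arguments can be illustrated with a minimal sketch. It assumes that child components arrive already escaped and that the parent previously escaped the joined string a second time. `escape_text` is a hypothetical, simplified stand-in for the escaping step inside `_ingest_text`, not the library's actual implementation; only the parameter names `do_escape_html` and `do_escape_underscores` are taken from the diff.

```python
import html


def escape_text(
    text: str,
    do_escape_html: bool = True,
    do_escape_underscores: bool = True,
) -> str:
    """Hypothetical stand-in for the escaping applied by `_ingest_text`."""
    if do_escape_underscores:
        # Simplified assumption: every underscore gets a backslash prefix.
        text = text.replace("_", "\\_")
    if do_escape_html:
        # html.escape turns "&" into "&amp;" (quotes left untouched here).
        text = html.escape(text, quote=False)
    return text


# A leaf component is escaped once when it is rendered.
child = escape_text("train_set & test_set")

# Old behavior: the parent (list or inline group) escapes the joined text again.
double_escaped = escape_text("- " + child)

# New behavior: the parent disables both escapes for already-escaped text.
escaped_once = escape_text(
    "- " + child,
    do_escape_html=False,
    do_escape_underscores=False,
)

print(double_escaped)
print(escaped_once)
```

Under these assumptions the double-escaped string contains `&amp;amp;`, the same pattern that this commit removes from `test/data/doc/2206.01062.yaml.md`.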

test/data/doc/2206.01062.yaml.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ In this paper, we present the DocLayNet dataset. It provides pageby-page layout
 
 This enables experimentation with annotation uncertainty and quality control analysis.
 
-- (5) Pre-defined Train-, Test- &amp;amp; Validation-set : Like DocBank, we provide fixed train-, test- &amp;amp; validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.
+- (5) Pre-defined Train-, Test- &amp; Validation-set : Like DocBank, we provide fixed train-, test- &amp; validation-sets to ensure proportional representation of the class-labels. Further, we prevent leakage of unique layouts across sets, which has a large effect on model accuracy scores.
 
 All aspects outlined above are detailed in Section 3. In Section 4, we will elaborate on how we designed and executed this large-scale human annotation campaign. We will also share key insights and lessons learned that might prove helpful for other parties planning to set up annotation campaigns.
 

test/data/doc/constructed_doc.dt

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Affiliation 2</text>
 <list_item>item 2 of neighboring list</list_item>
 <unordered_list><list_item>item 1 of sub list</list_item>
 <paragraph>Here a code snippet:</paragraph>
-<code<_unknown_>print("Hello world")</code
+<code<_unknown_><p>Hello world</p></code
 <paragraph>(to be displayed inline)</paragraph>
 </unordered_list>
 <paragraph>Here a formula:</paragraph>

test/data/doc/constructed_doc.dt.gt

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Affiliation 2</text>
 <list_item>item 2 of neighboring list</list_item>
 <unordered_list><list_item>item 1 of sub list</list_item>
 <paragraph>Here a code snippet:</paragraph>
-<code<_unknown_>print("Hello world")</code
+<code<_unknown_><p>Hello world</p></code
 <paragraph>(to be displayed inline)</paragraph>
 </unordered_list>
 <paragraph>Here a formula:</paragraph>

test/data/doc/constructed_doc.embedded.html.gt

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@
 <ul>
 <li>item 1 of sub list</li>
 <p>Here a code snippet:</p>
-<pre><code>print("Hello world")</code></pre>
+<pre><code><p>Hello world</p></code></pre>
 <p>(to be displayed inline)</p>
 </ul>
 <p>Here a formula:</p>

test/data/doc/constructed_doc.embedded.json.gt

Lines changed: 2 additions & 2 deletions
@@ -636,8 +636,8 @@
 "content_layer": "body",
 "label": "code",
 "prov": [],
-"orig": "print(\"Hello world\")",
-"text": "print(\"Hello world\")",
+"orig": "<p>Hello world</p>",
+"text": "<p>Hello world</p>",
 "captions": [],
 "references": [],
 "footnotes": [],

test/data/doc/constructed_doc.embedded.md.gt

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ This is the caption of figure 2.
 - item 1 of neighboring list
 - item 2 of neighboring list
 - item 1 of sub list
-- Here a code snippet: `print("Hello world")` (to be displayed inline)
+- Here a code snippet: `<p>Hello world</p>` (to be displayed inline)
 - Here a formula: $E=mc^2$ (to be displayed inline)
 
 Here a code block:

test/data/doc/constructed_doc.embedded.yaml.gt

Lines changed: 2 additions & 2 deletions
@@ -676,13 +676,13 @@ texts:
   content_layer: body
   footnotes: []
   label: code
-  orig: print("Hello world")
+  orig: <p>Hello world</p>
   parent:
     $ref: '#/groups/10'
   prov: []
   references: []
   self_ref: '#/texts/24'
-  text: print("Hello world")
+  text: <p>Hello world</p>
 - children: []
   content_layer: body
   label: paragraph

test/data/doc/constructed_doc.placeholder.html.gt

Lines changed: 1 addition & 1 deletion
@@ -108,7 +108,7 @@
 <ul>
 <li>item 1 of sub list</li>
 <p>Here a code snippet:</p>
-<pre><code>print("Hello world")</code></pre>
+<pre><code><p>Hello world</p></code></pre>
 <p>(to be displayed inline)</p>
 </ul>
 <p>Here a formula:</p>

test/data/doc/constructed_doc.placeholder.md.gt

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ This is the caption of figure 2.
 - item 1 of neighboring list
 - item 2 of neighboring list
 - item 1 of sub list
-- Here a code snippet: `print("Hello world")` (to be displayed inline)
+- Here a code snippet: `<p>Hello world</p>` (to be displayed inline)
 - Here a formula: $E=mc^2$ (to be displayed inline)
 
 Here a code block:
