
Commit 90d6dd4

giulio-leone and Copilot authored
fix(docx): split multiple OMML equations into separate formula items (#3123)
* fix(msword): split multiple OMML equations into separate formula items

  When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. separate equations on one line), the converter previously concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first.

  Fix: iterate direct children of the paragraph element first to correctly identify sibling <m:oMath> elements, converting each independently. Falls back to deep iteration only when oMath elements are nested inside wrapper elements.

  Also splits standalone multi-equation paragraphs into individual FORMULA document items instead of merging them into one.

  Closes #3121

  Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): add multi-equation paragraph test document

  Add a minimal DOCX file containing two separate oMath elements in one paragraph with a text separator, along with groundtruth output files for markdown, json, and plain text export.

  Requested-by: @dolfim-ibm
  Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
  Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(msword): regenerate multi-equation indented-text snapshot

  Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: replace test doc with issue #3121 attachment

  Use the real Word document from the issue reporter (smroels) instead of the minimal programmatic fixture. The new document contains three sibling <m:oMath> elements in one paragraph, matching the exact failing shape described in #3121. Regenerate groundtruth to match the richer document structure.

  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test: regenerate groundtruth for omml_multi_equation_paragraph

  Re-run document conversion with current code to update .itxt and .json groundtruth files. The .itxt had stale structure from the previous programmatic fixture; the new real-document conversion produces the correct output with three separate formula items.

  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* style(docx): rerun ruff formatter for msword backend

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(docx): drop unused tag_name binding

  Remove the unused local in the direct oMath iteration path so the code reads clearly and the outstanding review comment is fully addressed without changing equation-handling behavior.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

  I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 84cc70b

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): cover equation paragraph branches

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

* test(docx): reuse backend fixture in msword tests

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

---------

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
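
The direct-children-first strategy described in the commit message can be illustrated with a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than lxml, and `math_text` is a hypothetical stand-in for `oMath2Latex` that only joins `<m:t>` text. The names `is_omath` and `collect_equations` are likewise illustrative and are not part of the backend.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and Office Math namespace URIs
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Two sibling <m:oMath> elements separated by a text run
SIBLINGS = ET.fromstring(f"""
<w:p xmlns:w="{W}" xmlns:m="{M}">
  <m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>
  <w:r><w:t>and</w:t></w:r>
  <m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>
</w:p>
""".strip())

# A single <m:oMath> nested inside an <m:oMathPara> wrapper
WRAPPED = ET.fromstring(f"""
<w:p xmlns:w="{W}" xmlns:m="{M}">
  <m:oMathPara><m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath></m:oMathPara>
</w:p>
""".strip())


def is_omath(el):
    # Same substring test as the diff: match oMath but not oMathPara
    return "oMath" in el.tag and "oMathPara" not in el.tag


def math_text(el):
    # Hypothetical stand-in for oMath2Latex: join all <m:t> descendant text
    return "".join(t.text or "" for t in el.iter(f"{{{M}}}t"))


def collect_equations(paragraph):
    # Prefer direct children; fall back to deep iteration only when
    # no oMath element sits directly under the paragraph.
    direct = [child for child in paragraph if is_omath(child)]
    candidates = direct or [el for el in paragraph.iter() if is_omath(el)]
    return [math_text(el) for el in candidates]


print(collect_equations(SIBLINGS))
print(collect_equations(WRAPPED))
```

In the sibling case each equation is returned as its own entry in document order; in the wrapped case the direct scan finds nothing, so the deep-iteration fallback is what picks up the nested equation.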
1 parent fdf5e20 commit 90d6dd4

6 files changed: +352 -23 lines changed


docling/backend/msword_backend.py

Lines changed: 70 additions & 23 deletions
@@ -906,21 +906,55 @@ def _handle_equations_in_text(self, element, text):
         only_texts = []
         only_equations = []
         texts_and_equations = []
-        for subt in element.iter():
-            tag_name = etree.QName(subt).localname
-            if tag_name == "t" and "math" not in subt.tag:
-                if isinstance(subt.text, str):
-                    only_texts.append(subt.text)
-                    texts_and_equations.append(subt.text)
-            elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
-                latex_equation = str(oMath2Latex(subt)).strip()
-                if len(latex_equation) > 0:
-                    only_equations.append(
-                        self.equation_bookends.format(EQ=latex_equation)
-                    )
-                    texts_and_equations.append(
-                        self.equation_bookends.format(EQ=latex_equation)
-                    )
+
+        # Collect oMath elements and text runs from the paragraph.
+        # Use direct children iteration first; fall back to deep iteration
+        # only if no oMath elements are found at the direct level.
+        direct_omaths = [
+            child
+            for child in element
+            if "oMath" in child.tag and "oMathPara" not in child.tag
+        ]
+
+        if direct_omaths:
+            # Iterate direct children to preserve sibling order and avoid
+            # processing nested oMath descendants of an already-converted node.
+            for child in element:
+                if "oMath" in child.tag and "oMathPara" not in child.tag:
+                    latex_equation = str(oMath2Latex(child)).strip()
+                    if len(latex_equation) > 0:
+                        only_equations.append(
+                            self.equation_bookends.format(EQ=latex_equation)
+                        )
+                        texts_and_equations.append(
+                            self.equation_bookends.format(EQ=latex_equation)
+                        )
+                else:
+                    # Collect text from non-math children (e.g. <w:r> runs)
+                    for t_elem in child.iter():
+                        t_tag = etree.QName(t_elem).localname
+                        if t_tag == "t" and "math" not in t_elem.tag:
+                            if isinstance(t_elem.text, str):
+                                only_texts.append(t_elem.text)
+                                texts_and_equations.append(t_elem.text)
+        else:
+            # Original deep-iteration fallback for nested oMath (e.g.
+            # inside oMathPara or other wrapper elements).
+            for subt in element.iter():
+                tag_name = etree.QName(subt).localname
+                if tag_name == "t" and "math" not in subt.tag:
+                    if isinstance(subt.text, str):
+                        only_texts.append(subt.text)
+                        texts_and_equations.append(subt.text)
+                elif "oMath" in subt.tag and "oMathPara" not in subt.tag:
+                    latex_equation = str(oMath2Latex(subt)).strip()
+                    if len(latex_equation) > 0:
+                        only_equations.append(
+                            self.equation_bookends.format(EQ=latex_equation)
+                        )
+                        texts_and_equations.append(
+                            self.equation_bookends.format(EQ=latex_equation)
+                        )
 
         if len(only_equations) < 1:
             return text, []
@@ -1055,15 +1089,28 @@ def _handle_text_elements(
             if (paragraph.text is None or len(paragraph.text.strip()) == 0) and len(
                 text
             ) > 0:
-                # Standalone equation
+                # Standalone equation(s) — emit each as a separate formula
                 level = self._get_level()
-                t1 = doc.add_text(
-                    label=DocItemLabel.FORMULA,
-                    parent=self.parents[level - 1],
-                    text=text.replace("<eq>", "").replace("</eq>", ""),
-                    content_layer=self.content_layer,
-                )
-                elem_ref.append(t1.get_ref())
+                parent = self.parents[level - 1]
+                if len(equations) > 1:
+                    for eq in equations:
+                        eq_text = eq.replace("<eq>", "").replace("</eq>", "").strip()
+                        if len(eq_text) > 0:
+                            t1 = doc.add_text(
+                                label=DocItemLabel.FORMULA,
+                                parent=parent,
+                                text=eq_text,
+                                content_layer=self.content_layer,
+                            )
+                            elem_ref.append(t1.get_ref())
+                else:
+                    t1 = doc.add_text(
+                        label=DocItemLabel.FORMULA,
+                        parent=parent,
+                        text=text.replace("<eq>", "").replace("</eq>", ""),
+                        content_layer=self.content_layer,
+                    )
+                    elem_ref.append(t1.get_ref())
             else:
                 # Inline equation
                 level = self._get_level()
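
The standalone-equation branch in the second hunk can be read in isolation as a small pure function. The sketch below mirrors its logic under the assumption that equations are wrapped in `<eq>`/`</eq>` bookends, as the `replace` calls in the diff suggest. `split_standalone_equations` is a hypothetical helper name, and it returns plain strings instead of calling `doc.add_text`.

```python
def strip_bookends(s):
    # Remove the <eq>...</eq> markers that wrap each converted equation
    return s.replace("<eq>", "").replace("</eq>", "")


def split_standalone_equations(text, equations):
    """Return the formula strings a standalone equation paragraph would emit.

    text      -- paragraph text with bookended equations substituted in
    equations -- the individual bookended equation strings
    """
    if len(equations) > 1:
        # Several equations: one formula per non-empty equation
        stripped = (strip_bookends(eq).strip() for eq in equations)
        return [eq for eq in stripped if len(eq) > 0]
    # Zero or one equation: keep the previous single-item behavior
    return [strip_bookends(text)]


print(split_standalone_equations(
    "<eq>a=b</eq><eq>c=d</eq><eq>e=f</eq>",
    ["<eq>a=b</eq>", "<eq>c=d</eq>", "<eq>e=f</eq>"],
))
```

Keeping the single-equation path unchanged means existing groundtruth for one-equation paragraphs should not shift; only paragraphs with more than one equation produce additional items.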
36 KB
Binary file not shown.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+item-0 at level 0: unspecified: group _root_
+  item-1 at level 1: section: group header-0
+    item-2 at level 2: section_header: Issue 3: Concatenated equation blocks
+      item-3 at level 3: text: The paragraph below contains thr ... ts are siblings inside a single <w:p>.
+      item-4 at level 3: formula: a=b
+      item-5 at level 3: formula: c=d
+      item-6 at level 3: formula: e=f
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+{
+  "schema_name": "DoclingDocument",
+  "version": "1.9.0",
+  "name": "omml_multi_equation_paragraph",
+  "origin": {
+    "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+    "binary_hash": 17520448227351822398,
+    "filename": "omml_multi_equation_paragraph.docx"
+  },
+  "furniture": {
+    "self_ref": "#/furniture",
+    "children": [],
+    "content_layer": "furniture",
+    "name": "_root_",
+    "label": "unspecified"
+  },
+  "body": {
+    "self_ref": "#/body",
+    "children": [
+      {
+        "$ref": "#/groups/0"
+      }
+    ],
+    "content_layer": "body",
+    "name": "_root_",
+    "label": "unspecified"
+  },
+  "groups": [
+    {
+      "self_ref": "#/groups/0",
+      "parent": {
+        "$ref": "#/body"
+      },
+      "children": [
+        {
+          "$ref": "#/texts/0"
+        }
+      ],
+      "content_layer": "body",
+      "name": "header-0",
+      "label": "section"
+    }
+  ],
+  "texts": [
+    {
+      "self_ref": "#/texts/0",
+      "parent": {
+        "$ref": "#/groups/0"
+      },
+      "children": [
+        {
+          "$ref": "#/texts/1"
+        },
+        {
+          "$ref": "#/texts/2"
+        },
+        {
+          "$ref": "#/texts/3"
+        },
+        {
+          "$ref": "#/texts/4"
+        }
+      ],
+      "content_layer": "body",
+      "label": "section_header",
+      "prov": [],
+      "orig": "Issue 3: Concatenated equation blocks",
+      "text": "Issue 3: Concatenated equation blocks",
+      "level": 1
+    },
+    {
+      "self_ref": "#/texts/1",
+      "parent": {
+        "$ref": "#/texts/0"
+      },
+      "children": [],
+      "content_layer": "body",
+      "label": "text",
+      "prov": [],
+      "orig": "The paragraph below contains three separate <m:oMath> elements.\nExpected: three separate $$ blocks ($$a = b$$, $$c = d$$, $$e = f$$)\nDocling produces: one $$ block with all equations concatenated.\n\nAll three <m:oMath> elements are siblings inside a single <w:p>.",
+      "text": "The paragraph below contains three separate <m:oMath> elements.\nExpected: three separate $$ blocks ($$a = b$$, $$c = d$$, $$e = f$$)\nDocling produces: one $$ block with all equations concatenated.\n\nAll three <m:oMath> elements are siblings inside a single <w:p>.",
+      "formatting": {
+        "bold": false,
+        "italic": false,
+        "underline": false,
+        "strikethrough": false,
+        "script": "baseline"
+      }
+    },
+    {
+      "self_ref": "#/texts/2",
+      "parent": {
+        "$ref": "#/texts/0"
+      },
+      "children": [],
+      "content_layer": "body",
+      "label": "formula",
+      "prov": [],
+      "orig": "a=b",
+      "text": "a=b"
+    },
+    {
+      "self_ref": "#/texts/3",
+      "parent": {
+        "$ref": "#/texts/0"
+      },
+      "children": [],
+      "content_layer": "body",
+      "label": "formula",
+      "prov": [],
+      "orig": "c=d",
+      "text": "c=d"
+    },
+    {
+      "self_ref": "#/texts/4",
+      "parent": {
+        "$ref": "#/texts/0"
+      },
+      "children": [],
+      "content_layer": "body",
+      "label": "formula",
+      "prov": [],
+      "orig": "e=f",
+      "text": "e=f"
+    }
+  ],
+  "pictures": [],
+  "tables": [],
+  "key_value_items": [],
+  "form_items": [],
+  "pages": {}
+}
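
A groundtruth file of this shape can be checked for the expected split with a few lines of plain JSON handling. The sketch below is illustrative only: `formula_texts` is a hypothetical helper, and the embedded document is a trimmed copy of the `texts` array above rather than the full file.

```python
import json

# Trimmed copy of the groundtruth "texts" array; other fields omitted
GROUNDTRUTH = json.loads("""
{
  "texts": [
    {"self_ref": "#/texts/0", "label": "section_header",
     "text": "Issue 3: Concatenated equation blocks"},
    {"self_ref": "#/texts/2", "label": "formula", "text": "a=b"},
    {"self_ref": "#/texts/3", "label": "formula", "text": "c=d"},
    {"self_ref": "#/texts/4", "label": "formula", "text": "e=f"}
  ]
}
""")


def formula_texts(doc):
    # Collect the text of every item labelled "formula", in document order
    return [t["text"] for t in doc.get("texts", []) if t.get("label") == "formula"]


print(formula_texts(GROUNDTRUTH))
```

Before the fix, the same paragraph would have produced a single formula item, so a length check on this list is enough to tell the two behaviors apart.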
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+## Issue 3: Concatenated equation blocks
+
+The paragraph below contains three separate &lt;m:oMath&gt; elements.
+Expected: three separate $$ blocks ($$a = b$$, $$c = d$$, $$e = f$$)
+Docling produces: one $$ block with all equations concatenated.
+
+All three &lt;m:oMath&gt; elements are siblings inside a single &lt;w:p&gt;.
+
+$$a=b$$
+
+$$c=d$$
+
+$$e=f$$
