
fix(docx): split multiple OMML equations into separate formula items#3123

Open
giulio-leone wants to merge 6 commits into docling-project:main from giulio-leone:fix/omml-multi-equation-paragraph


@giulio-leone
Contributor

Summary

When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. two separate equations on one line), the converter concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first, mixing children from different oMath nodes.

Root Cause

_handle_equations_in_text() used element.iter() (deep iteration) to collect both text runs and math elements. With multiple sibling <m:oMath> elements:

<w:p>
  <m:oMath>…</m:oMath>
  <m:oMath>…</m:oMath>
</w:p>

iter() would visit the children of the first oMath AND the second oMath and its children — all interleaved. The result was a single concatenated equation string.
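To illustrate the deep-iteration behaviour described above, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than the backend's own element classes. The paragraph XML, the `local` helper, and joining the `m:t` text as a stand-in for the LaTeX conversion are assumptions made for the example. It only shows that `iter()` flattens the descendants of both siblings into one sequence; it does not reproduce the scrambled ordering reported in the issue.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML and Office Math namespaces.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Hypothetical paragraph with two sibling equations and no text between them.
PARAGRAPH = f"""
<w:p xmlns:w="{W}" xmlns:m="{M}">
  <m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>
  <m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>
</w:p>
""".strip()


def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


para = ET.fromstring(PARAGRAPH)

# iter() yields the paragraph and every descendant in document order,
# so nothing marks where the first equation ends and the second begins.
deep_tags = [local(el.tag) for el in para.iter()]

# Collecting math text through the same deep walk merges both equations.
deep_text = "".join(el.text for el in para.iter(f"{{{M}}}t"))

print(deep_tags)  # ['p', 'oMath', 'r', 't', 'oMath', 'r', 't']
print(deep_text)  # E=mc^2F=ma
```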

Fix

  1. Direct-children-first iteration: Check for oMath elements at the direct child level. If found, iterate direct children only, converting each oMath sibling independently. Falls back to the original deep iteration when oMath elements are nested inside wrapper elements like oMathPara.

  2. Split standalone multi-equation paragraphs: When a paragraph contains only equations (no surrounding text) and has more than one equation, each is now emitted as a separate FORMULA document item instead of merging into one.
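The two steps above can be sketched roughly as follows. This is not the code from the PR: `split_equations`, `equation_text`, and `parse_paragraph` are hypothetical names, the standard library's `xml.etree.ElementTree` stands in for the backend's element classes, and joining the `m:t` text stands in for the real OMML-to-LaTeX conversion. The fallback branch is also simplified to a deep search for `oMath` elements.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M = "http://schemas.openxmlformats.org/officeDocument/2006/math"
OMATH = f"{{{M}}}oMath"
MATH_TEXT = f"{{{M}}}t"


def equation_text(omath):
    """Hypothetical stand-in for OMML-to-LaTeX conversion: join the m:t text."""
    return "".join(t.text or "" for t in omath.iter(MATH_TEXT))


def split_equations(paragraph):
    """Return one string per equation in the paragraph."""
    # Step 1: look for oMath elements among the direct children only.
    direct = [child for child in paragraph if child.tag == OMATH]
    if direct:
        # Each sibling is converted on its own, so equations never merge.
        return [equation_text(m) for m in direct]
    # Fallback (simplified): oMath is nested in a wrapper such as oMathPara,
    # so search the whole subtree instead.
    return [equation_text(m) for m in paragraph.iter(OMATH)]


def parse_paragraph(inner_xml):
    """Wrap the given XML in a w:p element with both namespaces declared."""
    return ET.fromstring(f'<w:p xmlns:w="{W}" xmlns:m="{M}">{inner_xml}</w:p>')


siblings = parse_paragraph(
    "<m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>"
    "<m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>"
)
wrapped = parse_paragraph(
    "<m:oMathPara><m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath></m:oMathPara>"
)

print(split_equations(siblings))  # ['E=mc^2', 'F=ma']
print(split_equations(wrapped))   # ['a=b']
```

Returning a list rather than a joined string is what lets the caller emit one FORMULA item per equation when the paragraph holds nothing but equations.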

Before / After

Before: A paragraph with equations E = mc^2 and F = ma produced:

FORMULA: "E = mc^2 F = ma"

After: Two separate items:

FORMULA: "E = mc^2"
FORMULA: "F = ma"

Closes #3121

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

DCO Check Failed

Hi @giulio-leone, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.


🛠 Quick Fix: Add a remediation commit

Run this command:

git commit --allow-empty -s -m "DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>

I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 7f3faed3d82a6d37787d4a27ba129f077a777ba5"
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

@mergify

mergify bot commented Mar 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
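As a rough local check, the title rule above can be tried with Python's `re` module. This sketch assumes the `~=` condition behaves like a regular-expression search on the title; `TITLE_RULE` and `follows_conventional_commits` are names made up for the example.

```python
import re

# Pattern copied from the rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commits(title):
    """Return True when the title starts with a type, optional scope, optional '!', and ':'."""
    return TITLE_RULE.search(title) is not None


current_title = "fix(docx): split multiple OMML equations into separate formula items"
earlier_title = "fix(msword): split multiple OMML equations into separate formula items"

print(follows_conventional_commits(current_title))  # True
print(follows_conventional_commits(earlier_title))  # True
print(follows_conventional_commits("Split multiple OMML equations"))  # False
```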

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 20430c3 to f98d86f Compare March 13, 2026 05:31
@dolfim-ibm
Member

@giulio-leone can you please add the document attached to the linked issue as a test?

@codecov

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@giulio-leone
Contributor Author

@dolfim-ibm Done — added tests/data/docx/omml_multi_equation_paragraph.docx (a minimal DOCX with two sibling oMath elements separated by a text run) along with matching groundtruth files (md, json, itxt).

The test document validates that the fix correctly splits the equations into two separate FormulaItem entries instead of concatenating them.

@giulio-leone
Contributor Author

Pushed a follow-up CI fix. The new fixture itself was fine, but I had generated the .itxt snapshot with the wrong exporter. I regenerated omml_multi_equation_paragraph.docx.itxt using the same _export_to_indented_text(max_text_len=70, explicit_tables=False) path that test_backend_msword.py actually validates.

@M-Hassan-Raza
Contributor

Thanks for putting this together. The fix direction looks right, but I don’t think the new fixture is covering the exact failing shape from #3121.

The issue is specifically about a single paragraph made up only of sibling m:oMath elements, with no text runs between them. On current main, that case still collapses into one display block and the equation order gets scrambled. The fixture added here looks more like formula-text-formula inline content, which current main already seems to handle correctly.

I’d suggest adding a regression fixture that matches the issue attachment more directly: one paragraph, multiple sibling m:oMath nodes, no intervening text. That would make it much clearer that this PR is locking down the reported bug and not a nearby case.

PSA: I am new to this codebase so I could be wrong, in which case please feel free to discard this comment.

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 29eab68 to f0c772c Compare March 15, 2026 16:18
@giulio-leone
Contributor Author

Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the detailed feedback! I'll add the test document from the linked issue and update the fixture to properly cover the exact failing shape (sibling m:oMath elements with no text runs between them). Will push an update shortly.

@giulio-leone
Contributor Author

Thanks @dolfim-ibm @M-Hassan-Raza for the feedback!

I've now:

  1. Replaced the test document with the real Word file from issue #3121, "Multiple OMML equations in one paragraph concatenated into a single display block" (the ~37 KB document from @smroels containing three sibling <m:oMath> elements in one paragraph)
  2. Regenerated all groundtruth files for the new document

The conversion correctly produces three separate equation blocks:

$$a=b$$
$$c=d$$
$$e=f$$

Ready for re-review!

giulio-leone and others added 5 commits March 15, 2026 21:39
When a DOCX paragraph contains multiple sibling <m:oMath> elements
(e.g. separate equations on one line), the converter previously
concatenated them into a single LaTeX string because element.iter()
walks all descendants depth-first.

Fix: iterate direct children of the paragraph element first to
correctly identify sibling <m:oMath> elements, converting each
independently. Falls back to deep iteration only when oMath
elements are nested inside wrapper elements.

Also splits standalone multi-equation paragraphs into individual
FORMULA document items instead of merging them into one.

Closes docling-project#3121

Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Add a minimal DOCX file containing two separate oMath elements
in one paragraph with a text separator, along with groundtruth
output files for markdown, json, and plain text export.

Requested-by: @dolfim-ibm
Signed-off-by: Giulio Leone <giulioleone10@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com>
Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Use the real Word document from the issue reporter (smroels)
instead of the minimal programmatic fixture. The new document
contains three sibling <m:oMath> elements in one paragraph,
matching the exact failing shape described in docling-project#3121.

Regenerate groundtruth to match the richer document structure.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

Re-run document conversion with current code to update .itxt and .json
groundtruth files. The .itxt had stale structure from the previous
programmatic fixture; the new real-document conversion produces the
correct output with three separate formula items.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>

@giulio-leone giulio-leone force-pushed the fix/omml-multi-equation-paragraph branch from 2375409 to 277b980 Compare March 15, 2026 20:43
@giulio-leone
Contributor Author

Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you!

@cau-git cau-git changed the title fix(msword): split multiple OMML equations into separate formula items fix(docx): split multiple OMML equations into separate formula items Mar 16, 2026
@cau-git
Member

cau-git commented Mar 17, 2026

@giulio-leone Thanks for taking care of the feedback. Could you please re-run your pre-commit toolchain to ensure the tests pass?

uv run pre-commit install # only once in your dev setup
uv run pre-commit run --all-files # you can make a new commit and it will do this for you automatically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@giulio-leone
Contributor Author

I reran the requested formatting/tooling pass and pushed the result.

Local verification

  • uv run --python 3.12 pre-commit run --all-files
  • uv run --python 3.12 pytest tests/test_backend_msword.py -q ✅ (17 passed, 1 xfailed, 1 xpassed)

Real DOCX proof on the issue fixture

I also re-ran the actual conversion on tests/data/docx/omml_multi_equation_paragraph.docx and compared origin/main against this PR branch.

  • origin/main produced 1 concatenated formula item: c=de=fa=b
  • this PR branch produced 3 separate formula items: a=b, c=d, e=f

The markdown export matches that behavior as well:

  • origin/main => one $$c=de=fa=b$$ block
  • this PR => three separate formula blocks

The only new commit on the branch is the formatter rerun requested by CI / review:

  • 7f3faed style(docx): rerun ruff formatter for msword backend

Development

Successfully merging this pull request may close these issues.

Multiple OMML equations in one paragraph concatenated into a single display block

4 participants