fix(docx): split multiple OMML equations into separate formula items #3123
giulio-leone wants to merge 6 commits into docling-project:main
|
❌ DCO Check Failed

Hi @giulio-leone, your pull request has failed the Developer Certificate of Origin (DCO) check. This repository supports remediation commits, so you can fix this without rewriting history, but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

    git commit --allow-empty -s -m "DCO Remediation Commit for giulio-leone <giulio97.leone@gmail.com>
    I, giulio-leone <giulio97.leone@gmail.com>, hereby add my Signed-off-by to this commit: 7f3faed3d82a6d37787d4a27ba129f077a777ba5"
    git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

    git commit --amend --signoff
    git push --force-with-lease

For multiple commits:

    git rebase --signoff origin/main
    git push --force-with-lease

More info: DCO check report
|
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers

🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Force-pushed from 20430c3 to f98d86f
|
@giulio-leone can you please add the document attached to the linked issue as a test? |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
@dolfim-ibm Done — added The test document validates that the fix correctly splits the equations into two separate |
|
Pushed a follow-up CI fix. The new fixture itself was fine, but I had generated the |
|
Thanks for putting this together. The fix direction looks right, but I don’t think the new fixture is covering the exact failing shape from #3121. The issue is specifically about a single paragraph made up only of sibling I’d suggest adding a regression fixture that matches the issue attachment more directly: one paragraph, multiple sibling PSA: I am new to this codebase so I could be wrong, in which case please feel free to discard this comment. |
Force-pushed from 29eab68 to f0c772c
|
Hi @dolfim-ibm, @M-Hassan-Raza — thanks for the detailed feedback! I'll add the test document from the linked issue and update the fixture to properly cover the exact failing shape (sibling |
|
Thanks @dolfim-ibm @M-Hassan-Raza for the feedback! I've now:
The conversion correctly produces three separate equation blocks: Ready for re-review! |
When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. separate equations on one line), the converter previously concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first. Fix: iterate direct children of the paragraph element first to correctly identify sibling <m:oMath> elements, converting each independently. Falls back to deep iteration only when oMath elements are nested inside wrapper elements. Also splits standalone multi-equation paragraphs into individual FORMULA document items instead of merging them into one. Closes docling-project#3121 Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Add a minimal DOCX file containing two separate oMath elements in one paragraph with a text separator, along with groundtruth output files for markdown, json, and plain text export. Requested-by: @dolfim-ibm Signed-off-by: Giulio Leone <giulioleone10@gmail.com> Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Signed-off-by: giulio-leone <giulio.leone@users.noreply.github.com> Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Use the real Word document from the issue reporter (smroels) instead of the minimal programmatic fixture. The new document contains three sibling <m:oMath> elements in one paragraph, matching the exact failing shape described in docling-project#3121. Regenerate groundtruth to match the richer document structure. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Re-run document conversion with current code to update .itxt and .json groundtruth files. The .itxt had stale structure from the previous programmatic fixture; the new real-document conversion produces the correct output with three separate formula items. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Force-pushed from 2375409 to 277b980
|
Hi team! 👋 The mergify bot indicates this PR requires two reviewers for test updates. Could a second reviewer (@PeterStaar-IBM, @cau-git, or @ceberam) take a look when convenient? The DCO check is now passing and all groundtruth files have been regenerated. Thank you! |
|
@giulio-leone Thanks for taking care of the feedback. Could you please re-run your pre-commit toolchain to ensure the tests pass? |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
I reran the requested formatting/tooling pass and pushed the result. Local verification
Real DOCX proof on the issue fixtureI also re-ran the actual conversion on
The markdown export matches that behavior as well:
The only new commit on the branch is the formatter rerun requested by CI / review:
|
Summary

When a DOCX paragraph contains multiple sibling <m:oMath> elements (e.g. two separate equations on one line), the converter concatenated them into a single LaTeX string because element.iter() walks all descendants depth-first, mixing children from different oMath nodes.

Root Cause

_handle_equations_in_text() used element.iter() (deep iteration) to collect both text runs and math elements. With multiple sibling <m:oMath> elements, iter() would visit the children of the first oMath AND the second oMath and its children, all interleaved. The result was a single concatenated equation string.

Fix

Direct-children-first iteration: Check for oMath elements at the direct child level. If found, iterate direct children only, converting each oMath sibling independently. Falls back to the original deep iteration when oMath elements are nested inside wrapper elements like oMathPara.

Split standalone multi-equation paragraphs: When a paragraph contains only equations (no surrounding text) and has more than one equation, each is now emitted as a separate FORMULA document item instead of merging into one.

Before / After

Before: A paragraph with equations E = mc^2 and F = ma produced:

After: Two separate items:

Closes #3121
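
The difference between the deep walk and the direct-children-first walk can be shown with a small standalone sketch. This is not the code from the PR: it uses the standard library's xml.etree.ElementTree rather than the converter's own XML handling, the helper names (math_text, equations_deep, equations_direct_first) are hypothetical, and joining the m:t text is only a stand-in for the real OMML-to-LaTeX conversion.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Word for math (OMML) and paragraph markup.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

OMATH = f"{{{M_NS}}}oMath"   # fully qualified tag of <m:oMath>
M_TEXT = f"{{{M_NS}}}t"      # fully qualified tag of <m:t>

# Hypothetical paragraph with two sibling equations, as in the issue shape.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">
  <m:oMath><m:r><m:t>E=mc^2</m:t></m:r></m:oMath>
  <m:oMath><m:r><m:t>F=ma</m:t></m:r></m:oMath>
</w:p>
""".strip()


def math_text(node):
    """Join all m:t text under node (stand-in for OMML-to-LaTeX conversion)."""
    return "".join(t.text or "" for t in node.iter(M_TEXT))


def equations_deep(paragraph):
    """Simplified old shape: one deep walk, so sibling equations merge."""
    return [math_text(paragraph)]


def equations_direct_first(paragraph):
    """Simplified new shape: prefer direct oMath children, else search deeply."""
    direct = [child for child in paragraph if child.tag == OMATH]
    nodes = direct if direct else list(paragraph.iter(OMATH))
    return [math_text(node) for node in nodes]


paragraph = ET.fromstring(PARAGRAPH_XML)
print(equations_deep(paragraph))          # ['E=mc^2F=ma']
print(equations_direct_first(paragraph))  # ['E=mc^2', 'F=ma']
```

Each entry returned by equations_direct_first corresponds to one equation, which is the granularity the PR description says is now emitted as a separate FORMULA item. The deep fallback still finds an oMath nested inside a wrapper such as oMathPara.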