fix(docx): Missing list items after numbered header (#2665)#2678
fix(docx): Missing list items after numbered header (#2665)#2678emreclsr wants to merge 6 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @emreclsr, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
DescriptionThis PR improves test coverage for the list-after-heading fix by adding test cases for enumerated (numbered) lists following headings, achieving 100% code coverage for the related changes. ChangesTest Data
Test Code
Groundtruth Files
|
|
I've tried this fix on one .docx file it falls with sigh |
|
@mkhludnev Thanks for testing could you share the .docx file (or a minimal reproducer) that triggers this KeyError? I'd like to debug and fix it. |
|
@ceberam I've checked out recently launch script failure |
|
Thanks @mkhludnev . It seems we hit another edge case. We should apply some safeguards to avoid the type of error you shared, in any case. Can you share the |
|
I noticed that the problematic item in Writer numbered as |
1b1e05d to
8154490
Compare
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
To help demonstrate and validate the reported issue, I prepared a test Word document specifically designed to expose numbering edge cases with interleaved list sequences. Results comparison:
@mkhludnev Could you re-test with your document to confirm the issue is resolved on your end as well? @ceberam If everything looks good, I'd appreciate your review and approval when you get a chance — happy to make any adjustments if needed. |
|
@emreclsr the failure I've posted above is fixed with your last commit. Thanks a lot. FWIW, list items numbers are still weird, but it's rather odd doc, I'm not even sure is it worth to bother about correct numbering and fix'em. Summary
Please proceed with #2678 Thank you so much, collegues! |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
injected-problem.docx at main, but converted with only one warning with this patch. I'm sure this doc is absolutely weird. Just opening it in Libre office Writer make convertible for main branch version. Also, after I replaced text nodes in document xml, it seems like numbered items have lost. So, it's not the best reproducer, but it let to check key error exception in main branch. |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
1 similar comment
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
3a08432 to
bcf17c2
Compare
|
@emreclsr could you please resolve the current conflict? Sorry about that, since this already happened while waiting for the review, but we'll try to address this PR ASAP. |
bcf17c2 to
fe7ee9d
Compare
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
@ceberam I've resolved the conflict and rebased the branch onto the latest main. If everything looks good, I'd appreciate it if we could get this merged. Happy to make any adjustments if needed. |
ceberam
left a comment
There was a problem hiding this comment.
Thanks @emreclsr for this new iteration and for discovering another issue.
I have checked the results of the new implementation on the test case and I saw a potential inconsistency. I highlighted it on the .itxt file because it is easier to explain, but they can be found in the Docling document serialization (JSON file). Could you please check?
Plus a very minor style comment.
| item-56 at level 6: list_item: Section A.1 | ||
| item-57 at level 6: list: group list | ||
| item-58 at level 7: list_item: Detail A.1.1 | ||
| item-59 at level 6: list: group list |
There was a problem hiding this comment.
Not sure if we should open a new group list. The item Hardware Constraints – Egde Case appears at the same level as Detail A.1.1, even though the marker goes to 2.3.1 from 1.1.1.
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
…hical markers
Fixes incorrect numbering and missing items in DOCX documents that use
multiple interleaved numbering sequences (numIds).
Changes:
* Reset sub-level counters in _get_list_counter when a parent level
advances, preventing counter bleed-across (e.g. "4. Functional
Requirements" now correctly renders as "1. Functional Requirements")
* Add _build_enum_marker helper to produce hierarchical markers in
"1.2.3." format instead of flat single-level counters
* Fix anchor-based level calculation in new-sequence branch: use
level_at_new_list + ilevel instead of _get_level() to correctly
place items from a different numId at the right document level
* Only set level_at_new_list in the else case (when None) to avoid
corrupting the anchor when switching between interleaved numIds
* Remove _reset_list_counters_for_new_sequence from new-sequence branch
so that returning to a previously seen numId continues its counter
(e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
…e reset helpers Adds test_list_counter_and_enum_marker covering helper methods introduced in the list numbering fix: counter increment, sub-level reset on parent advance, hierarchical marker building, and selective sequence reset. Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
When Word creates a new numbering definition that continues from a
previous list, it embeds start values in the abstractNum XML instead of
reusing the same numId. Docling previously ignored these start values
and always initialized counters from 1, producing incorrect markers
like "1.1.1." instead of "2.3.1.".
Changes:
* Add _get_level_element helper to extract level XML from abstractNum,
eliminating duplicated XML traversal in _is_numbered_list
* Add _get_start_value to read w:start from the numbering definition
* Initialize counters in _get_list_counter using start values
* Use start values as fallback in _build_enum_marker for parent levels
that have not been explicitly incremented
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
…ered
Extends the existing test document with an Appendix section that uses a
different numId, followed by list items that resume the original
numbering sequence with Word-embedded start values (e.g. 2.3.1.).
Updates groundtruth files accordingly.
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
b49df92 to
0777f36
Compare
Description
This commit fixes an issue where list items immediately following headings (especially numbered headings) were not being processed correctly in Word documents.
While working with Word document conversion, I encountered a similar issue to #2665 where list items after headings were missing from the output. This PR addresses the root cause of this problem.
Changes
elseblock with explicitelifcondition for new list sequencesTesting
Added test case
test_list_items_after_numbered_headingto verify:All existing tests continue to pass. No groundtruth files were modified, indicating the fix addresses the bug without breaking existing functionality.
Related Issues
This fix may also resolve #2665 as it addresses the same underlying problem.
Checklist
test_list_items_after_numbered_heading