
fix(docx): Missing list items after numbered header (#2665) #2678

Open
emreclsr wants to merge 6 commits into docling-project:main from emreclsr:fix/msword-list-items-after-heading

Conversation

@emreclsr

Description

This commit fixes an issue where list items immediately following headings (especially numbered headings) were not being processed correctly in Word documents.

While working with Word document conversion, I encountered a similar issue to #2665 where list items after headings were missing from the output. This PR addresses the root cause of this problem.

Changes

  • Clear list history after headings to allow new lists to start
  • Reset level tracking when heading resets hierarchy
  • Reset level_at_new_list when a heading is added
  • Replace else block with explicit elif condition for new list sequences
    • Only continue existing list if parent is actually a ListGroup
    • Handle different numid or missing ListGroup parent by creating new list
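The control flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in msword_backend.py: the `ListState` class, its attribute names, and the string labels are hypothetical. Only the decision rule is taken from the description: continue a list only when the numid matches and the parent is a list group, otherwise start a new list, and forget list history at a heading.

```python
from typing import Optional


class ListState:
    """Hypothetical tracker for the 'continue vs. new list' decision."""

    def __init__(self) -> None:
        self.prev_numid: Optional[int] = None   # numid of the last list item seen
        self.parent_kind: Optional[str] = None  # kind of the current parent node

    def on_heading(self) -> None:
        # A heading clears the list history so that a new list can start.
        self.prev_numid = None
        self.parent_kind = "heading"

    def on_list_item(self, numid: int) -> str:
        # Continue only if the numid matches AND the parent is a list group.
        if self.prev_numid == numid and self.parent_kind == "list_group":
            action = "continue"
        else:
            # Different numid, or no list-group parent: open a new list.
            action = "new_list"
            self.parent_kind = "list_group"
        self.prev_numid = numid
        return action


state = ListState()
state.on_heading()
actions = [state.on_list_item(3), state.on_list_item(3), state.on_list_item(5)]
state.on_heading()
actions.append(state.on_list_item(5))
print(actions)
```

The last call shows the case this PR targets: the same numid as before, but a heading in between, so a new list is opened instead of the item being attached to a stale parent.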

Testing

Added test case test_list_items_after_numbered_heading to verify:

  • List items appear correctly after numbered headings
  • Heading has ListGroup as child
  • List group contains the expected list items
  • Document structure is properly maintained

All existing tests continue to pass. No groundtruth files were modified, indicating the fix addresses the bug without breaking existing functionality.

Related Issues

This fix may also resolve #2665 as it addresses the same underlying problem.

Checklist

  • Documentation has been updated, if necessary. (N/A - internal fix)
  • Examples have been added, if necessary. (N/A - bug fix)
  • Tests have been added, if necessary. Added test_list_items_after_numbered_heading

@github-actions
Contributor

github-actions bot commented Nov 25, 2025

DCO Check Passed

Thanks @emreclsr, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Nov 25, 2025

Related Documentation

Checked 4 published document(s) in 1 knowledge base(s). No updates required.


@mergify

mergify bot commented Nov 25, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
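For reference, the title condition above can be checked locally. This is a minimal sketch that assumes the rule behaves like a regular-expression search on the PR title (the leading `^` anchors it to the start). The helper name `is_conventional_title` is made up for illustration.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


pr_title = "fix(docx): Missing list items after numbered header (#2665)"
print(is_conventional_title(pr_title))
print(is_conventional_title("Missing list items after numbered header"))
```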

@ceberam added the labels bug (Something isn't working) and docx (issue related to docx backend) on Nov 26, 2025
@codecov

codecov bot commented Nov 26, 2025

Codecov Report

❌ Patch coverage is 88.40580% with 8 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/backend/msword_backend.py | 88.40% | 8 Missing ⚠️


@emreclsr
Author

Description

This PR improves test coverage for the list-after-heading fix by adding test cases for enumerated (numbered) lists following headings, achieving 100% code coverage for the related changes.

Changes

Test Data

  • Extended list_after_num_headers.docx with enumerated list test case
    • Added "Title 2" → "Sub Title 1" heading with numbered list items

Test Code

  • Enhanced test_list_items_after_numbered_heading to verify both:
    • Bullet lists after numbered headings (existing)
    • Enumerated lists after numbered headings (new)

Groundtruth Files

  • Updated .md, .itxt, and .json files to reflect new test content

@mkhludnev
Contributor

I tried this fix on one .docx file and it fails with:

    def _add_list_item(
        self,
        *,
        doc: DoclingDocument,
        numid: int,
        ilevel: int,
        elements: list,
        is_numbered: bool = False,
    ) -> list[RefItem]:
....
        ):  # Open indented list
            for i in range(
                self.level_at_new_list + prev_indent + 1,
                self.level_at_new_list + ilevel + 1,
            ):
                list_gr1 = doc.add_list_group(
                    name="list",
>                   parent=self.parents[i - 1],
                           ^^^^^^^^^^^^^^^^^^^
                    content_layer=self.content_layer,
                )
E               KeyError: 18

docling/backend/msword_backend.py:1272: KeyError

sigh

@emreclsr
Author

@mkhludnev Thanks for testing. Could you share the .docx file (or a minimal reproducer) that triggers this KeyError? I'd like to debug and fix it.

@ceberam
Member

ceberam commented Mar 6, 2026

@emreclsr we recently merged PR #3070, which may have fixed the issue described in this conversation. Could you please try the current version of main and see whether the issues are gone?

@mkhludnev
Contributor

@ceberam
I went back to my problematic docx.

(.venv)(base) ~/git/docling git:[main]
git log
commit b7815658d12390343669869a5065e80e5038369f (HEAD -> main, tag: v2.77.0, origin/main, origin/HEAD)
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Date:   Fri Mar 6 13:45:28 2026 +0000

    chore: bump version to 2.77.0 [skip ci]

I checked out the latest main just now.

uv pip install .
Resolved 112 packages in 607ms

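launch script

from docling.document_converter import DocumentConverter

source = "problem.docx"  # placeholder path; the actual document was not shared
converter = DocumentConverter()
result = converter.convert(source)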

failure

Parent element of the list item is not a ListGroup. The list item will be ignored.
Parent element of the list item is not a ListGroup. The list item will be ignored.
Parent element of the list item is not a ListGroup. The list item will be ignored.
Parent element of the list item is not a ListGroup. The list item will be ignored.
Parent element of the list item is not a ListGroup. The list item will be ignored.
Traceback (most recent call last):
  File "/home/git/docling/docling/pipeline/base_pipeline.py", line 74, in execute
    conv_res = self._build_document(conv_res)
  File "/home/git/docling/docling/pipeline/simple_pipeline.py", line 40, in _build_document
    conv_res.document = conv_res.input._backend.convert()
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/git/docling/docling/backend/msword_backend.py", line 168, in convert
    doc, _ = self._walk_linear(self.docx_obj.element.body, doc)
             ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/git/docling/docling/backend/msword_backend.py", line 362, in _walk_linear
    te = self._handle_text_elements(element, doc)
  File "/home/git/docling/docling/backend/msword_backend.py", line 982, in _handle_text_elements
    li = self._add_list_item(
        doc=doc,
    ...<3 lines>...
        is_numbered=is_numbered,
    )
  File "/home/git/docling/docling/backend/msword_backend.py", line 1318, in _add_list_item
    parent=self.parents[i - 1],
           ~~~~~~~~~~~~^^^^^^^
KeyError: 18

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/git/docling/docling/launch.py", line 8, in <module>
    result = converter.convert(source)
  File "/home/git/docling/.venv/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py", line 39, in wrapper_function
    return wrapper(*args, **kwargs)
  File "/home/git/docling/.venv/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py", line 136, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "/home/git/docling/docling/document_converter.py", line 388, in convert
    return next(all_res)
  File "/home/git/docling/docling/document_converter.py", line 447, in convert_all
    for conv_res in conv_res_iter:
                    ^^^^^^^^^^^^^
  File "/home/git/docling/docling/document_converter.py", line 564, in _convert
    for item in map(
                ~~~^
        process_func,
        ^^^^^^^^^^^^^
        input_batch,
        ^^^^^^^^^^^^
    ):
    ^
  File "/home/git/docling/docling/document_converter.py", line 611, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "/home/git/docling/docling/document_converter.py", line 634, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
  File "/home/git/docling/docling/pipeline/base_pipeline.py", line 89, in execute
    raise RuntimeError(f"Pipeline {self.__class__.__name__} failed") from e
RuntimeError: Pipeline SimplePipeline failed

Process finished with exit code 1

@ceberam
Member

ceberam commented Mar 6, 2026

Thanks @mkhludnev . It seems we hit another edge case. We should apply some safeguards to avoid the type of error you shared, in any case. Can you share the docx document? (or a version that could be publicly shared).
I will check it in more detail next week and eventually try to reuse this PR.
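One possible shape for such a safeguard is sketched below. It is only an illustration under assumptions: it treats the parents mapping as a plain dict from level to node (the `KeyError` in the traceback suggests a dict-like lookup), and the helper name `nearest_parent` is hypothetical, not an existing function in the backend.

```python
from typing import Any, Optional


def nearest_parent(parents: dict, level: int) -> Optional[Any]:
    """Return the closest defined parent at `level` or above it in the tree.

    Walks from `level` down to 0 and skips levels that are missing or None,
    so a gap in the hierarchy never raises KeyError.
    """
    for lvl in range(level, -1, -1):
        node = parents.get(lvl)
        if node is not None:
            return node
    return None


# Levels 0..2 are populated; the list item asks for a parent at level 17.
parents = {0: "body", 1: "heading", 2: "list_group", 3: None}
print(nearest_parent(parents, 17))
```

With a fallback like this, a document whose computed level overshoots the real nesting would attach the item to the deepest known ancestor instead of aborting the whole conversion. Whether that is the right recovery is a design decision for the maintainers.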

@mkhludnev
Contributor

I noticed that Writer numbers the problematic item as 2.7.8.1., whereas Tika numbers it as 1.14.8.1, so it's a rather weird document.
I'm afraid I can't share that particular docx.

@emreclsr force-pushed the fix/msword-list-items-after-heading branch from 1b1e05d to 8154490 on March 6, 2026 21:37

@emreclsr
Author

emreclsr commented Mar 6, 2026

To help demonstrate and validate the reported issue, I prepared a test Word document specifically designed to expose numbering edge cases with interleaved list sequences.

Test Document.docx

Results comparison:

  • Current main branch — parses the document as shown in main.md
  • Previous version of this PR — produced the output in pre-changes.md
  • After the latest changes — the document is parsed correctly as shown in after-changes.md

@mkhludnev Could you re-test with your document to confirm the issue is resolved on your end as well?

@ceberam If everything looks good, I'd appreciate your review and approval when you get a chance — happy to make any adjustments if needed.

@mkhludnev
Contributor

@emreclsr the failure I've posted above is fixed with your last commit. Thanks a lot.
@ceberam I quickly skim through the conversion output I noticed that the structure goes too deep, examine xml gives 4-5 level of nesting but processing by code main goes up to 18th - see exception above. So currently main code has unnecessary depth issue.

FWIW, list items numbers are still weird, but it's rather odd doc, I'm not even sure is it worth to bother about correct numbering and fix'em.

Summary

Please proceed with #2678

Thank you so much, colleagues!


@mkhludnev
Contributor

injected-problem.docx
@ceberam here's a document that fails with

msword_backend.py", line 1318, in _add_list_item
    parent=self.parents[i - 1],
           ~~~~~~~~~~~~^^^^^^^
KeyError: 16

That is on main; with this patch the document converts with only one warning.

I'm sure this document is very unusual: just opening it in LibreOffice Writer makes it convertible by the main branch version. Also, after I replaced the text nodes in the document XML, the numbered items seem to have been lost. So it's not the best reproducer, but it does let you check the KeyError exception on the main branch.

@ceberam ceberam self-requested a review March 10, 2026 12:19
@ceberam ceberam self-assigned this Mar 10, 2026

@emreclsr force-pushed the fix/msword-list-items-after-heading branch from 3a08432 to bcf17c2 on March 14, 2026 07:04
@ceberam
Member

ceberam commented Mar 16, 2026

@emreclsr could you please resolve the current conflict? Sorry about that, since this already happened while waiting for the review, but we'll try to address this PR ASAP.

@emreclsr force-pushed the fix/msword-list-items-after-heading branch from bcf17c2 to fe7ee9d on March 16, 2026 16:50

@emreclsr
Author

@ceberam I've resolved the conflict and rebased the branch onto the latest main. If everything looks good, I'd appreciate it if we could get this merged. Happy to make any adjustments if needed.

Member

@ceberam ceberam left a comment


Thanks @emreclsr for this new iteration and for discovering another issue.

I have checked the results of the new implementation on the test case and I saw a potential inconsistency. I highlighted it in the .itxt file because it is easier to explain there, but it also appears in the Docling document serialization (JSON file). Could you please check?
There is also one very minor style comment.

item-56 at level 6: list_item: Section A.1
item-57 at level 6: list: group list
item-58 at level 7: list_item: Detail A.1.1
item-59 at level 6: list: group list
Member


Not sure if we should open a new group list. The item Hardware Constraints – Egde Case appears at the same level as Detail A.1.1, even though the marker goes to 2.3.1 from 1.1.1.


…hical markers

  Fixes incorrect numbering and missing items in DOCX documents that use
  multiple interleaved numbering sequences (numIds).

  Changes:
  * Reset sub-level counters in _get_list_counter when a parent level
    advances, preventing counter bleed-across (e.g. "4. Functional
    Requirements" now correctly renders as "1. Functional Requirements")
  * Add _build_enum_marker helper to produce hierarchical markers in
    "1.2.3." format instead of flat single-level counters
  * Fix anchor-based level calculation in new-sequence branch: use
    level_at_new_list + ilevel instead of _get_level() to correctly
    place items from a different numId at the right document level
  * Only set level_at_new_list in the else case (when None) to avoid
    corrupting the anchor when switching between interleaved numIds
  * Remove _reset_list_counters_for_new_sequence from new-sequence branch
    so that returning to a previously seen numId continues its counter
    (e.g. Appendix A=1, B=2, C=3 instead of A=1, B=1, C=1)

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
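The counter behaviour described in this commit message can be illustrated with a small standalone sketch. It is not the backend's implementation: the `ListCounters` class and its method names are hypothetical stand-ins for the `_get_list_counter` and `_build_enum_marker` helpers named above. It only models three stated rules: advancing a level resets its sub-levels, markers are hierarchical ("1.2.3."), and switching to another numId does not reset a previously seen one.

```python
class ListCounters:
    """Hypothetical per-(numid, ilevel) counters with sub-level reset."""

    def __init__(self) -> None:
        self._counters: dict = {}  # (numid, ilevel) -> current count

    def advance(self, numid: int, ilevel: int) -> int:
        key = (numid, ilevel)
        self._counters[key] = self._counters.get(key, 0) + 1
        # A parent level advanced: drop deeper counters of the same numid
        # so that they restart from 1 and do not bleed across sections.
        for other_numid, other_level in list(self._counters):
            if other_numid == numid and other_level > ilevel:
                del self._counters[(other_numid, other_level)]
        return self._counters[key]

    def marker(self, numid: int, ilevel: int) -> str:
        # Levels that were never incremented fall back to 1 in this sketch.
        parts = [self._counters.get((numid, lvl), 1) for lvl in range(ilevel + 1)]
        return ".".join(str(p) for p in parts) + "."


counters = ListCounters()
counters.advance(1, 0)           # 1.
counters.advance(1, 1)           # 1.1.
counters.advance(1, 1)           # 1.2.
first = counters.marker(1, 1)
counters.advance(2, 0)           # interleaved sequence with another numId
counters.advance(1, 0)           # back to numId 1: continues at 2.
counters.advance(1, 1)           # sub-level restarted: 2.1.
second = counters.marker(1, 1)
print(first, second)
```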
…e reset helpers

  Adds test_list_counter_and_enum_marker covering helper methods introduced
  in the list numbering fix: counter increment, sub-level reset on parent
  advance, hierarchical marker building, and selective sequence reset.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
    When Word creates a new numbering definition that continues from a
    previous list, it embeds start values in the abstractNum XML instead of
    reusing the same numId. Docling previously ignored these start values
    and always initialized counters from 1, producing incorrect markers
    like "1.1.1." instead of "2.3.1.".

    Changes:
    * Add _get_level_element helper to extract level XML from abstractNum,
      eliminating duplicated XML traversal in _is_numbered_list
    * Add _get_start_value to read w:start from the numbering definition
    * Initialize counters in _get_list_counter using start values
    * Use start values as fallback in _build_enum_marker for parent levels
      that have not been explicitly incremented

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
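As a rough illustration of reading start values, the sketch below parses a hand-written numbering fragment with the standard library. It is a simplified assumption of how such a lookup could work, not the `_get_start_value` helper itself: the function name `start_value` and the sample XML are invented for this example, and level overrides inside `w:num` (`w:lvlOverride`) are deliberately ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample: numId 12 points at abstractNum 7, whose three
# levels start at 2, 3 and 1 respectively.
NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="7">
    <w:lvl w:ilvl="0"><w:start w:val="2"/></w:lvl>
    <w:lvl w:ilvl="1"><w:start w:val="3"/></w:lvl>
    <w:lvl w:ilvl="2"><w:start w:val="1"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="12"><w:abstractNumId w:val="7"/></w:num>
</w:numbering>
""".strip()


def start_value(root: ET.Element, num_id: int, ilevel: int, default: int = 1) -> int:
    """Look up w:start for (num_id, ilevel); fall back to `default`."""
    for num in root.findall("w:num", NS):
        if num.get(f"{{{W}}}numId") != str(num_id):
            continue
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            return default
        abstract_id = ref.get(f"{{{W}}}val")
        for abstract in root.findall("w:abstractNum", NS):
            if abstract.get(f"{{{W}}}abstractNumId") != abstract_id:
                continue
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(f"{{{W}}}ilvl") == str(ilevel):
                    start = lvl.find("w:start", NS)
                    if start is not None:
                        return int(start.get(f"{{{W}}}val"))
    return default


root = ET.fromstring(NUMBERING_XML)
marker = ".".join(str(start_value(root, 12, lvl)) for lvl in range(3)) + "."
print(marker)
```

Initialising counters from these values instead of from 1 is what lets the first item of the continued list carry a marker such as "2.3.1." rather than "1.1.1.".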
…ered

    Extends the existing test document with an Appendix section that uses a
    different numId, followed by list items that resume the original
    numbering sequence with Word-embedded start values (e.g. 2.3.1.).
    Updates groundtruth files accordingly.

Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
Signed-off-by: Emre Çalışır <emrecalisir95@gmail.com>
@emreclsr force-pushed the fix/msword-list-items-after-heading branch from b49df92 to 0777f36 on March 19, 2026 14:26

Labels

bug Something isn't working docx issue related to docx backend


3 participants