fix(docx): restore parent stack after processing rich table cells #3047

Open
Br1an67 wants to merge 3 commits into docling-project:main from Br1an67:fix/table-parent-stack-restore

Conversation

Br1an67 (Contributor) commented Mar 1, 2026

Issue resolved by this Pull Request:
Resolves #2668

Description

In _handle_tables, when _walk_linear is called for rich table cells, it modifies self.parents, self.level, and self.level_at_new_list. These changes leak into subsequent document processing, causing sections after tables with formatted cells (bold, italic, etc.) to be incorrectly nested under the table content.

This fix saves and restores the parser state (parents dict, level, and level_at_new_list) around the _walk_linear call for rich cells, preventing state contamination.
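The leak and the save/restore fix described above can be illustrated with a toy model. This is a minimal sketch, not the actual backend code: `ToyParser`, `handle_rich_cell_leaky`, and `handle_rich_cell_fixed` are hypothetical names, and the `_walk_linear` body here only simulates the side effects named in the description.

```python
class ToyParser:
    """Hypothetical stand-in for the parser state described in this PR."""

    def __init__(self):
        self.parents = {0: "body"}
        self.level = 0
        self.level_at_new_list = None

    def _walk_linear(self, element):
        # Simulated side effects: walking content pushes a new parent
        # and changes both level counters.
        self.level += 1
        self.parents[self.level] = element
        self.level_at_new_list = self.level

    def handle_rich_cell_leaky(self, cell):
        # State changes made by the walk remain visible to the caller.
        self._walk_linear(cell)

    def handle_rich_cell_fixed(self, cell):
        # Snapshot the state, walk the cell, then restore the snapshot.
        saved_parents = dict(self.parents)
        saved_level = self.level
        saved_level_at_new_list = self.level_at_new_list
        self._walk_linear(cell)
        self.parents = saved_parents
        self.level = saved_level
        self.level_at_new_list = saved_level_at_new_list


leaky = ToyParser()
leaky.handle_rich_cell_leaky("bold cell")

fixed = ToyParser()
fixed.handle_rich_cell_fixed("bold cell")
```

In this toy model, `leaky.level` ends at 1 with the cell registered as a parent, so whatever is parsed next would attach under the cell. `fixed` returns to its initial state.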

Changes

  • docling/backend/msword_backend.py: Save/restore self.parents, self.level, self.level_at_new_list before/after _walk_linear in rich cell processing
  • tests/test_backend_msword.py: Add test_rich_table_cell_parent_stack_preserved test
  • tests/data/docx/table_bold_header.docx: Test fixture with rich table cells followed by sections
  • Ground truth updates for docx_rich_cells.docx to reflect corrected document structure

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

github-actions bot (Contributor) commented Mar 1, 2026

DCO Check Passed

Thanks @Br1an67, all your commits are properly signed off. 🎉

mergify bot commented Mar 1, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
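The title rule above can be checked locally before pushing. This is a minimal sketch, assuming the `~=` operator behaves like a regular-expression search. `is_conventional` is a hypothetical helper name, not part of Mergify.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


ok_scoped = is_conventional(
    "fix(docx): restore parent stack after processing rich table cells"
)
ok_unscoped = is_conventional(
    "fix: restore parent stack after processing rich table cells"
)
rejected = is_conventional("Restore parent stack after processing rich table cells")
```

The scope in parentheses is optional in the pattern, so both the scoped and the unscoped title satisfy the rule.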

dosubot bot commented Mar 1, 2026

Related Documentation

Checked 20 published document(s) in 1 knowledge base(s). No updates required.

In _handle_tables, when _walk_linear is called for rich table cells,
it modifies self.parents, self.level, and self.level_at_new_list.
These changes leak into subsequent document processing, causing
sections after tables with formatted cells to be incorrectly nested.

Save and restore the parser state (parents dict, level, and
level_at_new_list) around the _walk_linear call for rich cells.

Update ground truth for docx_rich_cells to reflect the corrected
document structure.

Resolves docling-project#2668

Signed-off-by: Br1an67 <932039080@qq.com>
Br1an67 force-pushed the fix/table-parent-stack-restore branch from 4211311 to f6c3d14 on March 1, 2026 08:13

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

The def line for test_add_header_footer was accidentally removed,
causing its body to be absorbed into the preceding test function.

Signed-off-by: Br1an67 <932039080@qq.com>
cau-git changed the title from "fix: restore parent stack after processing rich table cells" to "fix(msword): restore parent stack after processing rich table cells" on Mar 16, 2026
cau-git requested a review from ceberam on March 16, 2026 14:38
cau-git changed the title from "fix(msword): restore parent stack after processing rich table cells" to "fix(docx): restore parent stack after processing rich table cells" on Mar 16, 2026

Save and restore self.parents, self.level, and self.level_at_new_list
around the _walk_linear call for 1x1 furniture tables, matching the
same pattern used for rich table cells. Without this, a 1x1 table
followed by a section header could still cause incorrect nesting.

Signed-off-by: Br1an67 <932039080@qq.com>

ceberam (Member) left a comment

Thanks @Br1an67 for your contribution!

I left a suggestion for your changes in msword_backend.py.

Regarding the test, it's very helpful to include a use case. However, I would add it to the existing document for testing rich table cells, docx_rich_cells.docx, so that use cases of the same type stay in the same place. The ground truth files will need to be regenerated, and not only the itxt ones. Therefore, I would recommend the following steps:

  • Sync your fork to the main repository and rebase to the latest main branch
  • Resolve any conflicts
  • Ensure the style checks pass: uv run pre-commit run --all-files
  • Copy the content of table_bold_header.docx and append it to the existing docx_rich_cells.docx. Delete the table_bold_header.* ground truth files.
  • Regenerate the docx ground truth files:
    • Set the environment variable: export DOCLING_GEN_TEST_DATA=True
    • Run the docx tests: uv run pytest tests/test_backend_msword.py
  • Inspect the diff of the new ground truth files and ensure that the changes are legitimate (for instance, a simple bump of the Docling version in the version field of the .json files)
  • Commit the changes
  • Force-push to the remote repository

Let me know if you have any questions!

Comment on lines +1404 to +1410
saved_parents = dict(self.parents)
saved_level = self.level
saved_level_at_new_list = self.level_at_new_list
self._walk_linear(cell_element._element, doc)
self.parents = saved_parents
self.level = saved_level
self.level_at_new_list = saved_level_at_new_list

This is a valid fix and it does the job.
However, to increase readability and to keep some consistency across the backend, what about using the context manager pattern like we do in the HTML backend?
You can define the changes needed before and after running a walk in a function, with the @contextmanager decorator:

https://github.com/docling-project/docling/blob/main/docling/backend/html_backend.py#L955-L968

and then simply walk within that context:

https://github.com/docling-project/docling/blob/main/docling/backend/html_backend.py#L579-L580

You can use it too with the other changes further below.
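The context manager pattern suggested here could look roughly like the following. This is a minimal sketch under stated assumptions: `ToyBackend`, `_preserved_state`, and `handle_cell` are hypothetical names, the `_walk_linear` body only simulates state changes, and the helper in the HTML backend linked above may be shaped differently.

```python
from contextlib import contextmanager


class ToyBackend:
    """Hypothetical stand-in holding the three state attributes from this PR."""

    def __init__(self):
        self.parents = {0: "body"}
        self.level = 0
        self.level_at_new_list = None

    @contextmanager
    def _preserved_state(self):
        # Snapshot the parser state before the walk.
        saved = (dict(self.parents), self.level, self.level_at_new_list)
        try:
            yield
        finally:
            # Restore the snapshot, even if the walk raised an exception.
            self.parents, self.level, self.level_at_new_list = saved

    def _walk_linear(self, element):
        # Simulated side effects of walking nested content.
        self.level += 1
        self.parents[self.level] = element
        self.level_at_new_list = self.level

    def handle_cell(self, cell):
        with self._preserved_state():
            self._walk_linear(cell)


backend = ToyBackend()
backend.handle_cell("rich cell")
```

Compared with the inline save/restore lines, the `try`/`finally` also restores the state when the walk fails, and the same `with` block can wrap each call site that needs it, including the 1x1 table case.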

Development

Successfully merging this pull request may close these issues.

Section after table being incorrectly added as child of table header

2 participants