fix(docx): restore parent stack after processing rich table cells#3047
fix(docx): restore parent stack after processing rich table cells#3047Br1an67 wants to merge 3 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @Br1an67, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
In _handle_tables, when _walk_linear is called for rich table cells, it modifies self.parents, self.level, and self.level_at_new_list. These changes leak into subsequent document processing, causing sections after tables with formatted cells to be incorrectly nested. Save and restore the parser state (parents dict, level, and level_at_new_list) around the _walk_linear call for rich cells. Update ground truth for docx_rich_cells to reflect the corrected document structure. Resolves docling-project#2668 Signed-off-by: Br1an67 <932039080@qq.com>
4211311 to
f6c3d14
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The def line for test_add_header_footer was accidentally removed, causing its body to be absorbed into the preceding test function. Signed-off-by: Br1an67 <932039080@qq.com>
Save and restore self.parents, self.level, and self.level_at_new_list around the _walk_linear call for 1x1 furniture tables, matching the same pattern used for rich table cells. Without this, a 1x1 table followed by a section header could still cause incorrect nesting. Signed-off-by: Br1an67 <932039080@qq.com>
ceberam
left a comment
There was a problem hiding this comment.
Thanks @Br1an67 for your contribution!
I left a suggestion for your changes in msword_backend.py.
Regarding the test, it's very helpful to include a use case. However, I would add it to the already existing document for testing rich table cells: docx_rich_cells.docx. We can then keep the same type of use cases in the same place. Ground truth files will need to regenerated, and not only the itxt ones. Therefore, I would recommend the following steps:
- Sync your fork to the main repository and rebase to the latest
mainbranch - Resolve eventual conflicts
- Ensure the style checks pass:
uv run pre-commit run --all-files - Copy the content of
table_bold_header.docxand append it to the existingdocx_rich_cells.docx. Delete thetable_bold_header.*ground truth files. - Regenerate the
docxground truth files:- Set the environment variable:
export DOCLING_GEN_TEST_DATA=True - Run the
docxtests:uv run pytest tests/test_backend_msword.py
- Set the environment variable:
- Inspect the difference of the new ground truth files and ensure that the changes are legitimate (for instance, a simple upgrade of the Docling version, in field
versionof the.jsonfiles) - Commit the changes
- Force-push to the remote repository
Let me know if any question!
| saved_parents = dict(self.parents) | ||
| saved_level = self.level | ||
| saved_level_at_new_list = self.level_at_new_list | ||
| self._walk_linear(cell_element._element, doc) | ||
| self.parents = saved_parents | ||
| self.level = saved_level | ||
| self.level_at_new_list = saved_level_at_new_list |
There was a problem hiding this comment.
This is a valid fix and it does the job.
However, to increase readability and to keep some consistency across the backend, what about using the context manager pattern like we do in the HTML backend?
You can define the changes needed before and after running a walk in a function, with the @contextmanager decorator:
https://github.com/docling-project/docling/blob/main/docling/backend/html_backend.py#L955-L968
and then simply walk within that context:
https://github.com/docling-project/docling/blob/main/docling/backend/html_backend.py#L579-L580
You can use it too with the other changes further below.
Issue resolved by this Pull Request:
Resolves #2668
Description
In
_handle_tables, when_walk_linearis called for rich table cells, it modifiesself.parents,self.level, andself.level_at_new_list. These changes leak into subsequent document processing, causing sections after tables with formatted cells (bold, italic, etc.) to be incorrectly nested under the table content.This fix saves and restores the parser state (
parentsdict,level, andlevel_at_new_list) around the_walk_linearcall for rich cells, preventing state contamination.Changes
docling/backend/msword_backend.py: Save/restoreself.parents,self.level,self.level_at_new_listbefore/after_walk_linearin rich cell processingtests/test_backend_msword.py: Addtest_rich_table_cell_parent_stack_preservedtesttests/data/docx/table_bold_header.docx: Test fixture with rich table cells followed by sectionsdocx_rich_cells.docxto reflect corrected document structureChecklist: