Modified extract_cells() to detect and extract image blocks (type==1) within table cells, not just text blocks (type==0).

Changes:
- Updated extract_cells() to accept page and document parameters
- Added logic to detect image blocks within cell bounding boxes
- Implemented image extraction and saving for cells with images
- Images are now embedded in cell markdown using Markdown image syntax
- Updated table_to_markdown() and table_extract() signatures
- Updated calls in document_layout.py to pass page/document context
- Added test script to demonstrate the fix

When write_images=True or embed_images=True, images found in table cells are now extracted and referenced inline within the cell markdown, resolving the issue where images appeared below tables.
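To make the cell-level idea concrete, here is a minimal sketch of how text blocks (type 0) and image blocks (type 1) could be combined into one cell's markdown. It is not the code from this PR: `cell_markdown`, `center_in`, and the `table-img-N.png` naming are hypothetical, and the block dicts are a simplified stand-in with a flat `text` key (an assumption; real PyMuPDF text blocks nest their text under lines and spans).

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center_in(bbox: Rect, cell: Rect) -> bool:
    """Return True if the center point of bbox lies inside cell."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return cell[0] <= cx <= cell[2] and cell[1] <= cy <= cell[3]


def cell_markdown(cell: Rect, blocks: List[Dict], image_prefix: str = "table-img") -> str:
    """Build markdown for one cell from the blocks that fall inside it.

    Text blocks (type 0) contribute their text; image blocks (type 1)
    contribute an inline image reference. File names are hypothetical.
    """
    parts: List[str] = []
    image_count = 0
    for block in blocks:
        if not center_in(block["bbox"], cell):
            continue  # block belongs to another cell or lies outside the table
        if block["type"] == 0:
            # Escape pipes so the text cannot break the table row.
            text = block.get("text", "").strip().replace("|", "\\|")
            if text:
                parts.append(text)
        elif block["type"] == 1:
            image_count += 1
            parts.append(f"![]({image_prefix}-{image_count}.png)")
    return " ".join(parts)
```

Blocks are emitted in the order given, so a caller that wants reading order would need to sort them first.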
This fix enables images to appear inside their corresponding table cells instead of being extracted separately below the table.

Changes for LEGACY MODE (pymupdf_rag.py):
- Added add_images_to_table_markdown() function to detect images within table cell boundaries
- Images with >50% overlap with a cell are assigned to that cell
- Generates unique filenames for table cell images
- Supports both write_images and embed_images modes
- Inserts image markdown syntax inline with cell text
- Updated all 3 locations where table.to_markdown() is called

Changes for LAYOUT MODE (document_layout.py):
- Updated table_blocks to include image blocks (type==1)
- Modified extract_cells() to detect and extract images in cells
- Added page/document parameters to table extraction functions
- Images are extracted and referenced inline in cells

TESTING: Tested with PDFs containing embedded images. All images appear inside their table cells in the markdown output.

Before fix:

| Col1 | Col2 | Image |
|---|---|---|
| Text | Text | |

(image reference emitted here, below the table)

After fix:

| Col1 | Col2 | Image |
|---|---|---|
| Text | Text | (image reference) |

Resolves the requested behavior from Issue pymupdf#21.
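The ">50% overlap" rule can be illustrated with a short, self-contained sketch. This is an assumption-laden illustration rather than the PR's implementation: the helper names are hypothetical, rectangles are plain `(x0, y0, x1, y1)` tuples, and the overlap is measured as a fraction of the image's own area (the commit message does not say which area is the denominator).

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlap_fraction(image: Rect, cell: Rect) -> float:
    """Fraction of the image's area that lies inside the cell."""
    intersection = (
        max(image[0], cell[0]),
        max(image[1], cell[1]),
        min(image[2], cell[2]),
        min(image[3], cell[3]),
    )
    image_area = area(image)
    return area(intersection) / image_area if image_area > 0 else 0.0


def assign_image_to_cell(
    image: Rect, cells: Sequence[Rect], threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the cell covering more than `threshold` of the
    image, or None if no cell qualifies."""
    best_index: Optional[int] = None
    best_fraction = threshold
    for index, cell in enumerate(cells):
        fraction = overlap_fraction(image, cell)
        if fraction > best_fraction:  # strictly greater, matching ">50%"
            best_index, best_fraction = index, fraction
    return best_index
```

With a strict threshold above one half, at most one non-overlapping cell can claim a given image, so an image straddling a border evenly is left unassigned rather than duplicated.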
I have read the CLA Document and I hereby sign the CLA
Fixes #21
Problem
Images in table cells were appearing below tables instead of inside the cells.
Solution
Implemented image detection within table cells for both legacy and layout modes.
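For the legacy path, where images are added to markdown that a table has already produced, the post-processing step might look roughly like the sketch below. Everything here is hypothetical: `insert_images` is not the PR's add_images_to_table_markdown(), the `(row, column)` keys count body rows from 0 after the header and separator lines, and the naive split on `|` assumes cells contain no escaped pipes.

```python
from typing import Dict, List, Tuple


def insert_images(table_md: str, images: Dict[Tuple[int, int], str]) -> str:
    """Append image references to the matching cells of a markdown table.

    `images` maps (body_row, column) to an image reference string.
    The first two lines (header and separator) are passed through unchanged.
    """
    out: List[str] = []
    body_row = 0
    for line_no, line in enumerate(table_md.splitlines()):
        stripped = line.strip()
        is_row = stripped.startswith("|") and stripped.endswith("|") and len(stripped) > 1
        if line_no < 2 or not is_row:
            out.append(line)
            continue
        # Drop the outer pipes, then split into cells (no escaped pipes assumed).
        cells = [c.strip() for c in stripped[1:-1].split("|")]
        for col in range(len(cells)):
            ref = images.get((body_row, col))
            if ref:
                # Keep existing cell text and place the image after it.
                cells[col] = f"{cells[col]} {ref}".strip()
        out.append("| " + " | ".join(cells) + " |")
        body_row += 1
    return "\n".join(out)
```

Rows are re-emitted with normalized spacing, which keeps the sketch short at the cost of not preserving the original column padding.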
Before Fix
After Fix
Technical Changes
Legacy Mode (pymupdf_rag.py) - All users:
- Added add_images_to_table_markdown() function
- Inserts image markdown inline
- Updated table.to_markdown() call sites

Layout Mode (document_layout.py) - pymupdf_layout users:
Testing
Benefits