Feature request: Extract both texts and tables on the same page #3095

jimmyzzxhlh · 2024-01-24T18:31:14Z

Is your feature request related to a problem? Please describe.
I'm testing text and table extraction with PyMuPDF. Some pages in a PDF contain both text and a table.
Example (tables starting from page 3):
https://www.aetnamedicare.com/documents/individual/2024/summaryofbenefits/Y0001_H5521_127_PQ05_SB24_M.pdf

PyMuPDF works great for extracting tables with Page.find_tables(), and it correctly identifies rows and columns. However, I haven't found a good way to extract both the text outside the tables and the tables themselves from the same page.

Ideally, I would expect a function such as get_text_and_tables() that returns a list of text and table elements in natural reading order. Based on each element's type, I could then decide what to do with the text or the table.

The closest thing I can think of for now is the following, but it's probably going to be error-prone:

1. Call Page.get_text() to extract all the text (which will include the text inside the tables on the page).
2. Call Page.find_tables() to extract the tables.
3. Figure out the first cell and the last cell of each table, and delete the corresponding text from the Page.get_text() output. Then try to combine the remaining text and the tables.
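
A rough sketch of step 3, under assumptions: instead of matching the first and last cell strings, it swaps in a geometric filter that drops every text block whose bounding box overlaps a table's bbox. The helper names (`overlaps`, `blocks_outside_tables`, `page_text_without_tables`) are hypothetical and not part of PyMuPDF. It relies on `Page.get_text("blocks")` returning tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)` and on each found table exposing a `bbox`. A block that straddles a table border would be dropped whole, so this is still error-prone.

```python
def overlaps(a, b):
    """True if rectangles a and b (x0, y0, x1, y1) share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def blocks_outside_tables(blocks, table_bboxes):
    """Keep only the blocks whose bbox overlaps none of the table bboxes."""
    return [
        blk for blk in blocks
        if not any(overlaps(blk[:4], bbox) for bbox in table_bboxes)
    ]


def page_text_without_tables(page):
    """Hypothetical helper: text of a PyMuPDF page minus anything inside tables."""
    table_bboxes = [tab.bbox for tab in page.find_tables().tables]
    blocks = page.get_text("blocks")  # (x0, y0, x1, y1, text, block_no, block_type)
    kept = blocks_outside_tables(blocks, table_bboxes)
    # block_type 0 marks text blocks; image blocks (type 1) are skipped
    return "\n".join(blk[4] for blk in kept if blk[6] == 0)
```

The two pure functions work on plain tuples, so they can be checked without opening a PDF.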

JorjMcKie (Maintainer) · 2024-01-25T10:37:35Z

I recommend using the following simple logic in your own code:

1. Find all tables on the page and make a list of their bboxes (in Rect format).
2. Extract the page text in chunks using clip rectangles: each clip's top coordinate is the bottom coordinate (y1) of the preceding table (or the top of the page), and its bottom coordinate is the top coordinate (y0) of the following table (or the bottom of the page after the last table).
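
The two steps above might look roughly like this. It is a minimal sketch, not a definitive implementation: `text_bands` and `text_and_tables` are hypothetical names, and it assumes the tables are stacked vertically and span the area of interest, so text sitting beside a table is ignored. It uses `Page.find_tables()`, each table's `bbox` and `extract()`, and `Page.get_text()` with a `clip` rectangle.

```python
def text_bands(page_rect, table_bboxes):
    """Return clip rectangles (x0, y0, x1, y1) for the strips between tables."""
    px0, py0, px1, py1 = page_rect
    bands = []
    top = py0  # bottom edge of the previous table, or the page top
    for _, ty0, _, ty1 in sorted(table_bboxes, key=lambda b: b[1]):
        if ty0 > top:
            bands.append((px0, top, px1, ty0))
        top = max(top, ty1)  # also copes with vertically overlapping tables
    if py1 > top:
        bands.append((px0, top, px1, py1))  # strip below the last table
    return bands


def text_and_tables(page):
    """Hypothetical helper: ("text", str) / ("table", rows) items, top to bottom."""
    import fitz  # PyMuPDF; imported here so text_bands() works without it

    tables = page.find_tables().tables
    items = [(tab.bbox[1], "table", tab.extract()) for tab in tables]
    for band in text_bands(tuple(page.rect), [tab.bbox for tab in tables]):
        text = page.get_text("text", clip=fitz.Rect(band)).strip()
        if text:
            items.append((band[1], "text", text))
    items.sort(key=lambda item: item[0])  # order by top coordinate
    return [(kind, content) for _, kind, content in items]
```

Calling `text_and_tables(doc[2])` on an opened document would then give a list that can be dispatched on the first element of each pair.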

It is not worth including this in the package itself. On the contrary: in the general case this can become extremely complex. You may have tables with text on either side, tables with overlapping top/bottom intervals, and so on.
