Feature request: Extract both texts and tables on the same page #3095
Unanswered
jimmyzzxhlh asked this question in Q&A
Replies: 1 comment
-
I recommend using the following simple logic in your own code:
It is not worth including this in the package itself. On the contrary: in the general case this may become extremely complex. You may have tables with text on either side, or tables with overlapping top/bottom intervals, and so on.
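A minimal sketch of what such simple logic might look like. This is not the code from the reply, only an illustration under assumptions. It cuts the page into horizontal bands above, between, and below the tables, and reads the text of each band with a clip rectangle. The helper names split_into_bands and extract_text_and_tables are hypothetical. The only PyMuPDF calls used are Page.find_tables(), Table.bbox, Table.extract(), Page.rect, and Page.get_text() with its clip argument.

```python
def split_into_bands(page_rect, table_bboxes):
    """Return a top-to-bottom list of ("text", rect) and ("table", index) items.

    page_rect and every entry of table_bboxes are (x0, y0, x1, y1) tuples.
    Hypothetical helper, not part of PyMuPDF.
    """
    x0, y0, x1, y1 = page_rect
    items = []
    cursor = y0  # lowest y coordinate already covered
    # Visit the tables from the top of the page to the bottom.
    order = sorted(range(len(table_bboxes)), key=lambda i: table_bboxes[i][1])
    for i in order:
        top, bottom = table_bboxes[i][1], table_bboxes[i][3]
        if top > cursor:
            # Full-width strip of page between the previous table and this one.
            items.append(("text", (x0, cursor, x1, top)))
        items.append(("table", i))
        # max() keeps the cursor monotonic when table intervals overlap.
        cursor = max(cursor, bottom)
    if cursor < y1:
        items.append(("text", (x0, cursor, x1, y1)))
    return items


def extract_text_and_tables(page):
    """Interleave text and tables of a pymupdf.Page, top to bottom."""
    tables = page.find_tables().tables
    bands = split_into_bands(tuple(page.rect), [t.bbox for t in tables])
    result = []
    for kind, value in bands:
        if kind == "table":
            result.append(("table", tables[value].extract()))
        else:
            text = page.get_text("text", clip=value).strip()
            if text:
                result.append(("text", text))
    return result
```

With a page loaded through pymupdf.open(...), calling extract_text_and_tables(page) yields ("text", str) and ("table", rows) pairs in vertical order. The sketch deliberately ignores the hard cases named above: text sitting to the left or right of a table is dropped, and tables whose top/bottom intervals overlap are simply emitted one after the other.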
-
Is your feature request related to a problem? Please describe.
I'm testing out extracting text and tables using PyMuPDF. Some pages in the PDF may contain both text and a table.
Example (tables starting from page 3):
https://www.aetnamedicare.com/documents/individual/2024/summaryofbenefits/Y0001_H5521_127_PQ05_SB24_M.pdf
PyMuPDF works great at extracting the tables using Page.find_tables(), and it correctly identifies rows/columns. However, I haven't found a great way to extract both the text outside of tables and the tables on the same page.
Ideally, I would expect a function, something like get_text_and_tables(), which returns a list of either text or tables in natural reading order. Then, based on the type of each element, I can determine what to do with the text or the table.
The closest thing I can think of for now is the following, but it's probably going to be error-prone:
1. Page.get_text() to extract all the text (which will contain text from the tables on the page)
2. Page.find_tables() to extract the tables
3. Then try to combine the Page.get_text() output and the tables together.
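For illustration, the combine step could be sketched roughly as below. This is an assumption-laden sketch, not an existing PyMuPDF function: merge_text_and_tables and center_inside are hypothetical helpers. It relies on Page.get_text("blocks") returning tuples of the form (x0, y0, x1, y1, text, block_no, block_type) and on Table.bbox. Text blocks whose center lies inside a table are dropped so table text is not reported twice, and "reading order" is approximated by sorting on the top-left corner.

```python
def center_inside(bbox, container):
    """True if the center point of bbox lies within container."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return container[0] <= cx <= container[2] and container[1] <= cy <= container[3]


def merge_text_and_tables(blocks, table_bboxes):
    """Combine text blocks and tables into one list, sorted top to bottom.

    blocks:       output of page.get_text("blocks")
    table_bboxes: [t.bbox for t in page.find_tables().tables]
    Returns ("text", str) and ("table", index) pairs.
    """
    elements = [("table", i, bbox) for i, bbox in enumerate(table_bboxes)]
    for block in blocks:
        bbox, text, block_type = block[:4], block[4], block[6]
        if block_type != 0:
            continue  # skip image blocks
        if any(center_inside(bbox, tb) for tb in table_bboxes):
            continue  # this text already belongs to a table
        elements.append(("text", text.strip(), bbox))
    # Approximate reading order: by top edge, then by left edge.
    elements.sort(key=lambda e: (e[2][1], e[2][0]))
    return [(kind, value) for kind, value, _ in elements]
```

Each ("table", index) entry points back into the list of tables that was passed in, so the caller can run Table.extract() or Table.to_pandas() on it. Sorting by the top-left corner is only a rough stand-in for natural reading order and will misorder multi-column layouts.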