Spotting text in PDF table cells #1657
-
Hi, I'm working on a use case where I need to de-structure a whole PDF document and restructure it in a Flutter app.
tables, drawings and extracted images will fit under Problem
This should work for drawings in the provided documents. I expected it to work for tables as well, but no text is extracted when a table is stored as a drawing. The documents are attached. Please help me out. I've also tried the rects in get_texttrace, and they don't return the full text either. I've been stuck on this for days. Given the potential of PyMuPDF, it is the best fit for my use case: if I can get all the text for tables the same way I'm extracting it for drawings, that will solve a lot of my problems.
Replies: 8 comments 11 replies
-
I haven't looked into the details yet, but here is a first hint:
-
It depends on how the lines are drawn. Many PDF creators do not actually draw lines, but thin rectangles instead.
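To illustrate the point about thin rectangles, here is a minimal sketch of how table rules could be detected regardless of how they were drawn. It relies on `page.get_drawings()` returning paths whose `"items"` contain `("re", rect, ...)` and `("l", p1, p2)` tuples. The helper names `classify_thin_rect` and `find_rules` and the `max_thickness` threshold are made up for this example, not part of PyMuPDF.

```python
def classify_thin_rect(rect, max_thickness=2.0):
    """Return 'horizontal', 'vertical', or None for an (x0, y0, x1, y1) rectangle.

    max_thickness is an assumed threshold in points; tune it per document.
    """
    x0, y0, x1, y1 = rect
    width, height = abs(x1 - x0), abs(y1 - y0)
    if height <= max_thickness < width:
        return "horizontal"
    if width <= max_thickness < height:
        return "vertical"
    return None  # a "real" rectangle (or a dot), not a rule


def find_rules(page, max_thickness=2.0):
    """Collect rule lines from page.get_drawings(), whether drawn as
    true line segments or as thin rectangles.

    Returns a list of (kind, (x0, y0, x1, y1)) pairs.
    """
    rules = []
    for path in page.get_drawings():
        for item in path["items"]:
            if item[0] == "re":  # rectangle item: ("re", rect, ...)
                rect = tuple(item[1])
            elif item[0] == "l":  # line item: ("l", p1, p2)
                (ax, ay), (bx, by) = item[1], item[2]
                rect = (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))
            else:
                continue  # curves and quads are ignored in this sketch
            kind = classify_thin_rect(rect, max_thickness)
            if kind:
                rules.append((kind, rect))
    return rules
```

With a real document, `find_rules(doc[0])` would be called on a page object. Large filled rectangles such as cell backgrounds are skipped because neither side is thin.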
-
I am not sure what your ultimate goal actually is.
-
As I wrote: your PDF creator has decided not to draw lines, but thin rectangles instead. MS Word and LibreOffice always do the same when exporting office documents to PDF.
-
I am taking the liberty of changing the issue title to something that gives others an idea of what it is all about.
-
Hi, I've found a solution to my problem by using ... Thank you for such a beautiful package. :)
Indeed, if you do this:
You get this:
So except for the header, no text is extracted based on drawing rectangles.
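One possible workaround, sketched here under assumptions and not necessarily the solution mentioned in this thread: build cell rectangles from the positions of the rule lines, then extract text with each cell as the `clip` of `page.get_text()`. The input format (a list of `(kind, (x0, y0, x1, y1))` pairs), the helper names and the `tol` value are hypothetical. The sketch also assumes a regular grid without merged cells.

```python
def cells_from_rules(rules, tol=1.0):
    """Build cell rectangles from horizontal and vertical rule lines.

    rules: list of (kind, (x0, y0, x1, y1)), kind is 'horizontal' or 'vertical'.
    tol:   coordinates closer than this are treated as the same rule (assumed value).
    Assumes a regular grid: every rule spans the whole table.
    """
    def merged(values):
        out = []
        for v in sorted(values):
            if not out or v - out[-1] > tol:
                out.append(v)
        return out

    # Use the centre line of each (possibly thin-rectangle) rule.
    ys = merged((r[1] + r[3]) / 2 for kind, r in rules if kind == "horizontal")
    xs = merged((r[0] + r[2]) / 2 for kind, r in rules if kind == "vertical")
    # Row-major order: top row first, left to right.
    return [
        (x0, y0, x1, y1)
        for y0, y1 in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


def text_in_cells(page, cells):
    """Extract the text inside each cell by clipping get_text to the cell."""
    return [page.get_text("text", clip=cell).strip() for cell in cells]
```

Clipping to cell areas rather than to the drawing rectangles themselves avoids the problem above, where the rectangles are only hairline-thin borders and therefore contain no text.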