Extraction of text #2626

meghanaviyyapu · 2023-08-28T09:41:26Z

meghanaviyyapu
Aug 28, 2023

import fitz  # PyMuPDF

doc = fitz.open("32542.pdf")
page = doc[0]
for block in page.get_text("dict", sort=True)["blocks"]:
    if block["type"] == 0 and "lines" in block:  # text blocks only
        for line in block["lines"]:
            for span in line["spans"]:
                print(span["text"])

I used the code above to extract text from page 1 of the PDF below. BNL-76953-2006-CP is visible only once on the page, but when extracting the text spans I see it more than once.

Can you please let me know the reason?
Uploading 32542.pdf…

JorjMcKie · 2023-08-28T09:59:21Z

JorjMcKie
Aug 28, 2023
Maintainer

Clicking on your link doesn't do anything - cannot look at the file.

1 reply

JorjMcKie Aug 28, 2023
Maintainer

There may be trivial explanations, e.g. the unwanted text is white-on-white, or the text is hidden PDF-wise (Tr value 3), etc.
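To check the white-on-white idea, one could filter the spans by color. This is only a rough sketch: it assumes a white page background and relies on the "color" key of each span in the "dict" output (an sRGB integer). The helper name white_spans is made up for illustration, and the sketch does not cover text hidden via Tr 3.

```python
WHITE = 0xFFFFFF  # sRGB integer for pure white


def white_spans(page_dict, color=WHITE):
    """Return non-empty text spans whose color equals `color`.

    `page_dict` is expected to look like the result of page.get_text("dict").
    """
    hits = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if span["color"] == color and span["text"].strip():
                    hits.append(span)
    return hits


# Possible usage with PyMuPDF (file name taken from this thread):
#   import fitz
#   page = fitz.open("32542.pdf")[0]
#   for span in white_spans(page.get_text("dict")):
#       print(span["bbox"], repr(span["text"]))
```

A hit is only a hint: white text on a dark background is perfectly visible, so the span's position still needs a look.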

meghanaviyyapu · 2023-08-28T10:48:47Z

meghanaviyyapu
Aug 28, 2023
Author

32542.pdf
Have uploaded the file

5 replies

meghanaviyyapu Aug 28, 2023
Author

Is there a way to eliminate the redundant text?

JorjMcKie Aug 28, 2023
Maintainer

Yes, using redaction annotations.
The potential problem with this is that everything overlapping the redaction rectangle will also be erased.
But you are lucky: you may not have noticed that your desired text comes from a PDF field. Fields and annotations are not affected by redactions.
So you should be fine.
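A minimal sketch of the redaction approach, assuming the rectangles to erase are already known. The helper name redact_rects and the output file name are hypothetical; page.add_redact_annot() and page.apply_redactions() are the PyMuPDF calls doing the work.

```python
def redact_rects(page, rects):
    """Erase page content overlapping any rectangle in `rects`.

    Returns the number of redaction annotations that were added.
    """
    count = 0
    for rect in rects:
        page.add_redact_annot(rect)  # mark the area to be erased
        count += 1
    page.apply_redactions()  # actually remove the marked content
    return count


# Possible usage with PyMuPDF (file name and search text from this thread):
#   import fitz
#   doc = fitz.open("32542.pdf")
#   page = doc[0]
#   rects = page.search_for("BNL-76953-2006-CP")
#   redact_rects(page, rects)
#   doc.save("32542-redacted.pdf")  # hypothetical output name
```

Using search_for() to find the rectangles is one option among several; the rectangle of the field itself would presumably work as well.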

meghanaviyyapu Aug 28, 2023
Author

Thanks so much for explaining. Can you let me know how these scenarios can be identified while extracting spans of text using page.get_text()?

JorjMcKie Aug 28, 2023
Maintainer

There can be no general recipe, can there?
In your case, with this stupid PDF, you might want to check whether there are text fields.
If so, you could check whether some standard text exists whose bounding box overlaps the field rectangle. If it does, delete it via redactions (or simply skip it during text extraction).
Etc.

Simply using naive text extraction page.get_text("text") will not lead you anywhere. You - at a minimum - need some position information to find out that there are text portions overlapping each other - which should make you suspicious.
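The overlap check described above could be sketched like this. The helper names overlaps and spans_over_fields are invented for illustration; the field rectangles are assumed to come from page.widgets(), and rectangles are treated as plain (x0, y0, x1, y1) tuples.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share a non-empty area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def spans_over_fields(page_dict, field_rects):
    """Return text spans whose bbox overlaps at least one field rectangle."""
    hits = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # text blocks only
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if any(overlaps(span["bbox"], r) for r in field_rects):
                    hits.append(span)
    return hits


# Possible usage with PyMuPDF (file name taken from this thread):
#   import fitz
#   page = fitz.open("32542.pdf")[0]
#   field_rects = [tuple(w.rect) for w in page.widgets()]
#   for span in spans_over_fields(page.get_text("dict"), field_rects):
#       print(span["bbox"], repr(span["text"]))
```

The hits are candidates only: the field's own text may show up in the extraction too, so finding the same text twice inside one field rectangle is the stronger signal.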

meghanaviyyapu Aug 29, 2023
Author

Thanks. Is this scenario a rare one and does it depend on how the PDF was created?

JorjMcKie · 2023-08-29T21:00:06Z

JorjMcKie
Aug 29, 2023
Maintainer

Actually - if I were to make such a PDF, I would create

a PDF field as it is done
give that field a name and a comment that is shown when the mouse hovers over it
maybe an explanatory standard text to the left or right of that field

I would never ever write standard text underneath such a field - as has happened here! What purpose does that have?!
This is clearly bad PDF creation style.

But as usual with PDF, Murphy's Law applies: whatever is possible will happen sooner or later.

7 replies

meghanaviyyapu Aug 30, 2023
Author

Ok thanks a lot. Will try. Came across an article related to table extraction by PyMuPDF. Was really looking forward to it. Can you let me know if font properties of table headers and table data can also be extracted by the new method?

JorjMcKie Aug 30, 2023
Maintainer

For every identified table, you get the text of each cell, but also the coordinates (bbox) of each cell. So you can re-extract the text of a cell with all detail, e.g. get_text("dict", clip=cell_bbox).
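As a sketch of that re-extraction, the following collects the font name and size of every span inside one cell. The helper name cell_fonts is hypothetical, and the usage lines assume that page.find_tables() returns an object with a tables list whose items expose their cell bboxes as cells.

```python
def cell_fonts(page, cell_bbox):
    """Return the set of (font, size) pairs of the spans inside one table cell."""
    fonts = set()
    for block in page.get_text("dict", clip=cell_bbox)["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                fonts.add((span["font"], round(span["size"], 1)))
    return fonts


# Possible usage with PyMuPDF ("some.pdf" is a placeholder):
#   import fitz
#   page = fitz.open("some.pdf")[0]
#   for table in page.find_tables().tables:
#       for bbox in table.cells:
#           if bbox is not None:
#               print(bbox, cell_fonts(page, bbox))
```

Comparing the result for header cells with that for data cells should show whether the headers use a different font or size.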

meghanaviyyapu Aug 30, 2023
Author

OK. Is it capable of handling complex tables as well, like tables with rows spanning multiple columns?

JorjMcKie Aug 30, 2023
Maintainer

No - the code is an improved port of pdfplumber's features.
Improved here means that we can also detect table column headers, and that we are at least 5 times faster.
But just try it out and see how far it takes you.
Don't forget to read the documentation and to inspect the example Jupyter notebooks here.

meghanaviyyapu Aug 30, 2023
Author

Sure thanks

meghanaviyyapu · 2023-08-31T10:25:02Z

meghanaviyyapu
Aug 31, 2023
Author

Can you let me know what the rectangles on page 7 of the PDF below represent? Are they graphics?
ieeecls.pdf

2 replies

JorjMcKie Aug 31, 2023
Maintainer

No, they are not graphics, but so-called inline images: all image information is part of the page's /Contents object.
Because they have no xref, you don't see them using page.get_images().
But PyMuPDF doesn't let you down!
To get information about all page images, use page.get_image_info(...). To extract them, use text extraction: page.get_text("dict", ...).
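A small sketch of the extraction route: in the "dict" output, image blocks have type 1 and carry the image bytes under "image" and a file extension under "ext". The helper name image_blocks and the output file names are made up for illustration.

```python
def image_blocks(page_dict):
    """Return the image blocks (type 1) of a page.get_text("dict") result."""
    return [block for block in page_dict["blocks"] if block["type"] == 1]


# Possible usage with PyMuPDF (file name from this thread; page 7 is index 6):
#   import fitz
#   page = fitz.open("ieeecls.pdf")[6]
#   for i, block in enumerate(image_blocks(page.get_text("dict"))):
#       print(block["bbox"], block["width"], block["height"], block["ext"])
#       with open(f"page7-img{i}.{block['ext']}", "wb") as fh:
#           fh.write(block["image"])
```

This should pick up inline images as well as ordinary ones, since both appear as image blocks on the page.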

Answer selected by JorjMcKie

meghanaviyyapu Aug 31, 2023
Author

Ok thanks
