Some contents in pdf page is not identifying either as text or image #1876

UdayaKUnnikrishnan · 2022-08-16T09:03:58Z

A PDF page contains text that cannot be copied. PyMuPDF identifies such regions as neither text nor image.
Please help: how can I extract this content from the page?

Answered by JorjMcKie

JorjMcKie · 2022-08-16T09:12:25Z

Maintainer

You must provide an example, otherwise I cannot help.
As is typical for PDF, there are a number of possible explanations. Among them:
So-called "line art" can be used to simulate text. For example, a capital "D" can be drawn as a vertical line "|" plus a left-open semicircle.
Line art is neither text nor image ...
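
To check whether a page falls into the line-art case, one option is to compare what PyMuPDF reports as text, images, and vector drawings. The following is only a minimal sketch: the helper names and the "drawings but nothing else" heuristic are assumptions for illustration, not part of PyMuPDF. It relies on `page.get_text("dict")` and `page.get_drawings()`.

```python
def classify_page_content(page):
    """Count text blocks, image blocks and vector drawings on a page.

    `page` is expected to behave like a PyMuPDF Page object.
    """
    blocks = page.get_text("dict")["blocks"]
    # In the "dict" output, block type 0 is text and type 1 is an image.
    text_blocks = sum(1 for b in blocks if b["type"] == 0)
    image_blocks = sum(1 for b in blocks if b["type"] == 1)
    # get_drawings() returns one dict per vector path (line art).
    drawings = len(page.get_drawings())
    return {"text": text_blocks, "images": image_blocks, "drawings": drawings}


def looks_like_line_art_only(counts):
    """Illustrative heuristic (an assumption): paths present, no text, no images."""
    return counts["drawings"] > 0 and counts["text"] == 0 and counts["images"] == 0
```

With PyMuPDF installed, this might be called as `classify_page_content(fitz.open("input.pdf")[0])`, where `input.pdf` is a placeholder file name. A page with many drawings and little or no text is a hint, not proof, that visible text was drawn as paths.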

4 replies

UdayaKUnnikrishnan Aug 16, 2022
Author

Thank you for the quick reply. I cannot share the PDF because it is confidential. I used page.get_bboxlog() to get the list of rectangles on the page and created a sample PDF from that information. I am attaching the PDF created from these rectangles.
sample.pdf

The text inside the fill-path rectangles (marked with red lines) cannot be extracted. I need to extract the text from these regions.
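
For readers who want to reproduce this kind of inspection, here is a rough sketch of filtering the `page.get_bboxlog()` output for path entries. It assumes each entry starts with a type string followed by a rectangle tuple; the function name `path_rects` is made up for this example.

```python
# Entry types that indicate vector paths (line art) rather than text or images.
PATH_TYPES = {"fill-path", "stroke-path"}


def path_rects(bboxlog):
    """Return the rectangles of all fill-path / stroke-path entries.

    Each entry is expected to start with (type, (x0, y0, x1, y1)),
    as in the list returned by page.get_bboxlog().
    """
    return [tuple(entry[1]) for entry in bboxlog if entry[0] in PATH_TYPES]
```

With PyMuPDF, a call such as `path_rects(page.get_bboxlog())` would then give the candidate regions where text may have been drawn as line art.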

JorjMcKie Aug 16, 2022
Maintainer

Well done! "Fill path" and "stroke path" are areas where line art exists.
So my assumption seems to be correct: it is text simulated by line art.
While you certainly can extract those drawings themselves, you will of course not be able to interpret them as text.
This means your only option is OCR-ing the page and then extracting the recognized text.
PyMuPDF supports OCR via an installed Tesseract software - presumably you know that and how to get it going.
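
A minimal sketch of that OCR route, assuming Tesseract is installed and PyMuPDF can find its language data (typically via the `TESSDATA_PREFIX` environment variable). The function name and the default `dpi` of 300 are choices made for this example only.

```python
def ocr_page_text(page, language="eng", dpi=300):
    """OCR a whole page and return the recognized plain text.

    `page` is expected to behave like a PyMuPDF Page object.
    """
    # full=True requests OCR of the complete rendered page, not only of
    # embedded images, so text drawn as vector paths is covered as well.
    textpage = page.get_textpage_ocr(flags=0, language=language, dpi=dpi, full=True)
    # Reuse the OCR result for normal text extraction.
    return page.get_text("text", textpage=textpage)
```

For example, `ocr_page_text(fitz.open("input.pdf")[0])` with a placeholder file name. Recognition quality depends on the rendering resolution and on how closely the line art resembles a regular font.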

JorjMcKie Aug 16, 2022
Maintainer

Or install / use ocrmypdf, OCR the file, then process its output PDF as usual.
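
The ocrmypdf route could look roughly like the sketch below, which calls the `ocrmypdf` command-line tool from Python. The helper names and file names are placeholders. Using `--force-ocr` is an assumption about what fits this case: it rasterizes each page before OCR, which also covers text drawn as vector paths but discards the original vector content in the output.

```python
import subprocess


def build_ocrmypdf_command(src, dst, force_ocr=True):
    """Build the argument list for an ocrmypdf run."""
    cmd = ["ocrmypdf"]
    if force_ocr:
        # Rasterize every page and OCR it, even if some text already exists.
        cmd.append("--force-ocr")
    cmd += [src, dst]
    return cmd


def run_ocrmypdf(src, dst, force_ocr=True):
    """Run ocrmypdf; raises CalledProcessError if the tool reports failure."""
    subprocess.run(build_ocrmypdf_command(src, dst, force_ocr), check=True)
```

After `run_ocrmypdf("input.pdf", "output.pdf")`, the output file carries a text layer, so the usual `page.get_text()` in PyMuPDF should return the recognized text.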

UdayaKUnnikrishnan Aug 16, 2022
Author

Thank you so much for your support. Yes, I know about the OCR support in PyMuPDF.
