How to identify scanned pdf #1653

swathy-z9q · 2022-03-28T07:45:55Z

Hi,

I am looking for ideas on how to identify a PDF as scanned.

I came across a PDF in which a scanned image had been converted to PDF by the producer "www.ilovepdf.com". I have also found many other producers that run OCR on scanned images and then convert them to PDFs. When such a file is opened with Fitz, it contains text generated by the producer, so it looks like a native PDF or one generated from a Word document (that is, not a scanned PDF).

I need help classifying a PDF as scanned or native. Is there a way to use the PDF metadata for this? Please suggest any ideas.

Thanks in advance

Answered by JorjMcKie

JorjMcKie · 2022-03-28T09:41:31Z

JorjMcKie
Mar 28, 2022
Maintainer

There is no failsafe answer to this. But a few heuristics can at least help:

A scanned page will presumably be completely covered by an image (or by two images, depending on the scanner used). So you can look at the images on the page and compare their rectangles with the page rectangle. The image rectangle may not be exactly equal to the page rectangle, so do not check for equality but allow for some deviation, like abs(image_bbox & page.rect) / abs(page.rect) >= 0.95. This means that the intersection of image and page should cover at least 95% of the page ... you get the idea.
If text can be extracted and Tesseract was used for OCR, then a specific font name, "GlyphlessFont", will be found in the text extracted with page.get_text("dict"). It should also show up in page.get_fonts().
Depending on the scanner, some information may also be put into the PDF metadata. This is obviously highly dependent on the product used.
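
A rough sketch of how these heuristics might be combined, not a definitive implementation. The helper names (coverage, has_fullpage_image, uses_glyphless_font, classify_pdf) are made up for illustration, and the 0.95 threshold is just the example value from above. It assumes a PyMuPDF version that provides page.get_image_info(), and it compares the font name case-insensitively on the assumption that the capitalization can vary. Each image is checked on its own, so a page split across two images would need an extra step.

```python
GLYPHLESS = "glyphlessfont"  # font name from the answer above, lowercased


def area(bbox):
    """Area of an (x0, y0, x1, y1) box; empty or inverted boxes count as 0."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def coverage(image_bbox, page_bbox):
    """Fraction of the page area covered by the image/page intersection."""
    inter = (
        max(image_bbox[0], page_bbox[0]),
        max(image_bbox[1], page_bbox[1]),
        min(image_bbox[2], page_bbox[2]),
        min(image_bbox[3], page_bbox[3]),
    )
    page_area = area(page_bbox)
    return area(inter) / page_area if page_area else 0.0


def has_fullpage_image(page, threshold=0.95):
    """True if any single image on the page covers at least `threshold` of it."""
    page_bbox = tuple(page.rect)
    return any(
        coverage(info["bbox"], page_bbox) >= threshold
        for info in page.get_image_info()
    )


def uses_glyphless_font(page):
    """True if one of the page's fonts looks like the Tesseract OCR font."""
    # Each get_fonts() entry is a tuple; item 3 is the base font name.
    return any(GLYPHLESS in str(font[3]).lower() for font in page.get_fonts())


def classify_pdf(path, threshold=0.95):
    """Collect the per-page hints plus the producer string from the metadata."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    with fitz.open(path) as doc:
        producer = (doc.metadata or {}).get("producer") or ""
        pages = [
            {
                "page": page.number,
                "fullpage_image": has_fullpage_image(page, threshold),
                "glyphless_font": uses_glyphless_font(page),
            }
            for page in doc
        ]
    return {"producer": producer, "pages": pages}
```

Calling classify_pdf("some.pdf") (the file name is a placeholder) returns the hints per page. How to weigh them is a judgment call: a full-page image together with the OCR font is a strong sign of a scanned-and-OCRed page, while the producer string is only a weak hint.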

swathy-z9q Mar 28, 2022
Author

Thanks a lot for sharing these approaches 👍
