How to identify scanned pdf #1653
-
Hi, I need ideas on how to identify a PDF as scanned. I came across a PDF in which a scanned image was converted to PDF by the producer "www.ilovepdf.com", and I have identified many other producers that run OCR on scanned images and then convert them to PDFs. When opened with Fitz, such a file contains text content generated by the producer, so it looks like a native PDF or a PDF generated from a Word document (i.e. not a scanned PDF). I need help classifying a PDF as scanned or native. Is there a way to use the PDF metadata for this, or can you suggest other ideas? Thanks in advance.
Replies: 1 comment 1 reply
-
There is no failsafe answer to this. But a few heuristics can at least help:
abs(image_bbox & page.rect) / abs(page.rect) >= 0.95
This means that the intersection of the image and the page should cover at least 95% of the page area; you get the idea.
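A minimal sketch of that coverage check, in case it helps. The rectangle arithmetic is written out by hand on plain (x0, y0, x1, y1) tuples so it can be read without the library. The helper names `coverage` and `page_has_full_page_image` are made up for this illustration, and the second one assumes `page` is a PyMuPDF page offering `page.rect` and `page.get_image_info()`.

```python
def coverage(image_bbox, page_rect):
    """Fraction of the page area covered by the image.

    Both arguments are (x0, y0, x1, y1) sequences. This mirrors
    abs(image_bbox & page.rect) / abs(page.rect) from the reply.
    """
    ix0, iy0, ix1, iy1 = image_bbox
    px0, py0, px1, py1 = page_rect
    # Width and height of the intersection; zero if the rectangles do not overlap.
    width = max(0.0, min(ix1, px1) - max(ix0, px0))
    height = max(0.0, min(iy1, py1) - max(iy0, py0))
    page_area = (px1 - px0) * (py1 - py0)
    return (width * height) / page_area if page_area > 0 else 0.0


def page_has_full_page_image(page, threshold=0.95):
    """True if any image on the page covers at least `threshold` of it.

    Assumes `page` is a PyMuPDF page object (hypothetical helper, not
    part of the library itself).
    """
    page_rect = tuple(page.rect)
    return any(
        coverage(info["bbox"], page_rect) >= threshold
        for info in page.get_image_info()
    )
```

With PyMuPDF installed, something like `all(page_has_full_page_image(p) for p in fitz.open("input.pdf"))` would flag documents where every page is dominated by one image. The 0.95 threshold is the one from the reply. A positive result only says the page is image-dominated, so it is one heuristic among others, not a definitive test.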