There is no failsafe answer to this. But a few heuristics can at least help:

  1. A scanned page will presumably be completely covered by one image (or two, depending on the scanner used). So you can look at the images on the page and compare their rectangles with the page rectangle. The image rectangle may not be exactly equal to the page rectangle, so do not check for equality but allow for some deviation, like abs(image_bbox & page.rect) / abs(page.rect) >= 0.95. This means that the intersection of image and page should cover at least 95% of the page area ... you get the idea.
  2. If text can be extracted and Tesseract was used for OCR, then a specific fontname, "G…
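
The coverage check from heuristic 1 could be sketched roughly like this. It is a plain-Python illustration, not PyMuPDF code: rectangles are (x0, y0, x1, y1) tuples, and the helper names (`area`, `intersect`, `coverage`, `looks_scanned`) are made up for this example. To handle the two-image case it simply adds up the per-image coverage, which assumes the images do not overlap each other.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    """Area of a rectangle; empty or inverted rectangles count as 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersect(a: Box, b: Box) -> Box:
    """Intersection of two rectangles (may be empty/inverted)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def coverage(image_box: Box, page_box: Box) -> float:
    """Fraction of the page area covered by one image rectangle."""
    page_area = area(page_box)
    if page_area == 0:
        return 0.0
    return area(intersect(image_box, page_box)) / page_area


def looks_scanned(image_boxes: Iterable[Box], page_box: Box,
                  threshold: float = 0.95) -> bool:
    """Heuristic: do the images together cover at least `threshold` of the page?

    Assumption (not from the original answer): images do not overlap,
    so their coverage fractions can be summed and capped at 1.0.
    """
    total = min(1.0, sum(coverage(b, page_box) for b in image_boxes))
    return total >= threshold


a4 = (0.0, 0.0, 595.0, 842.0)
full_scan = [(2.0, 3.0, 593.0, 840.0)]        # slightly smaller than the page
logo_only = [(50.0, 50.0, 150.0, 100.0)]      # small image, not a scan
two_halves = [(0.0, 0.0, 595.0, 421.0), (0.0, 421.0, 595.0, 842.0)]
```

With PyMuPDF, the inputs could presumably be taken from `tuple(page.rect)` and from the `"bbox"` entries returned by `page.get_image_info()`. The 0.95 threshold is the one suggested above and may need tuning for your documents.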

Answer selected by swathy-z9q