A way to detect if a page has text saved as geometry and needs OCR #3558
idoglanz asked this question in Looking for help
-
There is no precise formula to detect this ... only heuristics. So much for elegance 🤷‍♂️ 😒.
If a page contains text given as drawing paths, we can at least enumerate a number of necessary, albeit not sufficient, conditions:
import pymupdf

doc = pymupdf.open()  # new, empty PDF
page = doc.new_page()
page.insert_text((100, 100), "Some text")
svg = page.get_svg_image()  # get SVG from page
svgdoc = pymupdf.open("svg", svg.encode())  # document from SVG source
pdfdata = svgdoc.convert_to_pdf()  # make a PDF from it again
svgpdf = pymupdf.open("pdf", pdfdata)
svgpage = svgpdf[0]
svgpaths = svgpage.get_drawings()  # ok: we should expect exactly 8 paths: 1 for each character
for p in svgpaths:
    print(p["rect"].width, p["rect"].height)  # width and height
6.302978515625 8.404083251953125 # capital "S"
5.2139892578125 6.1819610595703125 # "o", etc.
7.6120147705078125 5.9288482666015625
5.2030792236328125 6.1819610595703125
2.6400604248046875 7.601104736328125
5.2030792236328125 6.1819610595703125
5.01593017578125 5.76385498046875
2.6400604248046875 7.601097106933594
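Building on the output above (each character became one small path, a few points wide and high), here is a minimal sketch of one possible heuristic. It assumes that a page with almost no extractable text but many small, glyph-sized paths is a candidate for OCR. The function names and all thresholds (`max_side`, `min_paths`, `min_fraction`, `max_text_chars`) are hypothetical and would need tuning on real documents; as said, these conditions are necessary at best, not sufficient.

```python
def glyph_like_fraction(sizes, min_side=0.5, max_side=20.0):
    """Fraction of (width, height) pairs that are roughly character-sized.

    The size bounds are assumptions for ordinary body text, not fixed rules.
    """
    sizes = list(sizes)
    if not sizes:
        return 0.0
    hits = sum(
        1
        for w, h in sizes
        if min_side <= w <= max_side and min_side <= h <= max_side
    )
    return hits / len(sizes)


def page_may_need_ocr(page, min_paths=50, min_fraction=0.8, max_text_chars=20):
    """Heuristic only: True if the page has little real text but many small paths.

    `page` is expected to behave like a pymupdf.Page: get_text() returns the
    extractable text, get_drawings() returns dicts with a "rect" entry.
    """
    text = page.get_text().strip()
    if len(text) > max_text_chars:
        return False  # enough real text: probably no OCR needed
    sizes = [(p["rect"].width, p["rect"].height) for p in page.get_drawings()]
    if len(sizes) < min_paths:
        return False  # too few paths to plausibly be a page of text
    return glyph_like_fraction(sizes) >= min_fraction
```

A page that passes this check is only a suspect: dense technical drawings or hatch patterns can produce many small paths too, so a positive result would still need confirming by OCR.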
-
Hi! First of all, this library is great and extremely useful!
I encountered something and was wondering whether you have an elegant solution for it.
A lot of our PDFs have text saved in weird fonts, so the text is stored as geometry rather than as text.
To work around this, we just run OCR and use the extracted text.
My question is: can you think of an efficient way to detect whether a page needs OCR (i.e. it has text saved as geometry)? One brute-force way would be to run OCR anyway, pick some of the recognized words and search for them in the original TextPage, but I'm trying to avoid that initial OCR pass.
Thank you!
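For reference, the brute-force comparison described above could be sketched as follows. This is only an illustration: the OCR step itself is left out, and `ocr_words` is assumed to be a list of strings produced by whatever OCR engine is in use. The helper names are hypothetical; the only PyMuPDF call used is `page.get_text("words")`, whose items carry the word string at index 4.

```python
def normalize(word):
    """Lowercase and keep only letters and digits, so punctuation is ignored."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def missing_fraction(ocr_words, text_layer_words):
    """Fraction of OCR'ed words that do not occur in the page's text layer."""
    known = {normalize(w) for w in text_layer_words}
    known.discard("")
    sample = [w for w in (normalize(w) for w in ocr_words) if w]
    if not sample:
        return 0.0  # nothing recognized by OCR, so nothing is missing
    missing = sum(1 for w in sample if w not in known)
    return missing / len(sample)


def text_layer_words(page):
    """Words of the page's extractable text (page behaves like a pymupdf.Page)."""
    return [item[4] for item in page.get_text("words")]
```

A value close to 1.0 would suggest that most of the visible text is not in the text layer; where to draw the line is a judgment call, since OCR errors alone will produce some misses.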