How to programmatically determine whether the text from a page is garbled? #3465

animebing · 2024-05-11T09:33:50Z

I am processing thousands of PDF files. In some of them, the text extracted from certain pages is garbled. I would like to detect those pages and then use OCR to get their text instead.

What I have done: based on https://github.com/pymupdf/PyMuPDF/issues/530 and https://github.com/pymupdf/PyMuPDF/issues/365, I know I can list a page's fonts and check whether each one has a /ToUnicode entry. This works sometimes, but some fonts have no /ToUnicode and their text still extracts normally. Is there something I am missing to make this reliable?
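For reference, here is a minimal sketch of a different approach: looking at the extracted text itself instead of the fonts. It assumes that garbled pages show up as many U+FFFD replacement characters, private-use code points, or control characters. The function names and the 0.1 threshold are hypothetical choices for illustration, not anything from PyMuPDF, and the threshold would need tuning on real files.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look suspicious."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # U+FFFD is the Unicode replacement character.
        # Co = private use, Cn = unassigned, Cc = control character.
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            suspicious += 1
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.1) -> bool:
    """Assumed heuristic: flag text whose suspicious ratio exceeds the threshold."""
    return garbled_ratio(text) > threshold


def pages_needing_ocr(pdf_path: str, threshold: float = 0.1) -> list:
    """Return 0-based page numbers whose extracted text looks garbled."""
    import fitz  # PyMuPDF, imported lazily so the heuristic works without it

    with fitz.open(pdf_path) as doc:
        return [
            page.number
            for page in doc
            if looks_garbled(page.get_text(), threshold)
        ]
```

A heuristic like this cannot catch pages where a font maps glyphs to valid but wrong characters, so it would probably need to be combined with something else, such as a dictionary or language check on the extracted words.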

Replies: 0 comments
