Mingling Korean Doc Image with Korean Text #2372

cweiler-blumatix · 2023-04-26T09:12:25Z

Hi,

I have the following issue: my current OCR does not support Korean, but it does seem to support text extraction from a text PDF containing Korean characters. So I thought I could use another OCR engine for Korean and create a text PDF that my existing OCR will pick up.

So far, I have managed to:

Use EasyOCR to retrieve the Korean text from a document (PNG image)
Create a new PDF with PyMuPDF that includes the text

import fitz  # PyMuPDF

# Set up a new page with the same dimensions as the source image
page_pdf = doc.new_page(width=image.width, height=image.height)
page_rect = fitz.Rect(0, 0, image.width, image.height)
page_pdf.insert_image(page_rect, pixmap=image)
...
# Font: embed the CJK font under the reference name "F0"
font = fitz.Font(insert_text_font)
if insert_text_font == "cjk":
    page_pdf.insert_font(fontname="F0", fontbuffer=font.buffer)
    insert_text_font = "F0"
...
# Insert the recognized text, centered (align=1) inside its bounding box
font_size_opt = select_font_size(bbox['Height'], bbox['Width'], text, font)
res = page_pdf.insert_textbox(rect, text, fontsize=font_size_opt, fontname=insert_text_font, align=1, stroke_opacity=instert_text_opacity, fill_opacity=instert_text_opacity)
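
The `select_font_size` helper is not shown in the post. As a rough sketch of what such a helper might look like (not the author's actual implementation), the version below picks the largest size at which the text fits on a single line. It assumes `font` exposes `text_length(text, fontsize=...)`, as `fitz.Font` does, and it assumes a line height of roughly 1.2 times the font size; the `max_size` and `min_size` parameters are hypothetical additions.

```python
def select_font_size(height, width, text, font, max_size=72.0, min_size=1.0):
    """Return the largest font size at which `text` fits on one line in the box.

    `font` is assumed to offer text_length(text, fontsize=...), like fitz.Font.
    """
    if not text:
        return min_size
    # Text width grows linearly with font size, so measure once at size 1.
    unit_width = font.text_length(text, fontsize=1)
    if unit_width <= 0:
        return min_size
    size_by_width = width / unit_width
    # Assumption: one line needs about 1.2 * fontsize of vertical space.
    size_by_height = height / 1.2
    return max(min_size, min(size_by_width, size_by_height, max_size))
```

It is also worth checking the value returned by `insert_textbox`: a negative `res` means the text did not fit into `rect` and nothing was written.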

Now I can successfully copy the text from the PDF in Acrobat Reader and paste it into, e.g., a text editor.

However, my OCR does not pick it up. Am I missing something substantial?
I understand that a font is embedded for CJK, but somehow my existing OCR does not seem to pick up the embedded text at all.
Any help or ideas are welcome.

Thanks

JorjMcKie (Maintainer) · 2023-04-26T09:50:33Z

What is your "current OCR"? EasyOCR? Why not try PyMuPDF's built-in interface to Tesseract OCR (which must be installed, of course)? When invoking it, make sure to pass the language specification for Korean.
A missing language specification might also be the problem with your current OCR engine.
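
As a minimal sketch of that suggestion, the helper below OCRs a whole page through `Page.get_textpage_ocr` with the Korean language code. The function name `ocr_page_text` and the 300 dpi default are assumptions for illustration. Tesseract and its Korean language data must be installed and locatable (for example via the `TESSDATA_PREFIX` environment variable).

```python
def ocr_page_text(page, language="kor", dpi=300):
    """OCR a whole fitz.Page with Tesseract and return the recognized text."""
    # full=True renders the entire page to an image and OCRs that image.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    # Reuse the OCR result as the text source for normal text extraction.
    return page.get_text("text", textpage=textpage)

# Usage (needs PyMuPDF and Tesseract; "scan.pdf" is a placeholder file name):
#   import fitz
#   doc = fitz.open("scan.pdf")
#   print(ocr_page_text(doc[0]))
```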

JorjMcKie (Maintainer) · 2023-04-26T10:56:49Z

Ah sorry, I did not fully understand your comment at first.
No: an OCR engine only interprets images, so standard text is not taken into account.
In PyMuPDF, there are ways to deal with a mixture of OCRed and standard text.
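
One such way, sketched here under assumptions, is to call `Page.get_textpage_ocr` with `full=False`, which keeps the page's standard text and OCRs only its images. The helper names `has_text_layer` and `mixed_page_text` are made up for this example. On a page where the same text exists both in the image and as an overlaid text layer, this may return the text twice.

```python
def has_text_layer(page):
    """Return True if the page already carries extractable standard text."""
    return bool(page.get_text("text").strip())


def mixed_page_text(page, language="kor", dpi=300):
    """Return the page's standard text plus text OCRed from its images."""
    # full=False keeps existing text and sends only the images to Tesseract.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=False)
    return page.get_text("text", textpage=textpage)
```

Checking `has_text_layer(page)` first makes it possible to skip OCR entirely for pages that already contain the text.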

cweiler-blumatix (Author) · Apr 26, 2023

Thanks. We use Nuance, which can extract the text layer from a PDF and use it, but not in this case. I see that it runs into an exception because it tries to create huge images. Maybe that is related to the embedded font; I am not sure, but that is outside the scope of this discussion. Anyway, thanks for your quick reply.
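
One possible contributor to the huge images, unconfirmed in this thread, is the page size: the snippet in the question passes the image's pixel dimensions to `new_page`, which interprets them as points (1/72 inch), so a high-resolution scan yields a very large page. A minimal sketch of converting pixels to points follows. The helper name `page_size_points` is hypothetical, and the scan resolution (300 dpi in the usage comment) is an assumption that has to match the actual image.

```python
def page_size_points(width_px, height_px, dpi):
    """Convert pixel dimensions to PDF points (1 point = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = 72.0 / dpi
    return width_px * scale, height_px * scale

# Usage with the variables from the question (300 dpi is assumed):
#   width_pt, height_pt = page_size_points(image.width, image.height, 300)
#   page_pdf = doc.new_page(width=width_pt, height=height_pt)
```

Any bounding boxes taken from the OCR result would then need to be scaled by the same factor before calling `insert_textbox`.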
