how to convert scanned pdf to searchable pdf #2474

Laxmi530 · 2023-06-16T12:16:02Z

Hi,
I am trying to convert a scanned (non-searchable) PDF to a searchable PDF using pytesseract. This works for a one-page PDF, but with a multi-page PDF only the last page ends up in the output.
Can someone please help? Is there a method for this in PyMuPDF? Below is the sample code I am using.

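import fitz  # PyMuPDF
import pytesseract

file = r"C:\Users\name\Downloads\Scanned_pdf.pdf"
file_pdf = open('Converted_searchable.pdf', 'ab')
text_data = []
with fitz.open(file) as pdf:
    for page in pdf:
        # pdf_text_box_image: helper defined elsewhere (not shown); its third return value is the page image
        _, _, img = pdf_text_box_image(page)
        # each call returns a complete one-page PDF as bytes
        text_data.append(pytesseract.image_to_pdf_or_hocr(img))
# the one-page PDFs are written back to back into a single file
file_pdf.write(b"".join(text_data))
file_pdf.close()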

Thanks in advance.

JorjMcKie · 2023-06-16T13:51:26Z

Maintainer

There is more than one option:

You can use the importable version of OCRmyPDF.
You can use PyMuPDF's built-in Tesseract support as follows:

import fitz

DPI = 150  # desired resolution

src = fitz.open("input.pdf")
doc = fitz.open()  # output PDF with text layer

for page in src:
    pix = page.get_pixmap(dpi=DPI)
    imgpdf = fitz.open("pdf", pix.pdfocr_tobytes())  # make 1-page temp PDF with text layer
    doc.insert_pdf(imgpdf)  # append page
    imgpdf.close()

doc.save("input-ocr.pdf")

10 replies

Laxmi530 Jun 19, 2023
Author

Thanks @JorjMcKie, that would be great. Please let me know once you update the library.

Thanks in advance.

SkaarFacee Jul 1, 2024

Adding to this, I wanted to know whether there is scope to use an OCR engine other than Tesseract, say EasyOCR or PaddleOCR?

JorjMcKie Jul 1, 2024
Maintainer

No, only Tesseract has a built-in API in (Py)MuPDF.

SkaarFacee Jul 2, 2024

Hey @JorjMcKie 👋🏻, thanks for the reply. Are there any plans to make it possible to pass in a raw OCR response and then continue the task from there?

JorjMcKie Jul 2, 2024
Maintainer

There are no such plans for the time being.
