PDF with OCR does not extract #2182

enx-github · 2025-09-03T13:18:00Z

enx-github
Sep 3, 2025

Hello
sample.pdf
I have attached a sample PDF above whose text is not extracted by either the docling library or docling-serve.

The Python code using the docling library is below:

import multiprocessing

from docling.datamodel.accelerator_options import AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_options = PdfPipelineOptions(
    do_code_enrichment=True,
    # this will help to get the text from OCR mode
    do_ocr=True,
    # this will extract proper table structure
    do_table_structure=True,
    # if this is true then it will have a placeholder in the exported text or markdown for each image
    do_picture_classification=False,
    accelerator_options=AcceleratorOptions(
        num_threads=multiprocessing.cpu_count(),  # Use all CPU cores
        device='auto',  # Or 'cuda' for NVIDIA GPU, 'mps' for Apple Silicon, 'cpu' for CPU only
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)
    },
)

Do I need to set any additional parameter? Thanks for the help.


2025-09-03T13:19:27Z

dosubot[bot]
bot Sep 3, 2025

To ensure OCR works for scanned/image-based PDFs, set ocr_options.force_full_page_ocr = True in your PdfPipelineOptions. This forces OCR to run even if a text layer is present, which is often necessary for proper extraction from scanned documents [source].

Also, make sure you have an OCR engine installed (EasyOCR, Tesseract, or RapidOCR) [source]. For better compatibility with problematic PDFs, set the PDF backend to 'pypdfium2' by adding pdf_backend='pypdfium2' to your pipeline options [source].

Here’s an example of how to configure your pipeline:

from docling.datamodel.pipeline_options import PdfPipelineOptions, AcceleratorOptions
import multiprocessing

pdf_options = PdfPipelineOptions(
    do_code_enrichment=True,
    do_ocr=True,
    do_table_structure=True,
    do_picture_classification=False,
    accelerator_options=AcceleratorOptions(
        num_threads=multiprocessing.cpu_count(),
        device='auto'
    ),
    pdf_backend='pypdfium2'
)
pdf_options.ocr_options.force_full_page_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)
    },
)

If you still have issues, check for error logs or warnings during extraction to spot missing dependencies or misconfigurations.
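One way to surface those warnings is to turn on Python's standard logging before running the conversion. This is a minimal sketch, assuming docling emits its messages through loggers whose names start with `docling`; the helper name `enable_conversion_logging` is hypothetical and not part of docling.

```python
import logging


def enable_conversion_logging(level: int = logging.INFO) -> logging.Logger:
    """Send log records to the console and return the 'docling' logger."""
    # basicConfig does nothing if the root logger already has handlers.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )
    # Assumption: docling's modules log under names starting with "docling".
    logger = logging.getLogger("docling")
    logger.setLevel(level)
    return logger


logger = enable_conversion_logging(logging.INFO)
```

Call the helper once, before `converter.convert(...)`, so that warnings about missing OCR dependencies or skipped pages show up on the console.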

To reply, just mention @dosu.


0 replies

enx-github · 2025-09-03T13:45:26Z

enx-github
Sep 3, 2025
Author

Thank you for the quick feedback.
I have updated docling to version 2.50.0.
However, "PdfPipelineOptions" does not accept the parameter below:

pdf_backend='pypdfium2'

1 reply

dosubot[bot] bot Sep 3, 2025

You're right—pdf_backend is not a parameter of PdfPipelineOptions in docling 2.50.0. Instead, backend selection is handled outside of the pipeline options. If you're using the CLI, you can set the backend with the --pdf-backend=pypdfium2 flag.

For Python code, you should specify the backend when creating your PdfFormatOption, like this:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.backends.pdf_backend import PyPdfiumDocumentBackend  # adjust import if needed

pdf_options = PdfPipelineOptions(
    do_code_enrichment=True,
    do_ocr=True,
    do_table_structure=True,
    do_picture_classification=False,
    # ... other options ...
)
pdf_options.ocr_options.force_full_page_ocr = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pdf_options,
            backend=PyPdfiumDocumentBackend
        )
    }
)

This approach lets you use the pypdfium2 backend in your code, even though pdf_backend isn't a direct parameter of PdfPipelineOptions in this version [reference].
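Putting the pieces together, the configuration can be sketched as one end-to-end helper. It uses the import path `docling.backend.pypdfium2_backend`, which is the one confirmed later in this thread for docling 2.50.0. The function names `build_converter` and `pdf_to_markdown` are hypothetical, and the docling imports are kept inside the function so the module still loads where docling is not installed.

```python
from pathlib import Path


def build_converter():
    """Build a DocumentConverter using the pypdfium2 backend and full-page OCR."""
    # Local imports: docling is only required when this function is called.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pdf_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
    # Run OCR on the whole page even if the PDF carries a text layer.
    pdf_options.ocr_options.force_full_page_ocr = True

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pdf_options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert one PDF and return its content as Markdown."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    result = build_converter().convert(path)
    return result.document.export_to_markdown()
```

For example, `print(pdf_to_markdown("sample.pdf"))` would print the extracted text if the file exists and an OCR engine is installed.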

For best OCR results, keep do_ocr=True and set force_full_page_ocr = True as shown above [reference]. Make sure your OCR engine is installed and available.
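A quick way to check whether an OCR engine is available in the current environment is to probe for the Python packages without importing them. This is only a sketch: the module names below (`easyocr`, `tesserocr`, `rapidocr_onnxruntime`) are assumptions about how these engines are usually packaged, and the helper `available_ocr_engines` is hypothetical.

```python
import importlib.util
import shutil


def available_ocr_engines() -> dict:
    """Report which OCR engines appear to be installed."""
    # Assumed top-level module names; adjust to match your installation.
    candidates = {
        "easyocr": "easyocr",
        "tesserocr": "tesserocr",
        "rapidocr": "rapidocr_onnxruntime",
    }
    found = {
        label: importlib.util.find_spec(module) is not None
        for label, module in candidates.items()
    }
    # The Tesseract command-line binary is a separate install from any Python binding.
    found["tesseract_cli"] = shutil.which("tesseract") is not None
    return found


engines = available_ocr_engines()
for name, present in engines.items():
    print(f"{name}: {'found' if present else 'missing'}")
```

If every entry reports missing, OCR cannot run regardless of how the pipeline options are set.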


Answer selected by enx-github

enx-github · 2025-09-03T14:02:43Z

enx-github
Sep 3, 2025
Author

Instead of

from docling.backends.pdf_backend import PyPdfiumDocumentBackend  # adjust import if needed

I found

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend  # adjust import if needed

Is this correct?

1 reply

dosubot[bot] bot Sep 3, 2025

Yes, your import is correct for docling 2.50.0:

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

The pypdfium2_backend.py file is located in the docling/backend directory, so this is the right path to use for importing PyPdfiumDocumentBackend in your version [reference].
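When an import path is in doubt, the installed package can be checked directly instead of guessing. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `module_exists` is hypothetical. Note that for a dotted name, `find_spec` imports the parent package as a side effect.

```python
import importlib.util


def module_exists(dotted_name: str) -> bool:
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package in the dotted path does not exist.
        return False


for name in (
    "docling.backend.pypdfium2_backend",  # path confirmed in this thread
    "docling.backends.pdf_backend",       # path from the earlier answer
):
    print(name, "->", module_exists(name))
```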

