PDF with OCR does not extract #2182
-
Hello, my Python code using the docling library is as below:
Do I need to set any additional parameters? Thanks for the help.
-
To ensure OCR works for scanned/image-based PDFs, enable OCR in the pipeline options (`do_ocr=True`, plus `force_full_page_ocr = True`, as in the example below). Also, make sure you have an OCR engine installed (EasyOCR, Tesseract, or RapidOCR). For better compatibility with problematic PDFs, set the PDF backend to `pypdfium2`.

Here's an example of how to configure your pipeline:

    import multiprocessing

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, AcceleratorOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pdf_options = PdfPipelineOptions(
        do_code_enrichment=True,
        do_ocr=True,
        do_table_structure=True,
        do_picture_classification=False,
        accelerator_options=AcceleratorOptions(
            num_threads=multiprocessing.cpu_count(),
            device='auto'
        ),
        pdf_backend='pypdfium2'
    )
    # Run OCR on the whole page instead of only on detected bitmap regions
    pdf_options.ocr_options.force_full_page_ocr = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)
        },
    )

If you still have issues, check for error logs or warnings during extraction to spot missing dependencies or misconfigurations.
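Before digging through logs, it can help to confirm that at least one of the OCR engines mentioned above is importable in the current environment. The following is only a rough sketch using the standard library. The import names in `OCR_ENGINE_MODULES` are assumptions about how these packages are typically installed (they are not taken from this thread), and `installed_ocr_engines` is a hypothetical helper name.

```python
import importlib.util

# Assumed import names for the engines mentioned above; adjust them to
# match the packages you actually installed.
OCR_ENGINE_MODULES = {
    "EasyOCR": "easyocr",
    "Tesseract": "tesserocr",
    "RapidOCR": "rapidocr_onnxruntime",
}


def installed_ocr_engines(modules=OCR_ENGINE_MODULES):
    """Return the engine names whose Python module can be located."""
    return [
        name
        for name, module_name in modules.items()
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module_name) is not None
    ]


if __name__ == "__main__":
    found = installed_ocr_engines()
    print("OCR engines found:", ", ".join(found) if found else "none")
```

If the list comes back empty, installing one of the engines is probably the first thing to try before changing any pipeline options.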
-
Thank you for the quick feedback.
-
instead of
I got
Is this correct?
You're right: `pdf_backend` is not a parameter of `PdfPipelineOptions` in docling 2.50.0. Instead, backend selection is handled outside of the pipeline options. If you're using the CLI, you can set the backend with the `--pdf-backend=pypdfium2` flag. For Python code, you should specify the backend when creating your `PdfFormatOption`, like this:
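A minimal sketch of that configuration, assuming a docling 2.x install where `PdfFormatOption` accepts a `backend` argument and `PyPdfiumDocumentBackend` is importable from `docling.backend.pypdfium2_backend`. `build_converter` is a hypothetical helper name, and the docling imports are placed inside it so the snippet can be loaded even before docling is installed.

```python
def build_converter():
    """Build a DocumentConverter that uses the pypdfium2 backend with full-page OCR."""
    # Imported lazily so that defining this helper does not require docling.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pdf_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
    # Force OCR over the whole page, which is what scanned PDFs need.
    pdf_options.ocr_options.force_full_page_ocr = True

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pdf_options,
                # The backend is chosen here, not inside PdfPipelineOptions.
                backend=PyPdfiumDocumentBackend,
            )
        },
    )
```

With docling installed, `build_converter().convert("scanned.pdf")` should return a conversion result whose `document.export_to_markdown()` gives the extracted text. The file name here is only a placeholder.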