I'm getting tons of empty OCR results with RapidOCR... using docling 2.57.... please help! #2666

nikdulac · 2025-11-20T17:42:05Z

nikdulac
Nov 20, 2025

2025-11-20 17:04:48 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!
[WARNING] 2025-11-20 17:04:48,236 [RapidOCR] main.py:123: The text detection result is empty
2025-11-20 17:04:57 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!

I'm using the settings below for PDF / PPTX docs, running on a massed A6000 instance (48/96GB). The files are mostly highly visual investment presentations, with not much real scanned content. Processing averages ~4 sec per page.

Current Docling Configuration

Device & Threading:
device: cuda (GPU acceleration via CUDA)
num_threads: 8 (CPU threads for parallel processing)
use_threaded_pdf: true (using ThreadedPdfPipelineOptions for GPU)

OCR Settings:
do_ocr: true (OCR enabled)
backend: onnxruntime (RapidOCR with ONNX Runtime - NOT torch)
force_full_page_ocr: false (only OCR when needed)

Batch Sizes (GPU optimization):
ocr_batch_size: 32
layout_batch_size: 64
table_batch_size: 4

Image Extraction:
generate_picture_images: true (extract images for vision processing)
images_scale: 1.0 (reduced from 1.5 for speed)
do_picture_description: false (no AI image descriptions)

@nikdulac · 2025-11-20T17:50:37Z

dosubot[bot]
bot Nov 20, 2025

Hi @nikdulac! I'm Dosu and I’m helping the docling team.

You're seeing empty OCR results because RapidOCR (and most OCR engines) are designed for scanned/image-based documents, not digital PDFs or PPTX files that already contain embedded, selectable text. When you run OCR on these digital documents, the engine tries to find text in images, but since the text is already present as a text layer, it finds nothing—hence the empty results and warnings. This is expected behavior for these tools [discussion].

For digital PDFs or PPTX files, you should disable OCR (do_ocr: false) and use a backend like PyPdfiumDocumentBackend to extract the embedded text directly. This will avoid unnecessary OCR runs and empty results. If you need to process scanned/image-based documents, enable OCR only for those files.
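
One way to apply this per file is to enable OCR only when a document has little or no embedded text. The following is a minimal sketch, not an official docling recipe: `needs_ocr`, its thresholds, and `build_converter` are hypothetical names made up for illustration, and it assumes you already have a per-page count of extracted characters from whatever text extractor you use. The docling imports are deferred into the function so the heuristic can be tried without docling installed.

```python
from typing import Sequence


def needs_ocr(chars_per_page: Sequence[int], min_chars: int = 20,
              max_empty_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: enable OCR only if most pages lack embedded text."""
    if not chars_per_page:
        return True  # nothing extractable at all, so fall back to OCR
    # A page counts as "empty" when it has fewer than min_chars characters.
    empty = sum(1 for n in chars_per_page if n < min_chars)
    return empty / len(chars_per_page) > max_empty_ratio


def build_converter(do_ocr: bool):
    """Build a docling converter for PDFs with OCR switched on or off."""
    # Imported lazily so the heuristic above runs without docling installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    opts = PdfPipelineOptions()
    opts.do_ocr = do_ocr  # False for digital files with a text layer
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=opts,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )


if __name__ == "__main__":
    print(needs_ocr([812, 640, 1033]))  # digital deck: False
    print(needs_ocr([0, 3, 0, 450]))    # mostly image-only pages: True
```

The thresholds are arbitrary starting points; for a mixed corpus you would tune them on a handful of known digital and known scanned files before relying on the split.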

If you do need OCR for scanned images and want GPU acceleration, the recommended setup is to use the torch backend for RapidOCR, not onnxruntime. The latest docling update enables GPU for RapidOCR with the torch backend:

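from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

# Select the torch backend for RapidOCR instead of onnxruntime
pipeline_options = PdfPipelineOptions()
pipeline_options.ocr_options = RapidOcrOptions(
    backend="torch",
)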

This is currently the only setup known to work reliably for GPU acceleration with RapidOCR [PR #2659].

If you want to force OCR on every page (not usually recommended for digital docs), set force_full_page_ocr: true, but this will still not extract embedded text—just run OCR on the rendered page images.
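
The two OCR settings mentioned in this reply (the backend and force_full_page_ocr) can be set together on the same options object. A minimal sketch, assuming the scanned-document case: `rapidocr_kwargs` and `build_pipeline_options` are hypothetical helper names, and the docling import is deferred so the settings logic can be inspected on its own.

```python
def rapidocr_kwargs(use_gpu: bool, full_page: bool = False) -> dict:
    """Keyword arguments for RapidOcrOptions, following the advice in this thread."""
    return {
        # torch is the backend this reply reports as working for GPU runs
        "backend": "torch" if use_gpu else "onnxruntime",
        # True runs OCR on the whole rendered page image, not just where needed
        "force_full_page_ocr": full_page,
    }


def build_pipeline_options(use_gpu: bool, full_page: bool = False):
    """Return PdfPipelineOptions with OCR enabled and RapidOCR configured."""
    # Imported lazily so rapidocr_kwargs can be used without docling installed.
    from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

    opts = PdfPipelineOptions()
    opts.do_ocr = True
    opts.ocr_options = RapidOcrOptions(**rapidocr_kwargs(use_gpu, full_page))
    return opts


if __name__ == "__main__":
    print(rapidocr_kwargs(use_gpu=True, full_page=True))
```

Keeping the settings in a plain dict makes it easy to log exactly which OCR configuration a given batch ran with.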

Summary: For digital presentations, disable OCR and extract text directly; for scanned images, use the torch backend for GPU-accelerated OCR. Let me know if you need a code example for either workflow!

1 reply

nikdulac Nov 20, 2025
Author

Okay this is super useful - thank you!
