Hi @nikdulac! I'm Dosu and I'm helping the docling team.

You're seeing empty OCR results because RapidOCR (and most OCR engines) are designed for scanned/image-based documents, not digital PDFs or PPTX files that already contain embedded, selectable text. When you run OCR on these digital documents, the engine looks for text in images, but since the text is already present as a text layer, it finds nothing, hence the empty results and warnings. This is expected behavior for these tools [discussion].

For digital PDFs or PPTX files, you should disable OCR.

If you do need OCR for scanned images and want GPU acceleration, the recommended setup is to use the torch backend:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.ocr_options = RapidOcrOptions(
    backend="torch",  # run RapidOCR through the torch backend
)
```

This is currently the only setup known to work reliably for GPU acceleration with RapidOCR [PR #2659]. Forcing OCR on every page is also possible, but it is not usually recommended for digital docs.

Summary: For digital presentations, disable OCR and extract text directly; for scanned images, use the torch backend for GPU-accelerated OCR. Let me know if you need a code example for either workflow!
2025-11-20 17:04:48 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!
[WARNING] 2025-11-20 17:04:48,236 [RapidOCR] main.py:123: The text detection result is empty
2025-11-20 17:04:57 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!
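To judge how often these warnings fire across a run, the captured log can be tallied. The following is a minimal sketch using only the Python standard library; the function name `count_empty_ocr_warnings` and the sample log are illustrative assumptions, and the two marker strings are copied from the warnings above. It strips ANSI color codes first, since RapidOCR's console output is colorized.

```python
import re
from collections import Counter

# Matches ANSI color/style escape sequences such as "\x1b[33m" and "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

# Warning texts copied from the log lines in this thread.
EMPTY_MARKERS = (
    "RapidOCR returned empty result!",
    "The text detection result is empty",
)


def count_empty_ocr_warnings(log_text: str) -> Counter:
    """Count how many log lines contain each empty-result warning."""
    counts: Counter = Counter()
    for raw_line in log_text.splitlines():
        line = ANSI_ESCAPE.sub("", raw_line)  # drop color codes before matching
        for marker in EMPTY_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Hypothetical captured log, built from the three lines shown above.
sample_log = "\n".join([
    "2025-11-20 17:04:48 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!",
    "\x1b[33m[WARNING] 2025-11-20 17:04:48,236 [RapidOCR] main.py:123: The text detection result is empty\x1b[0m",
    "2025-11-20 17:04:57 - docling.models.rapid_ocr_model - WARNING - RapidOCR returned empty result!",
])

counts = count_empty_ocr_warnings(sample_log)
for marker, n in counts.items():
    print(f"{n} x {marker}")
```

Comparing these counts with the number of pages processed gives a rough idea of how much OCR work produces no text at all.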
Using these settings for PDF / PPTX docs, running on a massed A6000 instance (48/96GB). It's a lot of highly visual investment presentations, with not much real scanned content. It averages ~4 sec per page.
Current Docling Configuration
Device & Threading:
device: cuda (GPU acceleration via CUDA)
num_threads: 8 (CPU threads for parallel processing)
use_threaded_pdf: true (using ThreadedPdfPipelineOptions for GPU)
OCR Settings:
do_ocr: true (OCR enabled)
backend: onnxruntime (RapidOCR with ONNX Runtime - NOT torch)
force_full_page_ocr: false (only OCR when needed)
Batch Sizes (GPU optimization):
ocr_batch_size: 32
layout_batch_size: 64
table_batch_size: 4
Image Extraction:
generate_picture_images: true (extract images for vision processing)
images_scale: 1.0 (reduced from 1.5 for speed)
do_picture_description: false (no AI image descriptions)
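As a rough illustration of how the advice in this thread maps onto the OCR settings listed above, here is a small standard-library sketch. `OcrSettings` and `apply_thread_advice` are hypothetical names made up for this example and are not part of docling; the field names simply mirror the labels in the configuration list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container mirroring the OCR-related settings above."""
    do_ocr: bool
    backend: str
    force_full_page_ocr: bool
    device: str


def apply_thread_advice(settings: OcrSettings, mostly_scanned: bool) -> OcrSettings:
    """Return adjusted settings following the recommendation in this thread."""
    if not mostly_scanned:
        # Digital PDFs / PPTX already carry a text layer: skip OCR entirely.
        return replace(settings, do_ocr=False)
    if settings.device == "cuda":
        # Scanned content on GPU: the thread recommends the torch backend.
        return replace(settings, do_ocr=True, backend="torch")
    # Scanned content without a GPU: keep OCR on, leave the backend alone.
    return replace(settings, do_ocr=True)


# Values taken from the configuration listed above.
current = OcrSettings(
    do_ocr=True,
    backend="onnxruntime",
    force_full_page_ocr=False,
    device="cuda",
)

digital = apply_thread_advice(current, mostly_scanned=False)
scanned = apply_thread_advice(current, mostly_scanned=True)
print(digital)
print(scanned)
```

For the mostly digital presentations described here, the first branch applies: OCR is switched off and the other fields are left unchanged.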