Document extraction does not throw error, just stuck #2012
-
When extraction hangs without errors, it's usually due to resource exhaustion (RAM, CPU), thread/process deadlocks, or expensive features like OCR or table structure running on large documents. Docling doesn't have a built-in progress bar or heartbeat; it only logs progress at batch/document completion and (in debug mode) after page batches. If the logs stop updating for a long time, that's your main sign the process is stuck. To make extraction more robust for large jobs, set the number of threads to match your CPU cores, set the device explicitly (or leave it on 'auto'), and disable the expensive features you don't need:

```python
import multiprocessing

from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions

pdf_options = PdfPipelineOptions(
    # Skip the expensive enrichment and recognition steps.
    do_code_enrichment=False,
    do_ocr=False,
    do_table_structure=False,
    do_picture_classification=False,
    accelerator_options=AcceleratorOptions(
        num_threads=multiprocessing.cpu_count(),  # match worker threads to your cores
        device='auto'
    )
)
```

Or via CLI:
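A rough CLI equivalent might look like the following; the exact flag names vary between docling versions, so treat these as assumptions and confirm them with `docling --help`:

```bash
docling --no-ocr --num-threads 8 --device auto -v path/to/document.pdf
```

To actually apply `pdf_options` in code, pass it to the converter through a `PdfFormatOption`. A minimal sketch (the input path is a placeholder):

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)}
)
result = converter.convert("path/to/document.pdf")  # returns a ConversionResult
print(result.document.export_to_markdown()[:500])   # e.g. inspect the extracted text
```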
If you have many large files, process them in smaller batches or one at a time to avoid resource exhaustion. For very large PDFs, consider splitting them or running extraction in parallel subprocesses (one per file); a watchdog sketch of that pattern follows below. To monitor for silent stops, increase logging verbosity (e.g. enable debug logging) so docling's per-batch progress messages become visible, and treat a log that stays quiet for a long time as a sign the run is stuck. If you need distributed or large-scale extraction, the same per-file approach applies, spread across more worker processes or machines. For more details and code examples, see the docling discussion on large file extraction and performance.
To reply, just mention @dosu.
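Here is a minimal sketch of the one-subprocess-per-file pattern with a timeout, so a silently stuck file is detected, logged, and skipped instead of hanging the whole run. The folder path, the 30-minute timeout, and the bare `DocumentConverter()` are assumptions to adapt; in practice you would reuse the `pdf_options` from above via `PdfFormatOption`:

```python
import logging
import multiprocessing as mp
from pathlib import Path

from docling.document_converter import DocumentConverter

# Debug level surfaces docling's page-batch progress messages,
# so you can see in the log when a worker goes quiet.
logging.basicConfig(level=logging.DEBUG)

INPUT_DIR = Path("docs")      # assumption: folder to walk recursively
TIMEOUT_SECONDS = 30 * 60     # assumption: give up on a single file after 30 minutes


def convert_one(path: Path) -> None:
    # Each worker builds its own converter; plug in pdf_options here if desired.
    converter = DocumentConverter()
    result = converter.convert(str(path))
    # Export or persist the result here, e.g. result.document.export_to_markdown().


if __name__ == "__main__":
    for pdf in sorted(INPUT_DIR.rglob("*.pdf")):
        worker = mp.Process(target=convert_one, args=(pdf,))
        worker.start()
        worker.join(TIMEOUT_SECONDS)
        if worker.is_alive():
            # Deadline passed with no result: treat the file as stuck and move on.
            logging.error("timed out, terminating worker for %s", pdf)
            worker.terminate()
            worker.join()
        elif worker.exitcode != 0:
            logging.error("worker failed (exit code %s) for %s", worker.exitcode, pdf)
        else:
            logging.info("finished %s", pdf)
```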
-
Document extraction does not throw an error, it just gets stuck.
I am trying to extract documents in a loop over an entire folder and its sub-folders.
The documents are several MB in size and have 100+ pages. Extraction prints multiple warning messages like the ones below:
I waited for an hour; how can I tell whether the process has silently stopped?