Document extraction does not throw error, just stuck #2012
-
When extraction hangs without errors, it's usually due to resource exhaustion (RAM, CPU), thread/process deadlocks, or expensive features like OCR or table structure running on large documents. Docling doesn't have a built-in progress bar or heartbeat; it only logs progress at batch/document completion and (in debug mode) after page batches. If the logs stop updating for a long time, that's your main sign the process is stuck. To make extraction more robust for large jobs, set the number of threads to match your CPU cores, set the device explicitly (or leave it on 'auto'), and disable the expensive features you don't need:

```python
import multiprocessing

from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions

pdf_options = PdfPipelineOptions(
    # Skip the expensive enrichment and recognition steps.
    do_code_enrichment=False,
    do_ocr=False,
    do_table_structure=False,
    do_picture_classification=False,
    accelerator_options=AcceleratorOptions(
        num_threads=multiprocessing.cpu_count(),  # match worker threads to your cores
        device='auto'
    )
)
```

Or via CLI:
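A rough CLI equivalent might look like the following; the exact flag names vary between docling versions, so treat these as assumptions and confirm them with `docling --help`:

```bash
docling --no-ocr --num-threads 8 --device auto -v path/to/document.pdf
```

To actually apply `pdf_options` in code, pass it to the converter through a `PdfFormatOption`. A minimal sketch (the input path is a placeholder):

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)}
)
result = converter.convert("path/to/document.pdf")  # returns a ConversionResult
print(result.document.export_to_markdown()[:500])   # e.g. inspect the extracted text
```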
If you have many large files, process them in smaller batches or one at a time to avoid resource exhaustion. For very large PDFs, consider splitting them or running extraction in parallel subprocesses (one per file); a watchdog sketch of that pattern follows below. To monitor for silent stops, increase logging verbosity (e.g. enable debug logging) so docling's per-batch progress messages become visible, and treat a log that stays quiet for a long time as a sign the run is stuck. If you need distributed or large-scale extraction, the same per-file approach applies, spread across more worker processes or machines. For more details and code examples, see the docling discussion on large file extraction and performance.
To reply, just mention @dosu.
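Here is a minimal sketch of the one-subprocess-per-file pattern with a timeout, so a silently stuck file is detected, logged, and skipped instead of hanging the whole run. The folder path, the 30-minute timeout, and the bare `DocumentConverter()` are assumptions to adapt; in practice you would reuse the `pdf_options` from above via `PdfFormatOption`:

```python
import logging
import multiprocessing as mp
from pathlib import Path

from docling.document_converter import DocumentConverter

# Debug level surfaces docling's page-batch progress messages,
# so you can see in the log when a worker goes quiet.
logging.basicConfig(level=logging.DEBUG)

INPUT_DIR = Path("docs")      # assumption: folder to walk recursively
TIMEOUT_SECONDS = 30 * 60     # assumption: give up on a single file after 30 minutes


def convert_one(path: Path) -> None:
    # Each worker builds its own converter; plug in pdf_options here if desired.
    converter = DocumentConverter()
    result = converter.convert(str(path))
    # Export or persist the result here, e.g. result.document.export_to_markdown().


if __name__ == "__main__":
    for pdf in sorted(INPUT_DIR.rglob("*.pdf")):
        worker = mp.Process(target=convert_one, args=(pdf,))
        worker.start()
        worker.join(TIMEOUT_SECONDS)
        if worker.is_alive():
            # Deadline passed with no result: treat the file as stuck and move on.
            logging.error("timed out, terminating worker for %s", pdf)
            worker.terminate()
            worker.join()
        elif worker.exitcode != 0:
            logging.error("worker failed (exit code %s) for %s", worker.exitcode, pdf)
        else:
            logging.info("finished %s", pdf)
```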
-
Document extraction does not throw an error, it just gets stuck.
I am trying to extract documents in a loop over an entire folder and its sub-folders.
The documents are several MB in size and have 100+ pages. Extraction prints multiple warning messages like the ones below:
I waited for an hour; how can I tell whether the process has silently stopped?