Trying to extract large file (in MB) #2008
I am trying to extract files in a loop. Some files are in the KB range and some in the MB range. When I extract a file individually it works fine; for an approximately 12 MB file, docling takes roughly half an hour. But when I try to extract that same file together with other files and folders, it hangs for 3 hours and I eventually have to kill the task. Can someone guide me on how to overcome this situation?
Docling extraction can be slow or hang with large files or batches, mainly because the CLI processes files sequentially and some features (table structure extraction, OCR, and enrichment) are resource-intensive. When you process multiple large files together, this can exhaust system resources or hit bottlenecks, especially if expensive features are enabled or the thread count is low. To improve performance and avoid hangs:

- Increase the number of worker threads.
- Use a GPU if one is available.
- Disable expensive features (OCR, table structure extraction, enrichment) when you don't need their output.
- Process files in smaller batches (see the sketch after this list).
- Consider docling-serve for parallel or distributed workloads.
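Below is a minimal sketch of the smaller-batches point, assuming the docling Python API rather than the CLI; the folder names, chunk size, and output layout are illustrative placeholders, not part of the original answer.

```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# "input_docs" is a placeholder for your source folder.
pdf_paths = sorted(Path("input_docs").rglob("*.pdf"))

CHUNK_SIZE = 5  # illustrative batch size; tune to your machine's memory
for start in range(0, len(pdf_paths), CHUNK_SIZE):
    chunk = pdf_paths[start : start + CHUNK_SIZE]
    # raises_on_error=False lets one problematic file fail
    # without aborting the whole batch.
    for result in converter.convert_all(chunk, raises_on_error=False):
        if result.status != ConversionStatus.SUCCESS:
            print(f"Skipped {result.input.file}: {result.status}")
            continue
        out_path = Path("output") / f"{result.input.file.stem}.md"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(result.document.export_to_markdown())
```

Because `convert_all` yields results lazily, iterating over one result at a time keeps memory bounded instead of accumulating all converted documents at once.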
In summary: increase threads, use a GPU if possible, disable expensive features, process in smaller batches, and consider docling-serve for parallel/distributed workloads. This should help you extract large files much more efficiently and avoid hangs.
You can control the number of worker threads (and device) in your Python code by passing an AcceleratorOptions instance to PdfPipelineOptions. Here’s how you can adapt your code:
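A minimal sketch, assuming a recent docling release; the source path and thread count are placeholders:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Raise the worker-thread count and let docling pick the best device
# (CPU, CUDA, or MPS). Set device=AcceleratorDevice.CUDA explicitly
# if you have a GPU and want to force it.
accelerator_options = AcceleratorOptions(
    num_threads=8,
    device=AcceleratorDevice.AUTO,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options
# Disable expensive stages if you don't need their output.
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("large_file.pdf")  # placeholder path
print(result.document.export_to_markdown())
```

Depending on your docling version, the thread count may also be configurable via the OMP_NUM_THREADS environment variable when num_threads is not set explicitly.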