Trying to extract large file (in MB) #2008
I am trying to extract files in a loop. Some files are in the KB range and some in the MB range. When I extract a file individually it works fine; for an approximately 12 MB file, docling takes roughly half an hour. But when I try to extract that same file together with other files and folders, it hangs for 3 hours and I eventually have to kill the task. Can someone guide me on how to overcome this situation?
Docling extraction can be slow or hang with large files or batches, mainly because the CLI processes files sequentially and some features (table structure extraction, OCR, and enrichment) are resource-intensive. When you process multiple large files together, this can exhaust system resources or hit bottlenecks, especially if expensive features are enabled or the thread count is low. To improve performance and avoid hangs:

- Increase the number of worker threads.
- Use a GPU if one is available.
- Disable expensive features (OCR, table structure extraction, enrichment) when you don't need their output.
- Process files in smaller batches (see the sketch after this list).
- Consider docling-serve for parallel or distributed workloads.
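Below is a minimal sketch of the smaller-batches point, assuming the docling Python API rather than the CLI; the folder names, chunk size, and output layout are illustrative placeholders, not part of the original answer.

```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# "input_docs" is a placeholder for your source folder.
pdf_paths = sorted(Path("input_docs").rglob("*.pdf"))

CHUNK_SIZE = 5  # illustrative batch size; tune to your machine's memory
for start in range(0, len(pdf_paths), CHUNK_SIZE):
    chunk = pdf_paths[start : start + CHUNK_SIZE]
    # raises_on_error=False lets one problematic file fail
    # without aborting the whole batch.
    for result in converter.convert_all(chunk, raises_on_error=False):
        if result.status != ConversionStatus.SUCCESS:
            print(f"Skipped {result.input.file}: {result.status}")
            continue
        out_path = Path("output") / f"{result.input.file.stem}.md"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(result.document.export_to_markdown())
```

Because `convert_all` yields results lazily, iterating over one result at a time keeps memory bounded instead of accumulating all converted documents at once.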
In summary: increase threads, use a GPU if possible, disable expensive features, process in smaller batches, and consider docling-serve for parallel/distributed workloads. This should help you extract large files much more efficiently and avoid hangs.
You can control the number of worker threads (and device) in your Python code by passing an AcceleratorOptions instance to PdfPipelineOptions. Here’s how you can adapt your code:
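A minimal sketch, assuming a recent docling release; the source path and thread count are placeholders:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Raise the worker-thread count and let docling pick the best device
# (CPU, CUDA, or MPS). Set device=AcceleratorDevice.CUDA explicitly
# if you have a GPU and want to force it.
accelerator_options = AcceleratorOptions(
    num_threads=8,
    device=AcceleratorDevice.AUTO,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options
# Disable expensive stages if you don't need their output.
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("large_file.pdf")  # placeholder path
print(result.document.export_to_markdown())
```

Depending on your docling version, the thread count may also be configurable via the OMP_NUM_THREADS environment variable when num_threads is not set explicitly.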