Replies: 1 comment
docling-serve's scaling is limited by its architecture: each instance runs an in-process async orchestrator with a pool of worker threads, controlled by DOCLING_SERVE_ENG_LOC_NUM_WORKERS. To scale up, you can increase the number of Uvicorn worker processes and raise the engine's worker-thread count.
If you see little improvement from more workers or hardware, the bottleneck may be in PDF parsing, I/O, or model inference. Profiling with logs, metrics, or external profilers can help pinpoint where time is spent. For GPU memory management, consider adding explicit cleanup between jobs.
For best throughput, maximize both the number of Uvicorn worker processes and DOCLING_SERVE_ENG_LOC_NUM_WORKERS.
To reply, just mention @dosu.
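As a rough illustration of the profiling and GPU-cleanup suggestions above, here is a minimal sketch that calls docling directly rather than going through docling-serve. It assumes the standard docling DocumentConverter API and an optional PyTorch install; the timing wrapper, cleanup helper, and file names are illustrative assumptions, not part of docling-serve itself.

```python
import gc
import time

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

def convert_with_timing(path: str) -> str:
    """Convert one PDF and log how long the conversion step takes."""
    start = time.perf_counter()
    result = converter.convert(path)  # PDF parsing + layout/model inference
    elapsed = time.perf_counter() - start
    print(f"{path}: {elapsed:.1f}s")
    return result.document.export_to_markdown()

def free_gpu_memory() -> None:
    """Explicit cleanup between jobs, if running on a CUDA device."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only install; nothing to clean up

if __name__ == "__main__":
    for pdf in ["doc1.pdf", "doc2.pdf"]:  # hypothetical input files
        convert_with_timing(pdf)
        free_gpu_memory()
```

If the per-document times are dominated by model inference, adding more worker threads alone won't help much, which matches the advice above about locating the bottleneck first.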
Hi everyone,
We're very happy with docling's accuracy and we want to use it in production. We need to be able to process hundreds of documents every 5-10 minutes.
I've found docling-serve and I tried to stress-test it but the performance was quite slow. I'd love to know if I'm doing something wrong.
I've performed 5 different experiments - 1 locally (MacBook Pro M2, 12 cores, 32 GB memory) and 4 in Google Cloud Run with GPU (1 GPU - NVIDIA L4).
I've tried to process 5 PDF documents at once (using the UI's file upload feature), with an overall size of 6.2 MB. The PDF page counts are 2, 3, 8, 13 and 101 pages - 127 pages in total.
- Setting DOCLING_SERVE_ENG_LOC_NUM_WORKERS to 4 didn't seem to help
- Setting DOCLING_SERVE_ENG_LOC_NUM_WORKERS to 8 didn't seem to help

As you can see, the lowest duration was 4m 37s. I also tried to increase DOCLING_SERVE_ENG_LOC_NUM_WORKERS, but that didn't seem to work.
I'd love to know how I can scale docling (and maybe docling-serve) to a point where it can process 500-1000 pages in under 1-2 minutes.
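For reference, here is a rough sketch of how the same 5-file batch could be submitted programmatically instead of through the UI, so the client isn't the limiting factor. It assumes docling-serve exposes a /v1alpha/convert/file upload endpoint with a "files" multipart field (check your deployment's API docs); the base URL, port, and file names are placeholders.

```python
import asyncio
from pathlib import Path

import httpx

# Placeholder base URL and file names; the endpoint path and multipart
# field name are assumptions - adjust to match your docling-serve version.
BASE_URL = "http://localhost:5001"
PDFS = ["doc_2p.pdf", "doc_3p.pdf", "doc_8p.pdf", "doc_13p.pdf", "doc_101p.pdf"]

async def convert(client: httpx.AsyncClient, path: str) -> None:
    payload = {"files": (Path(path).name, Path(path).read_bytes(), "application/pdf")}
    resp = await client.post(f"{BASE_URL}/v1alpha/convert/file", files=payload)
    resp.raise_for_status()
    print(f"{path}: HTTP {resp.status_code}")

async def main() -> None:
    # Fire all five uploads at once so the server's worker pool,
    # not the client, determines the total duration.
    async with httpx.AsyncClient(timeout=None) as client:
        await asyncio.gather(*(convert(client, p) for p in PDFS))

if __name__ == "__main__":
    asyncio.run(main())
```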