Speed for low resource machine #245

timif2 · 2024-11-05T11:46:02Z

timif2
Nov 5, 2024

Hi there,

Please could someone help on how I can optimise use of Docling for a low resource machine? At the moment, whilst very accurate, PDF parsing takes 5 minutes with all the default settings. How can I speed this up?

Thank you!

Answered by cau-git


cau-git · 2024-11-05T15:47:19Z

cau-git
Nov 5, 2024
Maintainer

@timif2 Good to see this question coming up 😃 .

There are several things you can do to improve performance, depending on your use case. The pipeline features, ordered from most expensive to cheapest, are: OCR, table structure recognition, PDF parsing. My recommendations are:

Turn off OCR if you don't need it for your data (e.g. your PDFs are digital-only)
- CLI option --no-ocr
Turn off table structure recognition if you don't need table structure (e.g. your PDFs have no tables or you don't need the tables' content)
- only possible in Python API code, see below.
Switch the PDF backend to DoclingParseV2DocumentBackend (beta), which speeds up PDF loading by ~10x, with good impact on the overall pipeline speed.
- CLI option --pdf-backend=dlparse_v2

Full API code sample:

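# Import paths as of the docling releases current when this answer was written;
# they may differ in later versions.
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = "path/to/your.pdf"  # placeholder: replace with your own file

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # pick what you need
pipeline_options.do_table_structure = False  # pick what you need

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # switch to beta PDF backend
        )
    }
)
conv_result = doc_converter.convert(input_doc_path)

print(conv_result.document.export_to_markdown())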

16 replies

JonoReshefAltaML Jul 25, 2025

Can someone please explain what the import path of DoclingParseV2DocumentBackend is?

sujalthink41 Sep 5, 2025

Hey, I tried it with a long PDF of 170 pages, and it took a hell of a lot of time, around 20 min.

alexisdrakopoulos Sep 7, 2025

It's still shockingly slow on larger documents, even with OCR off. 5+ minutes for a 400-page document for me.

MarioRicoIbanez Sep 9, 2025

Can anyone share what the differences between the backends are? I have seen that there are multiple (V1, V2, V3, V4), but I found nothing in the documentation.

hisan-ideamaker Sep 12, 2025

same question as @MarioRicoIbanez

Aadi-stack · 2025-07-11T05:02:23Z

Aadi-stack
Jul 11, 2025

Yes, same issue here. I have PDFs with multiple pages and tables, and I want table layout recognition with good accuracy, plus the extracted data, among other things.

So overall I want speed and fast preprocessing. How can I do that?

0 replies

tjhoo · 2025-07-11T09:32:57Z

tjhoo
Jul 11, 2025

Hi

I have a similar question. I'm running docling-serve on an AWS EC2 instance (4g.large) with 2 vCPUs and 8 GiB of memory.

docling-serve is able to process a PDF with 2 pages (it was slow but it completed successfully). However, it failed to process a PDF with 7 pages even though I set the maximum wait time to 30 minutes:

$ export DOCLING_SERVE_MAX_SYNC_WAIT=1800
$ docling-serve run --enable-ui

and here is what the error looks like:

INFO:     175.139.225.206:52716 - "POST /v1alpha/convert/file HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
  File "/home/ec2-user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 327, in app
    content = await serialize_response(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 176, in serialize_response
    raise ResponseValidationError(
fastapi.exceptions.ResponseValidationError: 3 validation errors:
  {'type': 'missing', 'loc': ('response', 'document'), 'msg': 'Field required', 'input': HTTPException(status_code=504, detail='Conversion is taking too long. The maximum wait time is configure as DOCLING_SERVE_MAX_SYNC_WAIT=1800.')}
  {'type': 'missing', 'loc': ('response', 'status'), 'msg': 'Field required', 'input': HTTPException(status_code=504, detail='Conversion is taking too long. The maximum wait time is configure as DOCLING_SERVE_MAX_SYNC_WAIT=1800.')}
  {'type': 'missing', 'loc': ('response', 'processing_time'), 'msg': 'Field required', 'input': HTTPException(status_code=504, detail='Conversion is taking too long. The maximum wait time is configure as DOCLING_SERVE_MAX_SYNC_WAIT=1800.')}

Is there any option in docling-serve to improve the performance?

2 replies

psinojiya Jul 30, 2025

Hi @tjhoo, we are planning to run a similar experiment. Were you able to parse long PDFs on an AWS EC2 instance (4g.large) with 2 vCPUs and 8 GiB of memory?

tjhoo Jul 31, 2025

Yes, but you have to ensure that there are at most two concurrent requests at a time. It takes roughly 30 to 40 seconds to convert a PDF with two pages.
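
One way to hold the client side to two requests at a time is a small worker pool. This is only a minimal sketch: `convert_one` is a hypothetical callable supplied by the caller (for example, a function that sends one file to a docling-serve instance), not something docling-serve provides.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 2  # the limit reported above for a 2 vCPU / 8 GiB instance


def convert_all(paths, convert_one, max_concurrent=MAX_CONCURRENT):
    """Call convert_one(path) for every path, at most max_concurrent at a time.

    convert_one is a hypothetical stand-in for whatever performs a single
    conversion request. Results come back in the same order as paths.
    """
    # The pool size is what caps the number of conversions in flight.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(convert_one, paths))
```

Any exception raised by `convert_one` is re-raised when its result is collected, so a failed conversion is not silently dropped.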

teachamantofish · 2025-09-10T16:00:20Z

teachamantofish
Sep 10, 2025

There are many strategies. I did the items already mentioned, but added those below and sped up the process 4x. Still slow though . . .

Preprocess the PDF to remove all pages I don't want to parse: e.g. title page, toc, preface, index, etc.

Note that I'm saving the TOC as a json file and using that to resolve the "all headings are ##" issue. Now I have perfect heading levels.

import fitz  # PyMuPDF
import os, json

# === CONSTANTS ===
FILE = r"C:\tools\rag_docs\mifref\test.pdf"
FRONT_REMOVE = 0          # pages to remove from start
TOC = (1, 2)              # Extract these pages into toc.json (1-based inclusive)
REMOVE_MIDDLE = [(3, 9)]  # Remove these pages from reduced PDF (1-based inclusive)
BACK_REMOVE = 2          # pages to remove from end
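
The constants above describe which pages survive the preprocessing step. The following is a rough sketch of how they could be turned into a reduced PDF, not the poster's actual code: `pages_to_keep` and `write_reduced_pdf` are hypothetical helper names, and dropping the TOC pages from the reduced PDF is an assumption.

```python
def pages_to_keep(page_count, front_remove=0, toc=None, remove_middle=(), back_remove=0):
    """Return the 0-based indices of the pages that remain after trimming.

    toc and remove_middle use 1-based inclusive ranges, like the constants above.
    Assumption: the TOC pages are dropped from the reduced PDF.
    """
    drop = set(range(front_remove))                            # leading pages
    drop |= set(range(page_count - back_remove, page_count))   # trailing pages
    ranges = list(remove_middle) + ([toc] if toc else [])
    for first, last in ranges:
        drop |= set(range(first - 1, last))                    # 1-based inclusive -> 0-based
    return [i for i in range(page_count) if i not in drop]


def write_reduced_pdf(src_path, dst_path, keep):
    """Write only the pages listed in keep (0-based) to dst_path."""
    import fitz  # PyMuPDF; imported lazily so pages_to_keep works without it

    with fitz.open(src_path) as doc:
        doc.select(keep)   # keep only the listed pages, in this order
        doc.save(dst_path)
```

With the constants above and a 20-page file, `pages_to_keep(20, FRONT_REMOVE, TOC, REMOVE_MIDDLE, BACK_REMOVE)` keeps the 0-based pages 9 through 17.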

I break the job into two parts and process them in parallel. I also set the number of threads.

def convert_in_two_jobs(src_pdf: str, final_out_dir: str):
    part1, part2 = _split_pdf_two_halves(src_pdf)
    # ensure final_out_dir exists and do not create extra subdirectories
    os.makedirs(final_out_dir, exist_ok=True)
    out1 = os.path.join(final_out_dir, "temp1.md")
    out2 = os.path.join(final_out_dir, "temp2.md")
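
The snippet above is cut off, and `_split_pdf_two_halves` is not shown. As a rough sketch of the general idea rather than the poster's implementation: split the page range in two, convert each half in its own process, and join the Markdown. `halve_page_range`, `convert_halves`, and the `convert_pages` callable are hypothetical names introduced here.

```python
from concurrent.futures import ProcessPoolExecutor


def halve_page_range(page_count):
    """Split pages 1..page_count into two 1-based inclusive ranges of near-equal size."""
    if page_count < 2:
        raise ValueError("need at least two pages to split")
    mid = (page_count + 1) // 2
    return (1, mid), (mid + 1, page_count)


def convert_halves(src_pdf, page_count, convert_pages):
    """Convert both halves in parallel and return the concatenated Markdown.

    convert_pages(src_pdf, first, last) is a hypothetical, picklable top-level
    function that returns the Markdown for the given 1-based page range.
    """
    first_half, second_half = halve_page_range(page_count)
    with ProcessPoolExecutor(max_workers=2) as pool:
        job1 = pool.submit(convert_pages, src_pdf, *first_half)
        job2 = pool.submit(convert_pages, src_pdf, *second_half)
        # Collect in page order so the output reads front to back.
        return job1.result() + "\n\n" + job2.result()
```

On Windows, call `convert_halves` from under an `if __name__ == "__main__":` guard, because worker processes re-import the main module there. Each worker process uses its own memory, so two jobs may be too many on a very small machine.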

0 replies
