Hi there, could someone please help me optimise Docling for a low-resource machine? At the moment, while very accurate, PDF parsing takes 5 minutes with all the default settings. How can I speed this up? Thank you!
-
@timif2 Good to see this question coming up 😃 . There are several things you can do to improve the performance, depending on the use case you have. The pipeline features, ordered from most expensive to cheapest: OCR, table structure recognition, PDF parsing. My recommendations are:
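Full API code sample:
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # pick what you need
pipeline_options.do_table_structure = False  # pick what you need
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,  # switch to beta PDF backend
        )
    }
)
conv_result = doc_converter.convert(input_doc_path)  # input_doc_path: path to your PDF
print(conv_result.document.export_to_markdown())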
-
Yes, I have the same issue with PDFs that contain multiple pages and tables. I want table layout recognition and good accuracy for the extracted data, among other things, but overall I also want speed and fast preprocessing. How can I do that?
-
Hi, I have a similar question. I'm running
and here is what the error looks like:
Are there any options in docling-serve to improve performance?
-
There are many strategies. I applied the items already mentioned, then added the ones below, and sped up the process 4x. It is still slow, though. I preprocess the PDF to remove all pages I don't want to parse, e.g. title page, TOC, preface, index, etc. Note that I'm saving the TOC as a JSON file and using that to resolve the "all headings are ##" issue. Now I have perfect heading levels.
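For anyone wanting to try the TOC trick, here is a rough sketch of how the heading fix could work. It is not the poster's actual script: the JSON layout (a list of objects with `title` and `level` keys) and the function names `load_toc_levels` and `fix_heading_levels` are assumptions made up for illustration. It only uses the standard library and works on the exported Markdown string.

```python
import json
import re
from typing import Dict

# Matches an ATX heading such as "## Some title".
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def load_toc_levels(toc_json: str) -> Dict[str, int]:
    """Map lower-cased TOC titles to heading levels (assumed JSON layout)."""
    entries = json.loads(toc_json)
    return {e["title"].strip().lower(): int(e["level"]) for e in entries}


def fix_heading_levels(markdown: str, levels: Dict[str, int]) -> str:
    """Rewrite headings whose text appears in the TOC; leave the rest alone."""
    out = []
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            title = m.group(2)
            level = levels.get(title.strip().lower())
            if level is not None:
                # Clamp to the 1..6 range that Markdown supports.
                line = "#" * min(max(level, 1), 6) + " " + title
        out.append(line)
    return "\n".join(out)
```

Titles are matched case-insensitively and must otherwise be identical, so headings that the parser split or altered will be left at their original level. The sketch also does not skip `#` lines inside code blocks.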
I also break the job into two parts and process them in parallel, and I set the number of threads.
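A minimal sketch of the split-and-parallelise idea, assuming the work can be divided by page range. The helper names `split_page_ranges` and `convert_in_parallel` are hypothetical, and `convert_chunk` stands in for whatever function actually parses one range of pages. A thread pool is used here for simplicity; for CPU-bound parsing a process pool may well work better.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
PageRange = Tuple[int, int]  # inclusive (first_page, last_page)


def split_page_ranges(first: int, last: int, parts: int) -> List[PageRange]:
    """Split the inclusive range [first, last] into contiguous chunks."""
    total = last - first + 1
    if total <= 0 or parts <= 0:
        raise ValueError("need at least one page and one part")
    parts = min(parts, total)  # never create empty chunks
    base, extra = divmod(total, parts)
    ranges = []
    start = first
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def convert_in_parallel(
    convert_chunk: Callable[[PageRange], T],
    first: int,
    last: int,
    parts: int = 2,
) -> List[T]:
    """Run convert_chunk on each page range; results keep page order."""
    ranges = split_page_ranges(first, last, parts)
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        return list(pool.map(convert_chunk, ranges))
```

With Docling, `convert_chunk` might wrap a call like `doc_converter.convert(path, page_range=r)` and return the exported Markdown, assuming your installed version accepts a `page_range` argument. For the thread setting, the Docling docs describe an `OMP_NUM_THREADS` environment variable; check the documentation for your version before relying on either.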
@timif2 Good to see this question coming up 😃 .
There are several things you can do to improve the performance, depending on the use case you have. The pipeline features, ordered from most expensive to cheapest: OCR, table structure recognition, PDF parsing. My recommendations are:
--no-ocr
DoclingParseV2DocumentBackend
(beta), which speeds up PDF loading by ~10x, with good impact o…