Optimizations to the Docling Codebase #1739

agodinezmm2007 · 2025-06-10T07:04:30Z

Hello all,

Thank you for open-sourcing this library. Docling is a central component of a project I have been working on. I have used Docling to extract text from PDF articles for integration into graph databases. Since I was previously using Azure Document Intelligence for this, the cost dropped dramatically, to $0. There were various bottlenecks, however, which required multiple modifications and optimizations to enable multi-GPU usage and increase throughput. I have written a detailed report on integrating Docling into my project, covering its strengths, its weaknesses, and how I addressed or attempted to address those weaknesses. I would like to open pull requests for all of the files I have modified; however, many of them were heavily refactored. Please see the following section for my analysis of Docling:

https://agodinezmm2007.github.io/project_portfolio/05-technical-report.html#stage-4-content-extraction-via-document-layout-analysis

This video is sped up 8x, but at 14:20 you can observe Docling's performance: https://youtu.be/ZCy5ESJ1gVE?si=ADU436JwfM5pvLod&t=862

Docling processed 100 academic journal PDF articles in 3 minutes, including text, table, and formula extraction. One reason for the speedup is that, instead of waiting for SmolDocling to reach its token limit when it gets stuck regenerating/spamming characters, I added logic to code_formula_predictor.py to detect these loops and stop them. Of course, this means some formulas are missed, but for now this is as good as I could get it to perform.
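For readers curious what that kind of loop detection can look like, here is a minimal sketch. It is not the code from my modified code_formula_predictor.py, just an illustration of the idea: check whether the tail of the generated token sequence is one short pattern repeated back to back, and stop generation early if so. The names `has_repetition_loop` and `generate_with_loop_guard`, the `next_token` callable, and the thresholds are all hypothetical and would need tuning against real formulas.

```python
from typing import Callable, List, Sequence, Tuple


def has_repetition_loop(
    token_ids: Sequence[int],
    max_period: int = 8,   # longest repeating pattern to look for (illustrative value)
    min_repeats: int = 4,  # how many back-to-back copies count as a loop (illustrative value)
) -> bool:
    """Return True if the tail of token_ids is a short pattern repeated min_repeats times."""
    for period in range(1, max_period + 1):
        window = period * min_repeats
        if len(token_ids) < window:
            break  # longer periods would need even more tokens
        tail = token_ids[-window:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(window)):
            return True
    return False


def generate_with_loop_guard(
    next_token: Callable[[List[int]], int],  # hypothetical stand-in for one decoding step
    max_new_tokens: int = 512,
    eos_id: int = 0,
) -> Tuple[List[int], bool]:
    """Decode token by token; return (tokens, aborted_because_of_loop)."""
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == eos_id:
            break
        tokens.append(tok)
        if has_repetition_loop(tokens):
            return tokens, True  # stop early instead of running to the token limit
    return tokens, False
```

The trade-off is the one mentioned above: aggressive thresholds save a lot of time on stuck generations, but a formula that legitimately repeats a short pattern can be cut off. With a Hugging Face model, the same check could presumably be wrapped in a custom stopping criterion rather than a hand-written decoding loop.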

DLME2024 · 2025-07-31T06:22:38Z

Are you interested in doing any freelance work for a project I am working on? We wouldn't be analyzing academic papers but rather mortgage loan documents, pictures, pay history, etc. Essentially a document refinery: put in virtually any type of content and receive beautifully structured data, newly created database records, and flattened, cleaned PDFs. Organized, separated, collated, etc.

1 reply

agodinezmm2007 Aug 1, 2025
Author

Sounds interesting, is it more text extraction or context extraction/classification?

DLME2024 · 2025-08-01T05:36:17Z

It’s both; they are equally important in my use case. There’s also a refinement piece, where we split bulk docs, orient and collate them, find page ranges, and finally create nice clean PDFs as output regardless of input file type (aside from pictures).

Dominic McFadin

1 reply

agodinezmm2007 Aug 2, 2025
Author

Email me at [email protected] with project details. Send 30+ PDF samples (5 per format if diverse). Clone https://github.com/agodinezmm2007/docling_mod and test with scripts/sample.ipynb to see if it handles your document types effectively.
