Optimizations to the Docling Codebase #1739
agodinezmm2007
started this conversation in
Show and tell
Replies: 2 comments 2 replies
-
Are you interested in doing any freelance work for a project I am working on? We wouldn't be analyzing academic papers but rather mortgage loan documents, pictures, pay history, etc. Essentially a document refinery: put in virtually any type of content and receive beautifully structured data, database records, and flattened and cleaned PDFs. Organized, separated, collated, etc.
1 reply
-
It’s both. They are equally important in my use case. There’s also a refinement piece to it, where we split bulk docs, orient and collate pages, find page ranges, and finally create clean PDFs as output regardless of input file type (aside from pictures).
Dominic McFadin
On Aug 1, 2025, at 12:11 AM, agodinezmm2007 ***@***.***> wrote:
Sounds interesting, is it more text extraction or context extraction/classification?
1 reply
-
Hello all,
Thank you for open sourcing this library. Docling is a central component of a project I have been working on. I have used Docling to extract textual data from PDF articles for integration into graph databases. Because I was previously using Azure Document Intelligence for this, switching reduced the cost dramatically, to $0. There were various bottlenecks, however, which necessitated multiple modifications and optimizations to ensure multi-GPU usage and increase throughput. I have created a detailed report on the integration of Docling into my project, covering Docling's strengths, its weaknesses, and how I addressed or attempted to rectify those weaknesses. I would like to open pull requests for all of the files I have modified; however, many of the files were heavily refactored. Please visit the following section for my analysis of Docling:
https://agodinezmm2007.github.io/project_portfolio/05-technical-report.html#stage-4-content-extraction-via-document-layout-analysis
This video is sped up 8x, but at 14:20 you can observe Docling's performance: https://youtu.be/ZCy5ESJ1gVE?si=ADU436JwfM5pvLod&t=862
Docling processed 100 academic journal PDF articles in 3 minutes, including text, table, and formula extraction. One reason for the speedup is that, instead of waiting for SmolDocling to reach its token limit when it gets stuck regenerating/spamming characters, I added logic in code_formula_predictor.py to detect these loops and stop them. Of course, this means some formulas are missed, but for now it's as good as I could get it to perform.
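For readers wondering what this kind of loop detection can look like, here is a minimal sketch. It is not the code from my modified code_formula_predictor.py and it does not use any Docling or SmolDocling API: the names `is_stuck_in_loop` and `generate_with_loop_guard`, the `next_token` callback, and the thresholds are all hypothetical, chosen only to illustrate the idea of stopping generation once the tail of the output is one short pattern repeated back to back.

```python
from typing import Callable, List, Sequence


def is_stuck_in_loop(tokens: Sequence[int], max_period: int = 8, min_repeats: int = 4) -> bool:
    """Return True if the tail of `tokens` is a short pattern repeated `min_repeats` times.

    Hypothetical helper: `max_period` and `min_repeats` are illustrative thresholds.
    """
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            break  # longer periods need even more tokens, so stop checking
        tail = list(tokens[-span:])
        pattern = tail[:period]
        if tail == pattern * min_repeats:
            return True
    return False


def generate_with_loop_guard(
    next_token: Callable[[List[int]], int],  # assumed stand-in for one decoding step
    max_tokens: int = 512,
    eos: int = 0,
) -> List[int]:
    """Decode until EOS, the token limit, or a detected repetition loop."""
    out: List[int] = []
    for _ in range(max_tokens):
        tok = next_token(out)
        if tok == eos:
            break
        out.append(tok)
        if is_stuck_in_loop(out):
            break  # bail out early instead of running to the token limit
    return out
```

The tradeoff is the one described above: a formula that legitimately contains a long repeated run would be cut short, so thresholds like these would need tuning against real outputs.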