Extracting Content from PDF with Docling - Need Guidance #1237

nikunjgoel95 · 2025-03-24T20:48:14Z

Hi folks! 👋

I’m working on extracting content from a PDF and have tried a few different methods available online, such as Unstructured and Azure AI. Recently, Docling caught my attention, and I’ve been experimenting with it to parse documents and get annotations using smolvlm_picture_description. My goal is to put the annotations back into the same spot in the document.

For now, I’m focusing on this basic setup, but I plan to improve it in the future. Any guidance or suggestions would be greatly appreciated!

⸻

My Current Approach:

Here’s the code snippet I’m working with:

```python
from pathlib import Path

# Import paths follow the Docling 2.x module layout.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    TesseractCliOcrOptions,
    smolvlm_picture_description,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc = Path("Random file.pdf")

pipeline_options = PdfPipelineOptions()

# Enabling various features
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.do_code_enrichment = True
pipeline_options.do_formula_enrichment = True
pipeline_options.do_picture_classification = True
pipeline_options.do_picture_description = True
pipeline_options.force_backend_text = True
pipeline_options.table_structure_options.do_cell_matching = True

# Image-related options
pipeline_options.images_scale = 2.0
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

# Picture description model configuration
pipeline_options.picture_description_options = smolvlm_picture_description
pipeline_options.picture_description_options.prompt = (
    "Describe the image in three sentences. Be concise and accurate."
)

# OCR configuration
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options

# Accelerator configuration
accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CUDA
)
pipeline_options.accelerator_options = accelerator_options

# Creating a document converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

# Enable profiling to measure time spent
settings.debug.profile_pipeline_timings = True

# Convert the document
conversion_result = converter.convert(input_doc)
doc = conversion_result.document

# Print the conversion results
md = doc.export_to_markdown()
print(md)

# List with total time per document
doc_conversion_secs = conversion_result.timings["pipeline_total"].times
print(f"Conversion secs: {doc_conversion_secs}")
```
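To make the "same spot" goal concrete, here is a minimal sketch of the re-insertion step on the exported Markdown, using only the standard library. It assumes each picture shows up in the Markdown as the placeholder `<!-- image -->` and that the descriptions are available as a plain list in document order; `splice_descriptions` and `IMAGE_PLACEHOLDER` are hypothetical names for illustration, not part of Docling.

```python
import re

# Assumption: the exported Markdown marks each picture with this placeholder.
IMAGE_PLACEHOLDER = "<!-- image -->"


def splice_descriptions(markdown: str, descriptions: list[str]) -> str:
    """Put the i-th description right after the i-th image placeholder."""
    remaining = iter(descriptions)

    def add_caption(match: re.Match) -> str:
        description = next(remaining, None)
        if description is None:
            # More placeholders than descriptions: leave the placeholder as-is.
            return match.group(0)
        return f"{match.group(0)}\n\n> Image description: {description}"

    return re.sub(re.escape(IMAGE_PLACEHOLDER), add_caption, markdown)


sample_md = "Intro text.\n\n<!-- image -->\n\nMore text.\n\n<!-- image -->\n"
annotated = splice_descriptions(sample_md, ["A bar chart of sales."])
print(annotated)
```

This only works if placeholders and descriptions stay in the same order, which is exactly the part I'm unsure about.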

⸻

Questions:
1. Is there a better way to put the annotations back in the same spot in the PDF?
2. Any suggestions on optimizing performance or improving the annotation accuracy?
3. How can I effectively use images as part of the prompt? Specifically:
• How can I use Docling's chunking and still get the base64-encoded image back with each chunk?
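As a rough illustration of the result I'm after for question 3, the sketch below attaches base64-encoded pictures to text chunks. `Chunk`, `PictureRef`, and `attach_images` are hypothetical stand-ins, not Docling types, and matching pictures to chunks by page number is an assumption made only to keep the example small.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class PictureRef:
    """Hypothetical stand-in for a picture extracted from a page."""
    page_no: int
    png_bytes: bytes


@dataclass
class Chunk:
    """Hypothetical stand-in for a text chunk and the pages it covers."""
    text: str
    pages: set[int]
    images_b64: list[str] = field(default_factory=list)


def attach_images(chunks: list[Chunk], pictures: list[PictureRef]) -> None:
    """Attach each picture, base64-encoded, to every chunk on the same page."""
    for chunk in chunks:
        for picture in pictures:
            if picture.page_no in chunk.pages:
                encoded = base64.b64encode(picture.png_bytes).decode("ascii")
                chunk.images_b64.append(encoded)


chunks = [Chunk("Revenue grew in Q3.", {1}), Chunk("Appendix.", {2})]
pictures = [PictureRef(page_no=1, png_bytes=b"\x89PNG fake bytes")]
attach_images(chunks, pictures)
print(len(chunks[0].images_b64), len(chunks[1].images_b64))
```

Each chunk would then carry both its text and the images for the prompt, so the open question is how to get an equivalent mapping out of Docling's own chunk metadata.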

Your help or insights would be greatly appreciated!
Thanks in advance! 🙏

⸻

Replies: 0 comments
