Extracting Content from PDF with Docling - Need Guidance #1237
Unanswered
nikunjgoel95 asked this question in Q&A
Replies: 0 comments
Hi folks! 👋
I’m working on extracting content from a PDF and have tried a few different methods available online, such as Unstructured and Azure AI. Recently, Docling caught my attention, and I’ve been experimenting with it to parse documents and get annotations using smolvlm_picture_description. My goal is to put the annotations back into the same spot in the document.
For now, I’m focusing on this basic setup, but I plan to improve it in the future. Any guidance or suggestions would be greatly appreciated!
⸻
My Current Approach:
Here’s the code snippet I’m working with:
```python
from pathlib import Path

# Import paths follow the Docling v2 package layout; adjust them to your installed version.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    TesseractCliOcrOptions,
    smolvlm_picture_description,
)
from docling.datamodel.settings import settings
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc = Path("Random file.pdf")

# The PDF-specific options class carries the OCR, table, and picture flags set below.
pipeline_options = PdfPipelineOptions()

# Enabling various features
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.do_code_enrichment = True
pipeline_options.do_formula_enrichment = True
pipeline_options.do_picture_classification = True
pipeline_options.do_picture_description = True
pipeline_options.force_backend_text = True
pipeline_options.table_structure_options.do_cell_matching = True

# Image-related options
pipeline_options.images_scale = 2.0
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

# Picture description model configuration
pipeline_options.picture_description_options = smolvlm_picture_description
pipeline_options.picture_description_options.prompt = (
    "Describe the image in three sentences. Be concise and accurate."
)

# OCR configuration
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options

# Accelerator configuration
accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CUDA
)
pipeline_options.accelerator_options = accelerator_options

# Creating a document converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

# Enable profiling to measure time spent
settings.debug.profile_pipeline_timings = True

# Convert the document
conversion_result = converter.convert(input_doc)
doc = conversion_result.document

# Print the conversion results
md = doc.export_to_markdown()
print(md)

# List with total time per document
doc_conversion_secs = conversion_result.timings["pipeline_total"].times
print(f"Conversion secs: {doc_conversion_secs}")
```
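To make the "same spot" goal concrete, here is a minimal sketch of what it could look like for the Markdown export rather than the PDF itself. It assumes the exported Markdown marks each picture with the placeholder `<!-- image -->` and that the descriptions are already collected in document order; `insert_descriptions` and the blockquote format are hypothetical, not part of Docling.

```python
# Assumed placeholder text for pictures in the exported Markdown.
IMAGE_PLACEHOLDER = "<!-- image -->"


def insert_descriptions(markdown: str, descriptions: list[str]) -> str:
    """Place the n-th description right after the n-th image placeholder.

    Hypothetical helper: placeholders without a matching description
    are left unchanged.
    """
    parts = markdown.split(IMAGE_PLACEHOLDER)
    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        if i < len(descriptions):
            out.append(
                f"{IMAGE_PLACEHOLDER}\n\n> Image description: {descriptions[i]}"
            )
        else:
            out.append(IMAGE_PLACEHOLDER)
        out.append(part)
    return "".join(out)


if __name__ == "__main__":
    sample = "Intro\n\n<!-- image -->\n\nMiddle\n\n<!-- image -->\n\nEnd"
    print(insert_descriptions(sample, ["A bar chart of yearly revenue."]))
```

This only covers the Markdown output. Writing annotations back onto the PDF pages would need the picture's page number and bounding box, which this sketch does not handle.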
⸻
Questions:
1. Is there a better way to put the annotations back in the same spot in the PDF?
2. Any suggestions on optimizing performance or improving the annotation accuracy?
3. How can I effectively use images as part of the prompt? Specifically:
• How can I use Docling with chunking while still getting the base64-encoded image back with each chunk?
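For the last question, the shape of the result I have in mind is roughly the following: each chunk's text travels together with its pictures encoded as base64 data URIs. This is only a sketch of the target data structure using the standard library. `ChunkWithImages`, `to_data_uri`, and `attach_images` are hypothetical names, and how the raw image bytes for a chunk are obtained from Docling is exactly the part I'm unsure about.

```python
import base64
from dataclasses import dataclass, field


@dataclass
class ChunkWithImages:
    """Hypothetical container: chunk text plus its pictures as base64 data URIs."""

    text: str
    image_uris: list[str] = field(default_factory=list)


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def attach_images(chunk_text: str, images: list[bytes]) -> ChunkWithImages:
    """Bundle a chunk's text with the encoded pictures that belong to it."""
    return ChunkWithImages(
        text=chunk_text,
        image_uris=[to_data_uri(image) for image in images],
    )


if __name__ == "__main__":
    # Placeholder bytes stand in for a real picture extracted from the PDF.
    chunk = attach_images("Figure 1 shows revenue by year.", [b"\x89PNG"])
    print(chunk.text, chunk.image_uris)
```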
Your help or insights would be greatly appreciated!
Thanks in advance! 🙏
⸻