Trouble parsing a pdf brokerage statement #1899
I'm trying to parse a statement I get every month from Vanguard for my IRA. The PDF I'm experimenting with has 10 pages; all but the first page and the last two contain tables that I'd like to pull values out of. When I run the following code:
I get the following:
When I replace the last line with:
I get:
When I replace the last line with:
I get:
When I view this PDF, I can select, copy, and paste values into another file, but that's obviously tedious and time-consuming, which is what motivated me to try this library. Any suggestions for getting this document parsed with its actual values?
Replies: 1 comment
Hi @MichaelSzczepaniak! I'm Dosu and I’m helping the docling team.

This usually happens because Docling's default PDF backend (DoclingParseV4DocumentBackend) sometimes fails silently on certain PDFs, especially those from financial institutions. Since your PDF has selectable text, you can often fix this by explicitly switching to the PyPdfiumDocumentBackend, which is more robust for direct text extraction.

Here's how you can do it:

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pdf_path = "./statements/2025/2025-04-30Vanguard_Trad.pdf"

pipeline_options = PdfPipelineOptions(
    do_ocr=False,  # Set True if you suspect any scanned/image pages
    do_table_structure=True,  # Enables table extraction
)

# Override the default PDF backend with the pypdfium2-based one
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        )
    }
)

result = converter.convert(pdf_path)
print(result.document.export_to_markdown())
```

If tables are still not extracted correctly, try tweaking the pipeline options. If your PDF is actually image-based (scanned), set do_ocr=True. If you still get empty output, enable debug logging to see if there are backend errors, or try with a different PDF to rule out file-specific issues.

Let me know if you need help with any of these steps!
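For the debug-logging suggestion, here is a minimal sketch using only Python's standard `logging` module. It assumes Docling names its loggers after its modules, so that they all sit under a top-level `"docling"` logger; that name and the `enable_debug_logging` helper are illustrative assumptions, not something confirmed in this thread.

```python
import logging


def enable_debug_logging(logger_name: str = "docling") -> logging.Logger:
    """Turn on DEBUG-level output for one logger hierarchy.

    The default name "docling" is an assumption about how the library
    names its loggers; change it if messages do not show up.
    """
    # Attach a stderr handler to the root logger if none is configured yet.
    # basicConfig() does nothing when the root logger already has handlers.
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s: %(message)s")
    logger = logging.getLogger(logger_name)
    # Child loggers (e.g. "docling.backend") inherit this level.
    logger.setLevel(logging.DEBUG)
    return logger


docling_logger = enable_debug_logging()
```

Call this before `converter.convert(pdf_path)` so that any messages logged during conversion are printed. Setting the level on one named logger rather than the root keeps other libraries' debug output out of the way.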