Trouble parsing a pdf brokerage statement #1899

MichaelSzczepaniak · 2025-07-04T23:20:29Z


I'm trying to parse the monthly statement I get from Vanguard for my IRA. The PDF I'm experimenting with has 10 pages, with tables on every page except the first and the last two, and I'd like to pull values out of those tables. When I run the following code:

from docling.document_converter import DocumentConverter

# try my local pdf with the default converter settings
pdf_path = "./statements/2025/2025-04-30Vanguard_Trad.pdf"
converter = DocumentConverter()
result = converter.convert(pdf_path)
print(result.document.export_to_markdown())

I get the following:

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

<!-- image -->

When I replace the last line with:

print(result.document.export_to_text())

I get:

Parameter strict_text has been deprecated and will be ignored.

When I replace the last line with:

print(result.document.export_to_html())

I get:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>2025-04-30Vanguard_Trad</title>
<meta name="generator" content="Docling HTML Serializer">
<style>
    html {
        background-color: #f5f5f5;
        font-family: Arial, sans-serif;
        line-height: 1.6;
    }
    body {
        max-width: 800px;
        margin: 0 auto;
        padding: 2rem;
        background-color: white;
        box-shadow: 0 0 10px rgba(0,0,0,0.1);
    }
    h1, h2, h3, h4, h5, h6 {
        color: #333;
        margin-top: 1.5em;
        margin-bottom: 0.5em;
    }
    h1 {
        font-size: 2em;
        border-bottom: 1px solid #eee;
        padding-bottom: 0.3em;
    }
    table {
        border-collapse: collapse;
        margin: 1em 0;
        width: 100%;
    }
    th, td {
        border: 1px solid #ddd;
        padding: 8px;
        text-align: left;
    }
    th {
        background-color: #f2f2f2;
        font-weight: bold;
    }
    figure {
        margin: 1.5em 0;
        text-align: center;
    }
    figcaption {
        color: #666;
        font-style: italic;
        margin-top: 0.5em;
    }
    img {
        max-width: 100%;
        height: auto;
    }
    pre {
        background-color: #f6f8fa;
        border-radius: 3px;
        padding: 1em;
        overflow: auto;
    }
    code {
        font-family: monospace;
        background-color: #f6f8fa;
        padding: 0.2em 0.4em;
        border-radius: 3px;
    }
    pre code {
        background-color: transparent;
        padding: 0;
    }
    .formula {
        text-align: center;
        padding: 0.5em;
        margin: 1em 0;
        background-color: #f9f9f9;
    }
    .formula-not-decoded {
        text-align: center;
        padding: 0.5em;
        margin: 1em 0;
        background: repeating-linear-gradient(
            45deg,
            #f0f0f0,
            #f0f0f0 10px,
            #f9f9f9 10px,
            #f9f9f9 20px
        );
    }
    .page-break {
        page-break-after: always;
        border-top: 1px dashed #ccc;
        margin: 2em 0;
    }
    .key-value-region {
        background-color: #f9f9f9;
        padding: 1em;
        border-radius: 4px;
        margin: 1em 0;
    }
    .key-value-region dt {
        font-weight: bold;
    }
    .key-value-region dd {
        margin-left: 1em;
        margin-bottom: 0.5em;
    }
    .form-container {
        border: 1px solid #ddd;
        padding: 1em;
        border-radius: 4px;
        margin: 1em 0;
    }
    .form-item {
        margin-bottom: 0.5em;
    }
    .image-classification {
        font-size: 0.9em;
        color: #666;
        margin-top: 0.5em;
    }
</style>
</head>
<body>
<div class='page'>

</div>
</body>
</html>

When I view this PDF, I can select, copy, and paste values into another file, but that is tedious and time-consuming, which is what motivated me to try this library. Any suggestions for getting the actual values in this document parsed?

Answered by dosubot[bot]


@MichaelSzczepaniak · 2025-07-04T23:24:50Z

dosubot[bot]
bot Jul 4, 2025

Hi @MichaelSzczepaniak! I'm Dosu and I’m helping the docling team.

This usually happens because Docling's default PDF backend (DoclingParseV4DocumentBackend) can fail silently on certain PDFs, which seems to include some statements from financial institutions. Since your PDF has selectable text, you can often fix this by explicitly switching to PyPdfiumDocumentBackend, which tends to be more robust for direct text extraction.

Here's how you can do it:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pdf_path = "./statements/2025/2025-04-30Vanguard_Trad.pdf"

pipeline_options = PdfPipelineOptions(
    do_ocr=False,  # Set True if you suspect any scanned/image pages
    do_table_structure=True  # Enables table extraction
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend
        )
    }
)

result = converter.convert(pdf_path)
print(result.document.export_to_markdown())
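Since the goal is to pull values out of the tables, here is a minimal sketch of collecting each detected table as a pandas DataFrame once the conversion returns content. The helper names `tables_to_dataframes` and `convert_and_collect_tables` are made up for illustration; the sketch assumes pandas is installed and that `document.tables` and `TableItem.export_to_dataframe()` are available in your installed Docling version.

```python
def tables_to_dataframes(document):
    """Return one DataFrame per table that Docling detected in `document`."""
    # `document.tables` holds the table items of a converted document;
    # each item is exported on its own, in document order.
    return [table.export_to_dataframe() for table in document.tables]


def convert_and_collect_tables(pdf_path, converter):
    """Convert `pdf_path` with an existing converter and collect its tables."""
    result = converter.convert(pdf_path)
    return tables_to_dataframes(result.document)
```

With the `converter` and `pdf_path` defined above, `frames = convert_and_collect_tables(pdf_path, converter)` should give a list you can inspect one table at a time, for example with `frames[0].head()`. An empty list would suggest that no tables were detected at all.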

If tables are still not extracted correctly, try tweaking pipeline_options. For example, set pipeline_options.table_structure_options.do_cell_matching = False to use the table structure model's predicted cells instead of matching them back to the PDF's own text cells, which can help with complex tables.
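As a rough sketch of that tweak, the helper below builds a converter with cell matching switched off by default. The function name `build_table_converter` is hypothetical, and the Docling imports are deferred into the function body only so the helper can be defined before Docling is installed.

```python
def build_table_converter(do_cell_matching: bool = False):
    """Build a PDF converter with table cell matching turned on or off."""
    # Imports are deferred so defining this helper does not require Docling.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(do_ocr=False, do_table_structure=True)
    # False: keep the cells predicted by the table structure model.
    # True: map the predicted structure back onto the PDF's text cells.
    pipeline_options.table_structure_options.do_cell_matching = do_cell_matching

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )
```

Comparing the output of `build_table_converter(False)` and `build_table_converter(True)` on the same statement is probably the quickest way to see which setting suits its table layout.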

If your PDF is actually image-based (scanned), set do_ocr=True and install an OCR engine like EasyOCR (pip install easyocr) or Tesseract. For more on OCR and pipeline options, see this discussion.
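If OCR does turn out to be needed, a minimal sketch with EasyOCR could look like the following. The helper name `build_ocr_converter` is made up here, and the sketch assumes `easyocr` is installed and that `EasyOcrOptions` and its `force_full_page_ocr` field exist in your Docling version.

```python
def build_ocr_converter(force_full_page_ocr: bool = False):
    """Build a PDF converter that runs EasyOCR on pages that need it."""
    # Imports are deferred so defining this helper does not require Docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
    # force_full_page_ocr=True asks for OCR over whole pages rather than
    # only the regions without usable embedded text.
    pipeline_options.ocr_options = EasyOcrOptions(
        force_full_page_ocr=force_full_page_ocr
    )

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Because the statement here has selectable text, OCR is likely a fallback rather than the first thing to try, and it will make conversion noticeably slower.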

If you still get empty output, enable debug logging to see if there are backend errors, or try with a different PDF to rule out file-specific issues.
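One way to do that check is sketched below, assuming Docling's loggers live under the `docling` name and that the conversion result exposes `status` and `errors`. The helper names `is_effectively_empty` and `convert_with_debug` are hypothetical.

```python
import logging
import re

# Matches the placeholder Docling's Markdown export emits for each picture.
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def is_effectively_empty(markdown: str) -> bool:
    """True when the export holds only image placeholders and whitespace."""
    return IMAGE_PLACEHOLDER.sub("", markdown).strip() == ""


def convert_with_debug(pdf_path: str):
    """Convert with verbose Docling logging and report what came back."""
    # Deferred import so the pure helper above works without Docling.
    from docling.document_converter import DocumentConverter

    logging.basicConfig(level=logging.INFO)
    logging.getLogger("docling").setLevel(logging.DEBUG)

    result = DocumentConverter().convert(pdf_path)
    print("status:", result.status)
    for error in result.errors:
        print("error:", error)

    if is_effectively_empty(result.document.export_to_markdown()):
        print("No text was extracted; only image placeholders came back.")
    return result
```

Running `convert_with_debug(pdf_path)` on the statement and on any other text-based PDF should show whether the problem follows the file or the setup.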

Let me know if you need help with any of these steps!

