
Guidance on Optimal Chunking Configuration for LLM-Based Processing of Financial PDFs #1370

@igelfenbeyn

Description


Hello,

I’m working on processing a large number of loosely related PDF files, primarily financial statements such as balance sheets, income statements, and similar documents. In this project, I’m not defining a fixed ontology upfront; instead, I’m relying on the LLM to determine how to interpret and extract information from each document.

Given this use case, I’d like to know: what chunking configurations work best for this kind of unstructured, heterogeneous input?

Additionally, is there any documentation or best-practice guide that explains the trade-offs between using larger vs. smaller chunk sizes? I’m particularly interested in how chunk size impacts context retention, accuracy of entity/relation extraction, and overall performance when using LLMs for knowledge graph construction.
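For concreteness, here is a minimal sketch of the kind of configuration I’m asking about: a fixed-size chunker where `chunk_size` and `overlap` are the two parameters whose trade-offs I’d like to understand (all names here are illustrative, not taken from the library’s API):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a sliding overlap.

    Larger chunk_size keeps more context per LLM call; larger overlap
    reduces the chance of cutting an entity or relation at a boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a 2500-character document with 1000-char chunks and 200-char overlap
# yields 4 chunks of lengths 1000, 1000, 900, 100.
document = "x" * 2500
print([len(c) for c in chunk_text(document)])
```

My question is essentially how to choose these two values (and whether character-, token-, or structure-based splitting matters) for financial statements.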

Any advice or references would be greatly appreciated!

Thanks in advance.
