Hello,
I’m working on processing a large number of loosely related PDF files, primarily financial statements such as balance sheets, income statements, and similar documents. For this project, I’m not defining a fixed ontology upfront; instead, I’m relying on the LLM to decide how to interpret and extract information from each document.
Given this use case, I’d like to know: what are the best chunking configurations for this kind of unstructured, heterogeneous input?
Additionally, is there any documentation or best-practice guide that explains the trade-offs between using larger vs. smaller chunk sizes? I’m particularly interested in how chunk size impacts context retention, accuracy of entity/relation extraction, and overall performance when using LLMs for knowledge graph construction.
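To make the question concrete, here is the kind of naive baseline I’m starting from: a fixed-size character chunker with overlap (a minimal sketch; the size and overlap values are arbitrary placeholders I’m experimenting with, not recommendations):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Placeholder values: chunk_size and overlap are just what I'm
    currently trying, not tuned settings.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Example: a 2500-character document with the defaults above
doc = "x" * 2500
print([len(c) for c in chunk_text(doc)])  # → [1000, 1000, 900]
```

My worry is that a fixed-size splitter like this cuts across table rows and statement sections, which is exactly why I’m asking about chunk-size trade-offs for entity/relation extraction.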
Any advice or references would be greatly appreciated!
Thanks in advance.