I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it
Feature request
Add a max_chunk_size parameter to the HTMLHeaderTextSplitter class in langchain_text_splitters. This parameter would allow users to specify a maximum size for text chunks, ensuring that large sections of HTML content between headers are split into smaller, more manageable pieces.
Motivation
When working with large HTML documents (especially documents with few sections but a large amount of text per section), I've encountered situations where the HTMLHeaderTextSplitter produces excessively large chunks of text, particularly when there are large sections of content between headers. This can cause issues with:
Memory usage: Very large chunks can consume excessive memory.
Model token limits: Large chunks may exceed the token limits of language models, especially smaller ones.
Processing efficiency: Oversized chunks can slow down subsequent processing steps.
Currently, there's no built-in way to control the maximum size of these chunks within the HTMLHeaderTextSplitter class. Adding a max_chunk_size parameter would address these issues and provide more control over the splitting process.
Proposal (If applicable)
Implement the max_chunk_size feature as follows:
Add a max_chunk_size parameter to the HTMLHeaderTextSplitter constructor with a default value (e.g., 4000 characters).
Check the size of each chunk after the initial header-based splitting.
If a chunk exceeds max_chunk_size, use a RecursiveCharacterTextSplitter (or similar) to further split the chunk while preserving metadata.
Ensure that the splitting process maintains the original header structure and metadata as much as possible.
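To make the proposal concrete, here is a minimal self-contained sketch of the post-processing step described above. It is not LangChain code: the `Chunk` class stands in for LangChain's `Document`, and `split_oversized` mimics the greedy separator-based strategy of `RecursiveCharacterTextSplitter` in a few lines. The key point is that every sub-chunk receives a copy of the parent chunk's header metadata.

```python
# Hypothetical sketch of the proposed max_chunk_size post-processing step.
# Chunk, split_oversized, and enforce_max_chunk_size are illustrative names,
# not part of the langchain_text_splitters API.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_oversized(text: str, max_chunk_size: int) -> list[str]:
    """Greedily split on paragraph/line/sentence/word boundaries, falling
    back to a hard cut -- a rough stand-in for RecursiveCharacterTextSplitter."""
    if len(text) <= max_chunk_size:
        return [text]
    for sep in ("\n\n", "\n", ". ", " "):
        if sep in text:
            parts, current = [], ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= max_chunk_size:
                    current = candidate
                else:
                    if current:
                        parts.append(current)
                    current = piece
            if current:
                parts.append(current)
            # Recurse in case a single piece is still too large.
            return [s for p in parts for s in split_oversized(p, max_chunk_size)]
    # No usable separator: hard character cut.
    return [text[i:i + max_chunk_size]
            for i in range(0, len(text), max_chunk_size)]


def enforce_max_chunk_size(chunks: list[Chunk],
                           max_chunk_size: int) -> list[Chunk]:
    """Split any oversized chunk, copying its header metadata to each part."""
    out = []
    for chunk in chunks:
        for piece in split_oversized(chunk.text, max_chunk_size):
            out.append(Chunk(piece, dict(chunk.metadata)))
    return out
```

In today's library, roughly the same effect can already be achieved by chaining splitters, e.g. passing the output of `HTMLHeaderTextSplitter.split_text()` through `RecursiveCharacterTextSplitter.split_documents()`, which also preserves metadata; the feature request is essentially to fold that second pass into the class behind a single parameter.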