I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it
Feature request
Add a max_chunk_size parameter to the HTMLHeaderTextSplitter class in langchain_text_splitters. This parameter would allow users to specify a maximum size for text chunks, ensuring that large sections of HTML content between headers are split into smaller, more manageable pieces.
Motivation
When working with large HTML documents (especially documents with few sections but a large amount of text per section), I've encountered situations where the HTMLHeaderTextSplitter produces excessively large chunks of text, particularly when there are large sections of content between headers. This can cause issues with:
Memory usage: Very large chunks can consume excessive memory.
Model token limits: Large chunks may exceed the token limits of language models, especially smaller ones.
Processing efficiency: Oversized chunks can slow down subsequent processing steps.
Currently, there's no built-in way to control the maximum size of these chunks within the HTMLHeaderTextSplitter class. Adding a max_chunk_size parameter would address these issues and provide more control over the splitting process.
Proposal (If applicable)
Implement the max_chunk_size feature as follows:
Add a max_chunk_size parameter to the HTMLHeaderTextSplitter constructor with a default value (e.g., 4000 characters).
Check the size of each chunk after the initial header-based splitting.
If a chunk exceeds max_chunk_size, use a RecursiveCharacterTextSplitter (or similar) to further split the chunk while preserving metadata.
Ensure that the splitting process maintains the original header structure and metadata as much as possible.
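To make the proposal concrete, here is a minimal self-contained sketch of the post-processing step described above. It is not LangChain code: the `Chunk` class stands in for LangChain's `Document`, and `split_oversized` mimics the greedy separator-based strategy of `RecursiveCharacterTextSplitter` in a few lines. The key point is that every sub-chunk receives a copy of the parent chunk's header metadata.

```python
# Hypothetical sketch of the proposed max_chunk_size post-processing step.
# Chunk, split_oversized, and enforce_max_chunk_size are illustrative names,
# not part of the langchain_text_splitters API.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_oversized(text: str, max_chunk_size: int) -> list[str]:
    """Greedily split on paragraph/line/sentence/word boundaries, falling
    back to a hard cut -- a rough stand-in for RecursiveCharacterTextSplitter."""
    if len(text) <= max_chunk_size:
        return [text]
    for sep in ("\n\n", "\n", ". ", " "):
        if sep in text:
            parts, current = [], ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= max_chunk_size:
                    current = candidate
                else:
                    if current:
                        parts.append(current)
                    current = piece
            if current:
                parts.append(current)
            # Recurse in case a single piece is still too large.
            return [s for p in parts for s in split_oversized(p, max_chunk_size)]
    # No usable separator: hard character cut.
    return [text[i:i + max_chunk_size]
            for i in range(0, len(text), max_chunk_size)]


def enforce_max_chunk_size(chunks: list[Chunk],
                           max_chunk_size: int) -> list[Chunk]:
    """Split any oversized chunk, copying its header metadata to each part."""
    out = []
    for chunk in chunks:
        for piece in split_oversized(chunk.text, max_chunk_size):
            out.append(Chunk(piece, dict(chunk.metadata)))
    return out
```

In today's library, roughly the same effect can already be achieved by chaining splitters, e.g. passing the output of `HTMLHeaderTextSplitter.split_text()` through `RecursiveCharacterTextSplitter.split_documents()`, which also preserves metadata; the feature request is essentially to fold that second pass into the class behind a single parameter.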