Add max_chunk_length to SemanticChunker
#18802
Replies: 2 comments 1 reply
-
Exactly the issue I'm having... as soon as you try to index enough documents, the SemanticChunker will produce contexts that exceed the LLM window.
I guess for now I just need to use a different LLM with a bigger context window.
-
After facing the same problem, one solution could be to use the parameter
-
Feature request
Add a way to define the max length of the chunks produced by the SemanticChunker.
Motivation
I split a huge document into chunks using the SemanticChunker and then made some queries in my program, which uses the OpenAI API and the documents stored in the database to generate a prompt. I got an error because the chunks selected from the database were too long (probably because the author of the text had too much to say about this topic). So, defining a max chunk length would help to prevent that.
Proposal (If applicable)
Right now, I am solving it with this subclass. It does the same as the original SemanticChunker but, at the end, it splits each chunk longer than max_chunk_length into sentences. Then, it combines sentences to make a chunk as close as possible to max_chunk_length without exceeding it. When a sentence is about to make the chunk longer than max_chunk_length, it starts a new chunk and keeps combining the following sentences into that new chunk.
It is probably not a bad strategy, but maybe it is not the best way to solve it. Probably the best option would be something that splits the chunks longer than max_chunk_length using criteria more related to the meaning of the text. Some recursive call of the SemanticChunker, maybe?
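The poster's actual subclass is not included in the discussion, but a minimal sketch of the approach described above could look like the following, assuming langchain_experimental's SemanticChunker. The class name BoundedSemanticChunker and the max_chunk_length parameter are illustrative, not part of the library:

```python
import re
from typing import Any, List

from langchain_experimental.text_splitter import SemanticChunker


class BoundedSemanticChunker(SemanticChunker):
    """SemanticChunker whose chunks aim to stay under max_chunk_length characters."""

    def __init__(self, *args: Any, max_chunk_length: int = 4000, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.max_chunk_length = max_chunk_length

    def split_text(self, text: str) -> List[str]:
        # Start from the semantically determined chunks...
        chunks = super().split_text(text)
        bounded: List[str] = []
        for chunk in chunks:
            if len(chunk) <= self.max_chunk_length:
                bounded.append(chunk)
                continue
            # ...then break any oversized chunk into sentences and greedily
            # re-combine them without exceeding max_chunk_length.
            sentences = re.split(r"(?<=[.?!])\s+", chunk)
            current = ""
            for sentence in sentences:
                candidate = f"{current} {sentence}" if current else sentence
                if current and len(candidate) > self.max_chunk_length:
                    bounded.append(current)
                    current = sentence
                else:
                    current = candidate
            if current:
                bounded.append(current)
        return bounded
```

It would be used like the regular chunker, e.g. `BoundedSemanticChunker(OpenAIEmbeddings(), max_chunk_length=2000).split_text(text)`. One caveat of this greedy strategy: a single sentence longer than max_chunk_length would still come out as an oversized chunk, so a character-level fallback might be needed for pathological inputs.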