Efficiently rename header handling in ExperimentalMarkdownSyntaxTextSplitter for improved LLM understanding #26970

david101-hunter · 2024-09-29T04:05:42Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

DEFAULT_HEADER_KEYS = {
    "#": "Header 1",
    "##": "Header 2",
    "###": "Header 3",
    "####": "Header 4",
    "#####": "Header 5",
    "######": "Header 6",
}

Description

Background

I'm working with the ExperimentalMarkdownSyntaxTextSplitter from the LangChain library to prepare text for processing by an LLM (large language model). I want to improve how the splitter handles Markdown headers (h1 to h6) so that the LLM better understands the document structure.

In my experiments so far the results are not very good: the LLM often gives incomplete answers when the response requires multiple steps,...

Current Approach

Currently, I'm using the default ExperimentalMarkdownSyntaxTextSplitter without any customization. Its header mapping is defined in markdown.py of langchain_text_splitters (part of the LangChain library):

DEFAULT_HEADER_KEYS = {
    "#": "Header 1",
    "##": "Header 2",
    "###": "Header 3",
    "####": "Header 4",
    "#####": "Header 5",
    "######": "Header 6",
}

Desired Outcome

I want to refine the splitter so that it better recognizes and handles headers from h1 to h6, potentially by customizing the separator patterns. I would also like to know about any other ways to improve the LLM's answers.

Question

How can I modify the ExperimentalMarkdownSyntaxTextSplitter to improve its handling of Markdown headers (h1 to h6)? Specifically:

What custom separator patterns should I use to effectively capture all header levels?
Are there any other configuration options I should consider to enhance the splitting process for better LLM understanding?
How can I verify that the new splitting method is actually improving the document structure for the LLM?
How should we prompt the LLM to focus on these changes?
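
To make the first and third questions concrete, here is a minimal standard-library sketch of the kind of header-aware splitting I have in mind. It is not the implementation of ExperimentalMarkdownSyntaxTextSplitter. The regex, the HEADER_LABELS names, and the helpers split_by_headers and with_breadcrumb are all hypothetical and chosen only for illustration. It assumes ATX-style headers (# to ######) and ignores header-like lines inside fenced code blocks.

```python
import re

# Hypothetical, more descriptive labels than "Header 1" .. "Header 6".
HEADER_LABELS = {
    1: "Title",
    2: "Section",
    3: "Subsection",
    4: "Topic",
    5: "Subtopic",
    6: "Detail",
}

# ATX header: up to 3 leading spaces, 1-6 '#', whitespace, then the header text.
HEADER_RE = re.compile(r"^ {0,3}(#{1,6})[ \t]+(.+?)[ \t]*$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown into chunks, each tagged with the headers in scope."""
    chunks = []
    path = {}        # header level -> header text currently in scope
    body = []        # lines collected since the last header
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {HEADER_LABELS[lvl]: path[lvl] for lvl in sorted(path)}
            chunks.append({"metadata": metadata, "content": text})
        body.clear()

    for line in markdown.splitlines():
        # Toggle fence state so '#' lines inside code blocks are not headers.
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for lvl in [l for l in path if l >= level]:
                del path[lvl]
            path[level] = match.group(2)
        else:
            body.append(line)
    flush()
    return chunks


def with_breadcrumb(chunk: dict) -> str:
    """Prefix a chunk with its header trail so the LLM sees where it belongs."""
    trail = " > ".join(chunk["metadata"].values())
    return f"[{trail}]\n{chunk['content']}" if trail else chunk["content"]
```

One way to check the structure would be to print with_breadcrumb(chunk) for each chunk of a known document and confirm that every chunk carries the full header trail (h1 through the deepest header in scope) before it is sent to the LLM.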

System Info

Additional Information

LangChain version: 0.3.1
Python version: 3.11.9
Operating System: Ubuntu 20.04

Replies: 0 comments

Select a reply

Uh oh!

Efficiently rename header handling in ExperimentalMarkdownSyntaxTextSplitter for improved LLM understanding #26970

Uh oh!

david101-hunter Sep 29, 2024

Checked other resources

Commit to Help

Example Code

Description

Background

Current Approach

Desired Outcome

Question

System Info

Additional Information

Replies: 0 comments

david101-hunter
Sep 29, 2024