Feat : adds MarkdownChunker, docs, tests #242
Josephrp wants to merge 8 commits into chonkie-inc:main
Conversation
Summary of Changes
Hello @Josephrp, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces the MarkdownChunker to the chonkie library, a new utility designed for intelligent text splitting of Markdown content. This chunker leverages Markdown's structural elements like headings and separator lines to create meaningful chunks, with additional capabilities for cleaning inline formatting and ensuring chunks adhere to specified size limits. The changes include the core implementation, comprehensive documentation, and a dedicated test suite, enhancing the library's text processing capabilities for structured documents.
Highlights
- New MarkdownChunker Implementation: A robust MarkdownChunker class has been added, capable of splitting Markdown text based on structural elements like headings and horizontal rules, and recursively splitting larger sections to meet chunk_size requirements.
- Optional Text Cleaning: The MarkdownChunker includes an option to clean chunked text by stripping inline Markdown formatting (bold, italic, code), links, and URLs, making the output cleaner for downstream tasks.
- Comprehensive Documentation: A new documentation page (markdown-chunker.mdx) has been added, detailing the MarkdownChunker's purpose, parameters, usage examples, and internal workings. The main documentation navigation and overview page have also been updated.
- Dedicated Test Suite: A new set of unit tests (test_markdown_chunker.py) has been introduced to ensure the correct functionality, robustness, and parameter adherence of the MarkdownChunker.
- Package Integration: The MarkdownChunker has been properly integrated into the chonkie package structure, making it accessible from the top-level chonkie import.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Warning: Gemini encountered an error creating the review. You can try again by commenting

try the demo here: https://huggingface.co/spaces/Tonic/chonkie

/gemini review
Code Review
This pull request introduces a new MarkdownChunker, complete with documentation and tests. The implementation is well-structured and thoughtfully designed. My review focuses on a few areas for improvement: removing a piece of dead code, fixing a logic issue in a test helper function, enhancing test coverage for key features like separator splitting and combining small ranges, and improving documentation clarity. I've also flagged a file that may have been added unintentionally.
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    reconstructed = ""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
        reconstructed += chunk.text
    assert reconstructed == original_text
```
The text reconstruction logic in this helper function is likely to fail. The MarkdownChunker is designed to skip separators (like =====), which means there will be gaps between chunks when they are reconstructed. Therefore, reconstructed == original_text will not hold true. The per-chunk assertion that chunk.text matches the slice from original_text is sufficient for verifying index correctness. I suggest removing the reconstruction logic.
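The failure mode can be shown with a toy example (the string and the hand-picked indices below are illustrative, not actual chunker output):

```python
# Toy example: a "=====" separator sits between two chunks. The per-chunk
# index checks pass, but concatenating the chunks skips the separator.
text = "Intro\n=====\nBody"
chunks = [(0, 5, "Intro"), (12, 16, "Body")]  # (start_index, end_index, text)

for start, end, chunk_text in chunks:
    assert text[start:end] == chunk_text  # the per-chunk slice check holds

joined = "".join(c[2] for c in chunks)
assert joined != text  # "IntroBody" != original: reconstruction fails
print(joined)  # IntroBody
```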
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
```
wrong ai interpretation of this, adds metadata to the chunk based on sources, see demo
```python
            ranges.append((start, first_heading_start))

        # Add sentinel at end
        matches_with_end = matches + [(end, 0)]
```
intentional, trust me, you'll need this later (maybe) if you want to add configurations for forward or backward looking passes ;-)
removes empty file artifact
hello my frens & @chonk-lain, some merge conflicts, which is normal and expected after literally 3 months under review. should i take care of it, or is it possible for you to take care of it? btw let's close this as unplanned if it's not relevant anymore :-) with all my thanks!
hi @Josephrp

it's just documentation ;-)

/gemini review
Code Review
This pull request introduces a new MarkdownChunker, complete with documentation and tests. The overall contribution is valuable, but there are a few areas that need attention. The core implementation of the MarkdownChunker contains a critical bug in its heading-splitting logic that results in overlapping chunks and duplicated content. Additionally, the new documentation has some minor inconsistencies and omissions that should be addressed for clarity. I've provided detailed comments and suggestions to fix these issues.
```python
    def _split_on_markdown_headings(self, text: str, start: int, end: int) -> List[Tuple[int, int]]:
        """Split a section by Markdown headings with hierarchy-aware grouping.

        For level 1 headings (#), collect content until next level 1.
        For level >= 2, collect until next heading with level <= current.
        Also emits a leading prelude if content exists before the first heading.
        """
        section = text[start:end]
        heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)

        matches = [
            (start + m.start(), len(m.group(1)))
            for m in heading_re.finditer(section)
        ]

        if not matches:
            # No headings: return whole section (will be size-corrected later)
            return [(start, end)]

        ranges: List[Tuple[int, int]] = []
        # Prelude before first heading
        first_heading_start = matches[0][0]
        if first_heading_start > start and text[start:first_heading_start].strip():
            ranges.append((start, first_heading_start))

        # Add sentinel at end
        matches_with_end = matches + [(end, 0)]

        # Walk headings
        for idx in range(len(matches)):
            h_start, h_level = matches[idx]
            # Find boundary depending on level
            boundary = end
            for j in range(idx + 1, len(matches)):
                next_start, next_level = matches[j]
                if h_level == 1 and next_level == 1:
                    boundary = next_start
                    break
                if h_level >= 2 and next_level <= h_level:
                    boundary = next_start
                    break
            # Emit heading section
            if h_start < boundary and text[h_start:boundary].strip():
                ranges.append((h_start, boundary))

        return ranges
```
The current implementation of _split_on_markdown_headings has a critical flaw: it produces overlapping ranges. This leads to duplicated content in the final chunks, which can negatively impact downstream applications. The logic attempts a complex hierarchical grouping but fails to create disjoint sections.
I recommend replacing this with a simpler, more robust implementation that splits the text at each heading, ensuring the resulting ranges are always disjoint. This approach is correct and avoids data duplication. While it simplifies the grouping logic, correctness should be the priority. The docstring should also be updated to reflect this simpler behavior.
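The overlap is easy to reproduce. Below is a standalone sketch of the quoted hierarchy-aware logic (lifted out of the class with `self` dropped; otherwise the names follow the diff) run on a small two-level sample:

```python
import re
from typing import List, Tuple

def split_hierarchical(text: str, start: int, end: int) -> List[Tuple[int, int]]:
    """Standalone copy of the hierarchy-aware grouping from the diff."""
    section = text[start:end]
    heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)
    matches = [(start + m.start(), len(m.group(1))) for m in heading_re.finditer(section)]
    if not matches:
        return [(start, end)]
    ranges: List[Tuple[int, int]] = []
    first_heading_start = matches[0][0]
    if first_heading_start > start and text[start:first_heading_start].strip():
        ranges.append((start, first_heading_start))
    for idx in range(len(matches)):
        h_start, h_level = matches[idx]
        boundary = end
        for j in range(idx + 1, len(matches)):
            next_start, next_level = matches[j]
            if h_level == 1 and next_level == 1:
                boundary = next_start
                break
            if h_level >= 2 and next_level <= h_level:
                boundary = next_start
                break
        if h_start < boundary and text[h_start:boundary].strip():
            ranges.append((h_start, boundary))
    return ranges

doc = "# A\nintro\n## B\nbody\n"
print(split_hierarchical(doc, 0, len(doc)))  # [(0, 20), (10, 20)]
```

The `## B` section (offsets 10..20) is emitted twice: once inside the `# A` range and once on its own, which is the duplication described above.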
```python
    def _split_on_markdown_headings(self, text: str, start: int, end: int) -> List[Tuple[int, int]]:
        """Split a section by Markdown headings into disjoint ranges."""
        section = text[start:end]
        heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)
        ranges: List[Tuple[int, int]] = []
        last_split = start
        for match in heading_re.finditer(section):
            match_start = start + match.start()
            if match_start > last_split:
                if text[last_split:match_start].strip():
                    ranges.append((last_split, match_start))
            last_split = match_start
        if last_split < end:
            if text[last_split:end].strip():
                ranges.append((last_split, end))
        if not ranges and section.strip():
            return [(start, end)]
        return ranges
```

```python
    heading_level=3,  # max heading level to consider (1..6)
    min_characters_per_chunk=50,
    max_characters_per_section=4000,
    clean_text=True  # strip inline markdown and URLs from chunks
```
The initialization example shows clean_text=True, while the parameter description below correctly states that the default is False. This could be confusing for users. To improve clarity, I suggest adding a comment to the example to clarify that this is not the default value.
```python
    clean_text=True  # default: False; strip inline markdown and URLs from chunks
```
```
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker    | ✅      | ✅         | ✅    | ✅            |
| SentenceChunker | ✅      | ✅         | ✅    | ✅            |
| RecursiveChunker| ✅      | ✅         | ✅    | ✅            |
```
The new MarkdownChunker is missing from the 'Availability' table. Since it ships with the base library and has no extra dependencies, it should be marked as available for all installation options. I recommend adding it to the table for completeness, ideally after RecursiveChunker to maintain a logical order.
| RecursiveChunker| ✅ | ✅ | ✅ | ✅ |
| MarkdownChunker | ✅ | ✅ | ✅ | ✅ |
```
<Note>
Chonkie follows a special approach to dependencies, keeping the base installation lightweight while allowing you to add extra features as and when needed.
Please check the [Installation](/installation) page for more details.
```
There's an inconsistent link to the Installation page. Here, it's [Installation](/installation), but on line 160, a card links to /python-sdk/getting-started/installation. The latter seems more likely to be correct given the file's location in the directory structure. Please update this link for consistency.
Please check the [Installation](/python-sdk/getting-started/installation) page for more details.
```python
            return ""
        cleaned = chunk
        # Drop common prompt lines
        # cleaned = re.sub(r"^(?:Search Results for:|Question:|Query:).*$", "", cleaned, flags=re.MULTILINE)
```

/gemini review
Code Review
This pull request introduces the new MarkdownChunker, complete with implementation, tests, and documentation. The chunker itself is well-structured, following a multi-stage process to split Markdown content intelligently. My review has identified a few areas for improvement. There are some correctness issues in the Markdown cleaning logic due to problematic regular expressions, and a critical flaw in the test suite's text reconstruction logic. I've also found several inconsistencies and omissions in the new documentation files. Additionally, a significant but unrelated change to src/chonkie/__init__.py introduces an incomplete and incorrect __all__ definition, which I recommend reverting or fixing as it could break package imports. Overall, this is a great feature addition, and with these fixes, it will be a solid contribution.
```python
# Add basic package metadata to __all__
__all__ = [
    "__name__",
    "__version__",
    "__author__",
]

# Add all data classes to __all__
__all__ += [
    "Context",
    "Chunk",
    "RecursiveChunk",
    "RecursiveLevel",
    "RecursiveRules",
    "SentenceChunk",
    "SemanticChunk",
    "Sentence",
    "SemanticSentence",
    "LateChunk",
    "CodeChunk",
    "LanguageConfig",
    "MergeRule",
    "SplitRule",
]

# Add all tokenizer classes to __all__
__all__ += [
    "Tokenizer",
    "CharacterTokenizer",
    "WordTokenizer",
]

# Add all chunker classes to __all__
__all__ += [
    "BaseChunker",
    "TokenChunker",
    "SentenceChunker",
    "SemanticChunker",
    "RecursiveChunker",
    "LateChunker",
    "CodeChunker",
    "SlumberChunker",
    "NeuralChunker",
    "MarkdownChunker",
]

# Add all cloud classes to __all__
__all__ += [
    "auth",
    "chunker",
    "refineries",
]

# Add all embeddings classes to __all__
__all__ += [
    "BaseEmbeddings",
    "Model2VecEmbeddings",
    "SentenceTransformerEmbeddings",
    "OpenAIEmbeddings",
    "CohereEmbeddings",
    "GeminiEmbeddings",
    "AutoEmbeddings",
    "JinaEmbeddings",
    "VoyageAIEmbeddings",
]

# Add all refinery classes to __all__
__all__ += [
    "BaseRefinery",
    "OverlapRefinery",
    "EmbeddingsRefinery",
]

# Add all utils classes to __all__
__all__ += [
    "Hubbie",
    "Visualizer",
]

# Add all genie classes to __all__
__all__ += [
    "BaseGenie",
    "GeminiGenie",
    "OpenAIGenie",
]

# Add all friends classes to __all__
__all__ += [
    "BasePorter",
    "BaseHandshake",
    "JSONPorter",
    "ChromaHandshake",
    "MongoDBHandshake",
    "PgvectorHandshake",
    "PineconeHandshake",
    "QdrantHandshake",
    "WeaviateHandshake",
    "TurbopufferHandshake",
]

# Add all the chefs to __all__
__all__ += [
    "BaseChef",
    "TextChef",
]

# Add all the fetchers to __all__
__all__ += [
    "BaseFetcher",
    "FileFetcher",
]
```
This large addition to define __all__ is both incomplete and incorrect. It references many names that are not imported into this __init__.py file, and it omits many that are. This will break wildcard imports (from chonkie import *) for users.
Given that this change is out of scope for adding a MarkdownChunker and introduces critical issues, I recommend removing it entirely from this pull request. It can be addressed in a separate PR where it can be implemented correctly and thoroughly.
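The wildcard-import breakage is straightforward to demonstrate with a hypothetical stand-in module (`fake_chonkie` and the setup below are illustrative, not the real package):

```python
import sys
import types

# Simulate an __init__ whose __all__ lists a name that was never
# imported into the module ("SemanticChunker" is missing here).
mod = types.ModuleType("fake_chonkie")
mod.MarkdownChunker = type("MarkdownChunker", (), {})
mod.__all__ = ["MarkdownChunker", "SemanticChunker"]
sys.modules["fake_chonkie"] = mod

try:
    exec("from fake_chonkie import *", {})
except AttributeError as exc:
    # The missing __all__ entry surfaces as an AttributeError at import time.
    print(exc)
```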
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    reconstructed = ""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
        reconstructed += chunk.text
    assert reconstructed == original_text
```
The assertion assert reconstructed == original_text is incorrect. The chunking process removes separators (like =====) and doesn't guarantee contiguous chunks, so concatenating them will not reproduce the original text. This will cause the test to fail.
The first assertion, which checks if chunk.text matches the slice from original_text, is a valid check for clean_text=False. The reconstruction logic should be removed.
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
```

```python
            cleaned = re.sub(r"\*\*([^*]+)\*\*", r"\1", cleaned)
            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
```
The regular expressions used to strip inline Markdown are not robust. Using [^*]+ will fail for nested formatting (e.g., **bold *and italic***). Using a non-greedy match (.*?) is a more reliable approach. Also, the order of replacement matters: ** should be processed before * to avoid incorrect substitutions.
```diff
-            cleaned = re.sub(r"\*\*([^*]+)\*\*", r"\1", cleaned)
-            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
-            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
+            cleaned = re.sub(r"\*\*(.*?)\*\*", r"\1", cleaned)
+            cleaned = re.sub(r"\*(.*?)\*", r"\1", cleaned)
+            cleaned = re.sub(r"`(.*?)`", r"\1", cleaned)
```
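A quick check of the two approaches on a nested-formatting sample (the sample string is made up; the outputs were verified with Python's `re`):

```python
import re

text = "**bold *and italic***"

# Character-class patterns: the inner '*' blocks the '**' match,
# so stray asterisks are left behind.
naive = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)
naive = re.sub(r"\*([^*]+)\*", r"\1", naive)
print(naive)  # *bold and italic***

# Non-greedy patterns, with '**' handled before '*': markers come out cleanly.
lazy = re.sub(r"\*\*(.*?)\*\*", r"\1", text)
lazy = re.sub(r"\*(.*?)\*", r"\1", lazy)
print(lazy)  # bold and italic
```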
```python
            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
            # Links and URLs
            cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", cleaned)
```
The regex for stripping links \[([^\]]+)\]\([^)]+\) will fail if a URL contains parentheses, which is valid. Using a non-greedy match (.*?) for the URL part is more robust.
```diff
-            cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", cleaned)
+            cleaned = re.sub(r"\[([^\]]+)\]\(.*?\)", r"\1", cleaned)
```
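One caveat worth checking: a lazy match also stops at the first `)`, so a quick test suggests the two patterns behave the same on a URL containing parentheses (the sample string below is made up):

```python
import re

s = "see [Foo](https://en.wikipedia.org/wiki/Foo_(bar)) for details"

old = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", s)
new = re.sub(r"\[([^\]]+)\]\(.*?\)", r"\1", s)

# Both stop at the first ')' inside the URL, leaving a stray ')' behind.
print(old)  # see Foo) for details
print(new)  # see Foo) for details
```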
```python
    heading_level=3,  # max heading level to consider (1..6)
    min_characters_per_chunk=50,
    max_characters_per_section=4000,
    clean_text=True  # strip inline markdown and URLs from chunks
```
The initialization example shows clean_text=True, but the parameter description on line 39 and the actual implementation in src/chonkie/chunker/markdown.py both use False as the default. This inconsistency can be misleading for users. To ensure clarity, the example should reflect the actual default value.
```python
    clean_text=False  # strip inline markdown and URLs from chunks
```
```
| Chunker         | Default | embeddings | 'all' | Chonkie Cloud |
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker    | ✅      | ✅         | ✅    | ✅            |
| SentenceChunker | ✅      | ✅         | ✅    | ✅            |
| RecursiveChunker| ✅      | ✅         | ✅    | ✅            |
| TableChunker    | ✅      | ✅         | ✅    | ✅            |
| CodeChunker     | ❌      | ✅         | ✅    | ✅            |
| SemanticChunker | ❌      | ✅         | ✅    | ✅            |
| SDPMChunker     | ❌      | ✅         | ✅    | ✅            |
| LateChunker     | ❌      | ✅         | ✅    | ✅            |
| NeuralChunker   | ❌      | ✅         | ✅    | ✅            |
| SlumberChunker  | ❌      | ✅         | ✅    | ✅            |
```
The MarkdownChunker is missing from the availability table. Since it's part of the base library with no extra dependencies, it should be included in the table to accurately reflect its availability across different installation options.
| Chunker | Default | embeddings | 'all' | Chonkie Cloud |
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker | ✅ | ✅ | ✅ | ✅ |
| SentenceChunker | ✅ | ✅ | ✅ | ✅ |
| RecursiveChunker| ✅ | ✅ | ✅ | ✅ |
| MarkdownChunker | ✅ | ✅ | ✅ | ✅ |
| TableChunker | ✅ | ✅ | ✅ | ✅ |
| CodeChunker | ❌ | ✅ | ✅ | ✅ |
| SemanticChunker | ❌ | ✅ | ✅ | ✅ |
| SDPMChunker | ❌ | ✅ | ✅ | ✅ |
| LateChunker | ❌ | ✅ | ✅ | ✅ |
| NeuralChunker | ❌ | ✅ | ✅ | ✅ |
| SlumberChunker | ❌ | ✅ | ✅ | ✅ |
```
<Note>
Chonkie follows a special approach to dependencies, keeping the base installation lightweight while allowing you to add extra features as and when needed.
Please check the [Installation](/installation) page for more details.
```
```
1. Make a free account on [Chonkie Cloud](https://cloud.chonkie.ai)
2. Get your API key
3. Send your CHONK requests!
```
lmk if you want me to do anything based on the above. in my opinion this automated review is itself poor quality and nonsensical, but don't take my word for it, just check the demo. for documentation points, i think it's out of scope here, so that's fine. you know during christmas & new years, it's not easy to really prioritize this, hope you understand, but it's cool y'all haven't completely forgotten about it or ignored it, currently we're using my fork in all the deep research agents we make.
solves #241