Feat : adds MarkdownChunker, docs, tests #242
Josephrp wants to merge 8 commits into chonkie-inc:main
Conversation
Summary of Changes
Hello @Josephrp, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces the MarkdownChunker to the chonkie library, a new utility designed for intelligent text splitting of Markdown content. This chunker leverages Markdown's structural elements like headings and separator lines to create meaningful chunks, with additional capabilities for cleaning inline formatting and ensuring chunks adhere to specified size limits. The changes include the core implementation, comprehensive documentation, and a dedicated test suite, enhancing the library's text processing capabilities for structured documents.
Highlights
- New MarkdownChunker Implementation: A robust MarkdownChunker class has been added, capable of splitting Markdown text based on structural elements like headings and horizontal rules, and recursively splitting larger sections to meet chunk_size requirements.
- Optional Text Cleaning: The MarkdownChunker includes an option to clean chunked text by stripping inline Markdown formatting (bold, italic, code), links, and URLs, making the output cleaner for downstream tasks.
- Comprehensive Documentation: A new documentation page (markdown-chunker.mdx) has been added, detailing the MarkdownChunker's purpose, parameters, usage examples, and internal workings. The main documentation navigation and overview page have also been updated.
- Dedicated Test Suite: A new set of unit tests (test_markdown_chunker.py) has been introduced to ensure the correct functionality, robustness, and parameter adherence of the MarkdownChunker.
- Package Integration: The MarkdownChunker has been properly integrated into the chonkie package structure, making it accessible from the top-level chonkie import.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Warning: Gemini encountered an error creating the review. You can try again by commenting

try the demo here: https://huggingface.co/spaces/Tonic/chonkie

/gemini review
Code Review
This pull request introduces a new MarkdownChunker, complete with documentation and tests. The implementation is well-structured and thoughtfully designed. My review focuses on a few areas for improvement: removing a piece of dead code, fixing a logic issue in a test helper function, enhancing test coverage for key features like separator splitting and combining small ranges, and improving documentation clarity. I've also flagged a file that may have been added unintentionally.
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    reconstructed = ""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
        reconstructed += chunk.text
    assert reconstructed == original_text
```
The text reconstruction logic in this helper function is likely to fail. The MarkdownChunker is designed to skip separators (like =====), which means there will be gaps between chunks when they are reconstructed. Therefore, reconstructed == original_text will not hold true. The per-chunk assertion that chunk.text matches the slice from original_text is sufficient for verifying index correctness. I suggest removing the reconstruction logic.
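The failure mode can be shown with a toy example (the string and the hand-picked indices below are illustrative, not actual chunker output):

```python
# Toy example: a "=====" separator sits between two chunks. The per-chunk
# index checks pass, but concatenating the chunks skips the separator.
text = "Intro\n=====\nBody"
chunks = [(0, 5, "Intro"), (12, 16, "Body")]  # (start_index, end_index, text)

for start, end, chunk_text in chunks:
    assert text[start:end] == chunk_text  # the per-chunk slice check holds

joined = "".join(c[2] for c in chunks)
assert joined != text  # "IntroBody" != original: reconstruction fails
print(joined)  # IntroBody
```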
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
```
wrong ai interpretation of this, adds metadata to the chunk based on sources, see demo
```python
            ranges.append((start, first_heading_start))

        # Add sentinel at end
        matches_with_end = matches + [(end, 0)]
```
intentional, trust me, you'll need this later (maybe) if you want to add configurations for forward or backward looking passes ;-)
removes empty file artifact
hello my frens & @chonk-lain, some merge conflicts, which is normal and expected after literally 3 months under review. should i take care of it, or is it possible for you to take care of it? btw let's close this as unplanned if it's not relevant anymore :-) with all my thanks!
hi @Josephrp

it's just documentation ;-)

/gemini review
Code Review
This pull request introduces a new MarkdownChunker, complete with documentation and tests. The overall contribution is valuable, but there are a few areas that need attention. The core implementation of the MarkdownChunker contains a critical bug in its heading-splitting logic that results in overlapping chunks and duplicated content. Additionally, the new documentation has some minor inconsistencies and omissions that should be addressed for clarity. I've provided detailed comments and suggestions to fix these issues.
```python
    def _split_on_markdown_headings(self, text: str, start: int, end: int) -> List[Tuple[int, int]]:
        """Split a section by Markdown headings with hierarchy-aware grouping.

        For level 1 headings (#), collect content until next level 1.
        For level >= 2, collect until next heading with level <= current.
        Also emits a leading prelude if content exists before the first heading.
        """
        section = text[start:end]
        heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)

        matches = [
            (start + m.start(), len(m.group(1)))
            for m in heading_re.finditer(section)
        ]

        if not matches:
            # No headings: return whole section (will be size-corrected later)
            return [(start, end)]

        ranges: List[Tuple[int, int]] = []
        # Prelude before first heading
        first_heading_start = matches[0][0]
        if first_heading_start > start and text[start:first_heading_start].strip():
            ranges.append((start, first_heading_start))

        # Add sentinel at end
        matches_with_end = matches + [(end, 0)]

        # Walk headings
        for idx in range(len(matches)):
            h_start, h_level = matches[idx]
            # Find boundary depending on level
            boundary = end
            for j in range(idx + 1, len(matches)):
                next_start, next_level = matches[j]
                if h_level == 1 and next_level == 1:
                    boundary = next_start
                    break
                if h_level >= 2 and next_level <= h_level:
                    boundary = next_start
                    break
            # Emit heading section
            if h_start < boundary and text[h_start:boundary].strip():
                ranges.append((h_start, boundary))

        return ranges
```
The current implementation of _split_on_markdown_headings has a critical flaw: it produces overlapping ranges. This leads to duplicated content in the final chunks, which can negatively impact downstream applications. The logic attempts a complex hierarchical grouping but fails to create disjoint sections.
I recommend replacing this with a simpler, more robust implementation that splits the text at each heading, ensuring the resulting ranges are always disjoint. This approach is correct and avoids data duplication. While it simplifies the grouping logic, correctness should be the priority. The docstring should also be updated to reflect this simpler behavior.
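The overlap is easy to reproduce. Below is a standalone sketch of the quoted hierarchy-aware logic (lifted out of the class with `self` dropped; otherwise the names follow the diff) run on a small two-level sample:

```python
import re
from typing import List, Tuple

def split_hierarchical(text: str, start: int, end: int) -> List[Tuple[int, int]]:
    """Standalone copy of the hierarchy-aware grouping from the diff."""
    section = text[start:end]
    heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)
    matches = [(start + m.start(), len(m.group(1))) for m in heading_re.finditer(section)]
    if not matches:
        return [(start, end)]
    ranges: List[Tuple[int, int]] = []
    first_heading_start = matches[0][0]
    if first_heading_start > start and text[start:first_heading_start].strip():
        ranges.append((start, first_heading_start))
    for idx in range(len(matches)):
        h_start, h_level = matches[idx]
        boundary = end
        for j in range(idx + 1, len(matches)):
            next_start, next_level = matches[j]
            if h_level == 1 and next_level == 1:
                boundary = next_start
                break
            if h_level >= 2 and next_level <= h_level:
                boundary = next_start
                break
        if h_start < boundary and text[h_start:boundary].strip():
            ranges.append((h_start, boundary))
    return ranges

doc = "# A\nintro\n## B\nbody\n"
print(split_hierarchical(doc, 0, len(doc)))  # [(0, 20), (10, 20)]
```

The `## B` section (offsets 10..20) is emitted twice: once inside the `# A` range and once on its own, which is the duplication described above.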
```python
    def _split_on_markdown_headings(self, text: str, start: int, end: int) -> List[Tuple[int, int]]:
        """Split a section by Markdown headings into disjoint ranges."""
        section = text[start:end]
        heading_re = re.compile(r"^(#{1,6})\s+.+$", flags=re.MULTILINE)
        ranges: List[Tuple[int, int]] = []
        last_split = start
        for match in heading_re.finditer(section):
            match_start = start + match.start()
            if match_start > last_split:
                if text[last_split:match_start].strip():
                    ranges.append((last_split, match_start))
            last_split = match_start
        if last_split < end:
            if text[last_split:end].strip():
                ranges.append((last_split, end))
        if not ranges and section.strip():
            return [(start, end)]
        return ranges
```

```python
    heading_level=3,  # max heading level to consider (1..6)
    min_characters_per_chunk=50,
    max_characters_per_section=4000,
    clean_text=True  # strip inline markdown and URLs from chunks
```
The initialization example shows clean_text=True, while the parameter description below correctly states that the default is False. This could be confusing for users. To improve clarity, I suggest adding a comment to the example to clarify that this is not the default value.
```python
    clean_text=True  # default: False; strip inline markdown and URLs from chunks
```
```
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker    | ✅      | ✅         | ✅    | ✅            |
| SentenceChunker | ✅      | ✅         | ✅    | ✅            |
| RecursiveChunker| ✅      | ✅         | ✅    | ✅            |
```
The new MarkdownChunker is missing from the 'Availability' table. Since it ships with the base library and has no extra dependencies, it should be marked as available for all installation options. I recommend adding it to the table for completeness, ideally after RecursiveChunker to maintain a logical order.
| RecursiveChunker| ✅ | ✅ | ✅ | ✅ |
| MarkdownChunker | ✅ | ✅ | ✅ | ✅ |
```
<Note>
Chonkie follows a special approach to dependencies, keeping the base installation lightweight while allowing you to add extra features as and when needed.
Please check the [Installation](/installation) page for more details.
```
There's an inconsistent link to the Installation page. Here, it's [Installation](/installation), but on line 160, a card links to /python-sdk/getting-started/installation. The latter seems more likely to be correct given the file's location in the directory structure. Please update this link for consistency.
Please check the [Installation](/python-sdk/getting-started/installation) page for more details.
```python
            return ""
        cleaned = chunk
        # Drop common prompt lines
        # cleaned = re.sub(r"^(?:Search Results for:|Question:|Query:).*$", "", cleaned, flags=re.MULTILINE)
```

/gemini review
Code Review
This pull request introduces the new MarkdownChunker, complete with implementation, tests, and documentation. The chunker itself is well-structured, following a multi-stage process to split Markdown content intelligently. My review has identified a few areas for improvement. There are some correctness issues in the Markdown cleaning logic due to problematic regular expressions, and a critical flaw in the test suite's text reconstruction logic. I've also found several inconsistencies and omissions in the new documentation files. Additionally, a significant but unrelated change to src/chonkie/__init__.py introduces an incomplete and incorrect __all__ definition, which I recommend reverting or fixing as it could break package imports. Overall, this is a great feature addition, and with these fixes, it will be a solid contribution.
```python
# Add basic package metadata to __all__
__all__ = [
    "__name__",
    "__version__",
    "__author__",
]

# Add all data classes to __all__
__all__ += [
    "Context",
    "Chunk",
    "RecursiveChunk",
    "RecursiveLevel",
    "RecursiveRules",
    "SentenceChunk",
    "SemanticChunk",
    "Sentence",
    "SemanticSentence",
    "LateChunk",
    "CodeChunk",
    "LanguageConfig",
    "MergeRule",
    "SplitRule",
]

# Add all tokenizer classes to __all__
__all__ += [
    "Tokenizer",
    "CharacterTokenizer",
    "WordTokenizer",
]

# Add all chunker classes to __all__
__all__ += [
    "BaseChunker",
    "TokenChunker",
    "SentenceChunker",
    "SemanticChunker",
    "RecursiveChunker",
    "LateChunker",
    "CodeChunker",
    "SlumberChunker",
    "NeuralChunker",
    "MarkdownChunker",
]

# Add all cloud classes to __all__
__all__ += [
    "auth",
    "chunker",
    "refineries",
]

# Add all embeddings classes to __all__
__all__ += [
    "BaseEmbeddings",
    "Model2VecEmbeddings",
    "SentenceTransformerEmbeddings",
    "OpenAIEmbeddings",
    "CohereEmbeddings",
    "GeminiEmbeddings",
    "AutoEmbeddings",
    "JinaEmbeddings",
    "VoyageAIEmbeddings",
]

# Add all refinery classes to __all__
__all__ += [
    "BaseRefinery",
    "OverlapRefinery",
    "EmbeddingsRefinery",
]

# Add all utils classes to __all__
__all__ += [
    "Hubbie",
    "Visualizer",
]

# Add all genie classes to __all__
__all__ += [
    "BaseGenie",
    "GeminiGenie",
    "OpenAIGenie",
]

# Add all friends classes to __all__
__all__ += [
    "BasePorter",
    "BaseHandshake",
    "JSONPorter",
    "ChromaHandshake",
    "MongoDBHandshake",
    "PgvectorHandshake",
    "PineconeHandshake",
    "QdrantHandshake",
    "WeaviateHandshake",
    "TurbopufferHandshake",
]

# Add all the chefs to __all__
__all__ += [
    "BaseChef",
    "TextChef",
]

# Add all the fetchers to __all__
__all__ += [
    "BaseFetcher",
    "FileFetcher",
]
```
This large addition to define __all__ is both incomplete and incorrect. It references many names that are not imported into this __init__.py file, and it omits many that are. This will break wildcard imports (from chonkie import *) for users.
Given that this change is out of scope for adding a MarkdownChunker and introduces critical issues, I recommend removing it entirely from this pull request. It can be addressed in a separate PR where it can be implemented correctly and thoroughly.
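The wildcard-import breakage is straightforward to demonstrate with a hypothetical stand-in module (`fake_chonkie` and the setup below are illustrative, not the real package):

```python
import sys
import types

# Simulate an __init__ whose __all__ lists a name that was never
# imported into the module ("SemanticChunker" is missing here).
mod = types.ModuleType("fake_chonkie")
mod.MarkdownChunker = type("MarkdownChunker", (), {})
mod.__all__ = ["MarkdownChunker", "SemanticChunker"]
sys.modules["fake_chonkie"] = mod

try:
    exec("from fake_chonkie import *", {})
except AttributeError as exc:
    # The missing __all__ entry surfaces as an AttributeError at import time.
    print(exc)
```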
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    reconstructed = ""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
        reconstructed += chunk.text
    assert reconstructed == original_text
```
The assertion assert reconstructed == original_text is incorrect. The chunking process removes separators (like =====) and doesn't guarantee contiguous chunks, so concatenating them will not reproduce the original text. This will cause the test to fail.
The first assertion, which checks if chunk.text matches the slice from original_text, is a valid check for clean_text=False. The reconstruction logic should be removed.
```python
def verify_chunk_indices(chunks: List[Chunk], original_text: str) -> None:
    """Verify that chunk indices correctly map to the original text."""
    for i, chunk in enumerate(chunks):
        extracted = original_text[chunk.start_index : chunk.end_index]
        assert chunk.text == extracted, (
            f"Chunk {i} text mismatch\n"
            f"Chunk text: {repr(chunk.text)}\n"
            f"Slice text: {repr(extracted)}\n"
            f"Indices: [{chunk.start_index}:{chunk.end_index}]"
        )
```

```python
            cleaned = re.sub(r"\*\*([^*]+)\*\*", r"\1", cleaned)
            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
```
The regular expressions used to strip inline Markdown are not robust. Using [^*]+ will fail for nested formatting (e.g., **bold *and italic***). Using a non-greedy match (.*?) is a more reliable approach. Also, the order of replacement matters: ** should be processed before * to avoid incorrect substitutions.
```diff
-            cleaned = re.sub(r"\*\*([^*]+)\*\*", r"\1", cleaned)
-            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
-            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
+            cleaned = re.sub(r"\*\*(.*?)\*\*", r"\1", cleaned)
+            cleaned = re.sub(r"\*(.*?)\*", r"\1", cleaned)
+            cleaned = re.sub(r"`(.*?)`", r"\1", cleaned)
```
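A quick check of the two approaches on a nested-formatting sample (the sample string is made up; the outputs were verified with Python's `re`):

```python
import re

text = "**bold *and italic***"

# Character-class patterns: the inner '*' blocks the '**' match,
# so stray asterisks are left behind.
naive = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)
naive = re.sub(r"\*([^*]+)\*", r"\1", naive)
print(naive)  # *bold and italic***

# Non-greedy patterns, with '**' handled before '*': markers come out cleanly.
lazy = re.sub(r"\*\*(.*?)\*\*", r"\1", text)
lazy = re.sub(r"\*(.*?)\*", r"\1", lazy)
print(lazy)  # bold and italic
```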
```python
            cleaned = re.sub(r"\*([^*]+)\*", r"\1", cleaned)
            cleaned = re.sub(r"`([^`]+)`", r"\1", cleaned)
            # Links and URLs
            cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", cleaned)
```
The regex for stripping links \[([^\]]+)\]\([^)]+\) will fail if a URL contains parentheses, which is valid. Using a non-greedy match (.*?) for the URL part is more robust.
```diff
-            cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", cleaned)
+            cleaned = re.sub(r"\[([^\]]+)\]\(.*?\)", r"\1", cleaned)
```
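One caveat worth checking: a lazy match also stops at the first `)`, so a quick test suggests the two patterns behave the same on a URL containing parentheses (the sample string below is made up):

```python
import re

s = "see [Foo](https://en.wikipedia.org/wiki/Foo_(bar)) for details"

old = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", s)
new = re.sub(r"\[([^\]]+)\]\(.*?\)", r"\1", s)

# Both stop at the first ')' inside the URL, leaving a stray ')' behind.
print(old)  # see Foo) for details
print(new)  # see Foo) for details
```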
```python
    heading_level=3,  # max heading level to consider (1..6)
    min_characters_per_chunk=50,
    max_characters_per_section=4000,
    clean_text=True  # strip inline markdown and URLs from chunks
```
The initialization example shows clean_text=True, but the parameter description on line 39 and the actual implementation in src/chonkie/chunker/markdown.py both use False as the default. This inconsistency can be misleading for users. To ensure clarity, the example should reflect the actual default value.
```python
    clean_text=False  # strip inline markdown and URLs from chunks
```
```
| Chunker         | Default | embeddings | 'all' | Chonkie Cloud |
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker    | ✅      | ✅         | ✅    | ✅            |
| SentenceChunker | ✅      | ✅         | ✅    | ✅            |
| RecursiveChunker| ✅      | ✅         | ✅    | ✅            |
| TableChunker    | ✅      | ✅         | ✅    | ✅            |
| CodeChunker     | ❌      | ✅         | ✅    | ✅            |
| SemanticChunker | ❌      | ✅         | ✅    | ✅            |
| SDPMChunker     | ❌      | ✅         | ✅    | ✅            |
| LateChunker     | ❌      | ✅         | ✅    | ✅            |
| NeuralChunker   | ❌      | ✅         | ✅    | ✅            |
| SlumberChunker  | ❌      | ✅         | ✅    | ✅            |
```
The MarkdownChunker is missing from the availability table. Since it's part of the base library with no extra dependencies, it should be included in the table to accurately reflect its availability across different installation options.
| Chunker | Default | embeddings | 'all' | Chonkie Cloud |
|-----------------|:-------:|:----------:|:-----:|:-------------:|
| TokenChunker | ✅ | ✅ | ✅ | ✅ |
| SentenceChunker | ✅ | ✅ | ✅ | ✅ |
| RecursiveChunker| ✅ | ✅ | ✅ | ✅ |
| MarkdownChunker | ✅ | ✅ | ✅ | ✅ |
| TableChunker | ✅ | ✅ | ✅ | ✅ |
| CodeChunker | ❌ | ✅ | ✅ | ✅ |
| SemanticChunker | ❌ | ✅ | ✅ | ✅ |
| SDPMChunker | ❌ | ✅ | ✅ | ✅ |
| LateChunker | ❌ | ✅ | ✅ | ✅ |
| NeuralChunker | ❌ | ✅ | ✅ | ✅ |
| SlumberChunker | ❌ | ✅ | ✅ | ✅ |
```
<Note>
Chonkie follows a special approach to dependencies, keeping the base installation lightweight while allowing you to add extra features as and when needed.
Please check the [Installation](/installation) page for more details.
```
```
1. Make a free account on [Chonkie Cloud](https://cloud.chonkie.ai)
2. Get your API key
3. Send your CHONK requests!
```
lmk if you want me to do anything based on the above. in my opinion this automated review is itself poor quality and nonsensical, but don't take my word for it, just check the demo. for documentation points, i think it's out of scope here, so that's fine. you know during christmas & new years, it's not easy to really prioritize this, hope you understand, but it's cool y'all haven't completely forgotten about it or ignored it, currently we're using my fork in all the deep research agents we make.
solves #241