Showcase: Advanced Markdown Chunker – content-aware Markdown chunking for Dify RAG #29635
asukhodko started this conversation in Show and tell
Hi everyone,
I’ve been working with Markdown-based knowledge bases in Dify for a while, and I kept bumping into the same issues with naive chunking: code blocks getting split in the middle, nested lists broken apart, and chunks that don’t really match how the doc is structured.
So I ended up building a plugin called Advanced Markdown Chunker and wanted to share it here.
The idea is to make Markdown chunking a bit smarter and more aligned with real-world docs. Instead of using a single fixed strategy, the plugin looks at the document and decides which of four internal strategies fits best for that particular file:
On top of that, it tries hard not to break Markdown in silly places. Code blocks, tables and lists are kept intact, and chunks “remember” which headers they belong to so you don’t lose the document structure when you index it. Neighbouring chunks also share some overlap (up to ~35%) so context doesn’t abruptly stop at a boundary.
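To make the "don't break Markdown in silly places" idea concrete, here is a minimal sketch of fence-aware chunking with header tracking. It is not the plugin's actual code: the names `Chunk`, `chunk_markdown` and `max_chars` are made up for illustration, it only handles fenced code blocks and ATX (`#`) headers, and it leaves out tables, lists and the overlap between neighbouring chunks.

```python
import re
from dataclasses import dataclass

# Hypothetical helper names; an illustration only, not the plugin's API.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})")       # opening code fence
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")    # ATX header

@dataclass
class Chunk:
    header_path: list[str]   # headers this chunk sits under, outermost first
    text: str

def chunk_markdown(md: str, max_chars: int = 200) -> list[Chunk]:
    """Split Markdown at headers and blank lines, never inside a code fence."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []   # stack of (level, title)
    buf: list[str] = []
    fence = None                       # fence string while inside a code block

    def flush() -> None:
        text = "\n".join(buf).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        buf.clear()

    for line in md.splitlines():
        if fence is not None:
            # Inside a code block: keep every line, look only for the close.
            buf.append(line)
            stripped = line.strip()
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                fence = None
            continue

        opening = FENCE_RE.match(line)
        if opening:
            fence = opening.group(1)
            buf.append(line)
            continue

        header = HEADER_RE.match(line)
        if header:
            flush()                    # previous text belongs to the old path
            level = len(header.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, header.group(2)))
            buf.append(line)
            continue

        # A blank line is the only other place a chunk may end.
        if not line.strip() and len("\n".join(buf)) >= max_chars:
            flush()
        else:
            buf.append(line)

    flush()
    return chunks
```

Calling `chunk_markdown(text, max_chars=500)` returns a list of `Chunk` objects. A code block longer than `max_chars` still comes back in one piece, because the size check only runs outside fences, and each chunk's `header_path` records which section it came from.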
Each chunk can optionally include some metadata like:
That tends to help a lot when you’re debugging retrieval or building filters on top of a vector store. All parsing and chunking run fully locally inside Dify – no external APIs involved and no LLM needed just to split Markdown.
This is mainly aimed at Markdown-heavy RAG setups: docs, API/SDK guides with a lot of code, changelogs / release notes, technical specs, architecture docs, etc.
Links: