feat: add markdown support to codebase indexing (#4660) #5378
The implementation looks solid overall. However, I noticed that it doesn't currently handle splitting very large paragraphs in markdown files. This results in large chunks being saved to Qdrant, which could lead to excessively long search results. It's worth noting the current chunking behavior for different file types:
Given this, we may want to consider using the already implemented chunking logic or updating the markdown one to split large paragraphs.
tree-sitter supports all kinds of Markdown nodes, from sections to code, lists and list items, thematic breaks, tables, rows, and cells (ref. https://github.com/tree-sitter-grammars/tree-sitter-markdown). Just breaking up by section, lists, and code blocks (inline code examples) might get us a long way to good enough. @MuriloFP
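As a rough illustration of the "break up by section" idea, here is a minimal sketch that splits a document on ATX headers with a regular expression. It does not use tree-sitter, and splitByHeaders, MarkdownSection, and the markdown_content type are hypothetical names chosen for this example; only the markdown_header_hN naming follows the PR description.

```typescript
// Hypothetical shape for one section; not the project's actual type.
interface MarkdownSection {
  type: string;      // "markdown_header_h1".."markdown_header_h6", or "markdown_content" before any header
  title: string;
  startLine: number; // 1-based
  endLine: number;   // 1-based, inclusive
  content: string;
}

const FENCE = /^\s*(`{3}|~{3})/;       // opening or closing line of a fenced code block
const ATX_HEADER = /^(#{1,6})\s+(.*)$/; // "# Title" through "###### Title"

function splitByHeaders(markdown: string): MarkdownSection[] {
  const lines = markdown.split("\n");
  const sections: MarkdownSection[] = [];
  let type = "markdown_content";
  let title = "";
  let start = 0;       // 0-based index of the first line of the open section
  let inFence = false; // true while inside a fenced code block

  // Close the open section at line index `end` (exclusive), skipping blank-only sections.
  const close = (end: number): void => {
    const content = lines.slice(start, end).join("\n");
    if (content.trim().length > 0) {
      sections.push({ type, title, startLine: start + 1, endLine: end, content });
    }
  };

  lines.forEach((line, i) => {
    if (FENCE.test(line)) inFence = !inFence;
    // A "#" line inside a code fence is code, not a header.
    const m = inFence ? null : ATX_HEADER.exec(line);
    if (m) {
      close(i);
      type = `markdown_header_h${m[1].length}`;
      title = m[2].trim();
      start = i;
    }
  });
  close(lines.length);
  return sections;
}
```

A grammar-based parser would also recognize setext headers, lists, and tables; the regex version only shows how section boundaries and header levels can drive chunking.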
I noticed some test overlap between this file and src/services/tree-sitter/__tests__/markdownIntegration.spec.ts. Both test:
- Header parsing (basic and mixed styles)
- Files without headers
- Minimum section length handling
The overlap isn't necessarily a problem since they're testing different layers (indexing vs tree-sitter), but it might be cleaner to focus each suite on its core purpose:
- This file: code indexing features like hash generation, fallback chunking, and MIN_BLOCK_CHARS logic
- markdownIntegration.spec.ts: integration-level behavior
This would reduce redundancy and make the intent of each suite clearer. What do you think?
…duplication issue (RooCodeInc#4660)
- Modified parseMarkdownContent to chunk large sections (>1150 chars)
- Added support for chunking header-less markdown files
- Fixed _chunkTextByLines to handle oversized lines properly
- Added defensive check for parseMarkdown returning undefined
- Fixed Qdrant ID generation to use segmentHash instead of file:line
  - This was the root cause: chunks were being deduplicated
  - Each chunk now gets a unique ID even from the same line
- Added comprehensive tests for all edge cases
- Ensures all markdown content is properly indexed in Qdrant
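The line-based chunking with oversized-line handling described in this commit can be sketched roughly as follows. This is not the project's _chunkTextByLines; chunkByLines and MAX_CHARS are hypothetical names, and the 1150 default is simply the threshold quoted in the commit message.

```typescript
// Threshold taken from the commit message; treat it as an assumption here.
const MAX_CHARS = 1150;

// Greedily pack whole lines into chunks of at most maxChars characters.
// A single line longer than maxChars is hard-split into fixed-size pieces.
function chunkByLines(text: string, maxChars: number = MAX_CHARS): string[] {
  const chunks: string[] = [];
  let buffer: string[] = [];
  let size = 0; // length of buffer.join("\n")

  const flush = (): void => {
    const joined = buffer.join("\n");
    if (joined.trim().length > 0) chunks.push(joined); // drop blank-only chunks
    buffer = [];
    size = 0;
  };

  for (const line of text.split("\n")) {
    if (line.length > maxChars) {
      // Oversized line: emit what we have, then slice the line itself.
      flush();
      for (let i = 0; i < line.length; i += maxChars) {
        chunks.push(line.slice(i, i + maxChars));
      }
      continue;
    }
    // The extra 1 accounts for the newline that join() inserts between lines.
    const added = line.length + (buffer.length > 0 ? 1 : 0);
    if (size + added > maxChars) flush();
    buffer.push(line);
    size += line.length + (buffer.length > 1 ? 1 : 0);
  }
  flush();
  return chunks;
}
```

Note that a hard-split line can yield several chunks that all start on the same source line, which is exactly the case where an ID derived from file path and line number stops being unique.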
…m/MuriloFP/Roo-Code into feat/issue-4660-markdown-indexing
Update: Fixed Qdrant Deduplication Issue
I've pushed a critical fix that addresses the root cause of the markdown chunking problem.
The Issue
While the parser was correctly creating multiple chunks for large markdown sections, these chunks were being deduplicated in Qdrant due to identical point IDs being generated for chunks from the same line.
The Fix
Changes Made
The fix has been verified to work correctly: large markdown files are now properly chunked and all chunks are searchable in the codebase index.
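To make the deduplication concrete, the sketch below models an upsert-by-ID store with a plain Map and compares two ID schemes. Everything here is illustrative: Block, idByLine, idBySegment, and upsertAll are hypothetical names, the Map is only a stand-in for Qdrant's behavior of overwriting a point that reuses an existing ID, and the FNV-1a function is a dependency-free stand-in for whatever hash the project actually uses for segmentHash.

```typescript
// Hypothetical block shape for this example.
interface Block {
  filePath: string;
  startLine: number;
  content: string;
}

// 32-bit FNV-1a; a stand-in for a real content hash, not the project's algorithm.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// Old scheme: two chunks starting on the same line share an ID.
const idByLine = (b: Block): string => `${b.filePath}:${b.startLine}`;

// New scheme: the ID also depends on the chunk's content.
const idBySegment = (b: Block): string =>
  fnv1a(`${b.filePath}:${b.startLine}:${b.content}`);

// Model of upsert semantics: writing an existing ID replaces the stored entry.
function upsertAll(blocks: Block[], idOf: (b: Block) => string): Map<string, Block> {
  const store = new Map<string, Block>();
  for (const b of blocks) store.set(idOf(b), b);
  return store;
}

// Two chunks produced from one oversized line.
const blocks: Block[] = [
  { filePath: "README.md", startLine: 1, content: "first half of a very long line" },
  { filePath: "README.md", startLine: 1, content: "second half of the same line" },
];

console.log(upsertAll(blocks, idByLine).size);    // line-based IDs collide
console.log(upsertAll(blocks, idBySegment).size); // content-based IDs stay distinct
```

With line-based IDs the second chunk silently replaces the first, so only part of the section is searchable; folding the content into the ID keeps both.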
As identified in PR review, the supported-extensions.spec.ts file only tests the contents of an array, which is already implicitly covered by the functional tests in parser.spec.ts. Removing this reduces maintenance overhead without sacrificing test quality.
- Remove redundant test 'should handle large markdown documentation folders efficiently' that only verified the scanner could iterate over mocked files
- Add test to verify unique point IDs are generated for each block from the same file, ensuring the segmentHash-based ID generation prevents collisions
- Add segmentHash to payload in scanner.ts to fix vector point ID generation
- Split parser.spec.ts tests into focused unit tests (mocked dependencies)
- Move integration tests to new markdownIntegration.spec.ts file
- Each test suite now has clear, distinct responsibilities
- Fixes issue RooCodeInc#4660: vector point ID collisions for large Markdown files
117697a to 0768c5e
daniel-lxs left a comment
LGTM

Related GitHub Issue
Closes: #4660
Roo Code Task Context (Optional)
This PR was created using Roo Code's issue-fixer-or orchestration mode, which coordinated analysis, implementation, testing, and review across multiple specialized subtasks.
Description
This PR adds support for indexing Markdown files (.md and .markdown) in the codebase indexing feature. Previously, markdown files were explicitly excluded from indexing, preventing users from searching through documentation content.
Key Implementation Details:
- Updated supported-extensions.ts to include .md and .markdown files in scannerExtensions
- Markdown parsing in parser.ts by leveraging the existing parseMarkdown() function from the tree-sitter service
- Header-level block types (markdown_header_h1, markdown_header_h2, etc.) for improved semantic search quality
Reviewers should pay attention to:
Test Procedure
Unit Testing:
cd src
npx vitest services/code-index/shared/__tests__/supported-extensions.spec.ts
npx vitest services/code-index/processors/__tests__/parser.spec.ts
npx vitest services/code-index/processors/__tests__/scanner.spec.ts
Integration Testing:
cd src
npx vitest services/code-index/
Manual Testing Steps:
✅ Markdown file detection (.md and .markdown extensions)
✅ Header extraction and section chunking
✅ Minimum size requirements enforcement
✅ Fallback chunking for headerless markdown
✅ Unique segment hash generation
✅ Header level classification
✅ Edge cases (empty files, malformed content, mixed content)
Pre-Submission Checklist
Screenshots / Videos
N/A - This is a backend feature enhancement with no UI changes.
Documentation Updates
[x] No documentation updates are required.
The feature enhancement is transparent to users - markdown files will now automatically be included in codebase indexing without requiring any configuration changes.
Additional Notes
Performance Impact: Testing shows no significant performance degradation. Markdown parsing is lightweight and the existing tree-sitter infrastructure handles the additional file types efficiently.
Backward Compatibility: All existing functionality remains unchanged. This is a pure feature addition with no breaking changes.
Future Enhancements: This implementation provides a foundation for potentially adding other documentation formats (e.g., .rst, .adoc) in the future using the same pattern.
Get in Touch
Discord: @MuriloFP
Important
Adds Markdown support to codebase indexing, including parsing and chunking logic, with comprehensive tests.
- Adds support for .md and .markdown files by removing exclusion in supported-extensions.ts.
- Parses Markdown in parser.ts using parseMarkdown().
- Adds tests in parser.spec.ts for Markdown chunking, header extraction, and edge cases.
- Updates scanner.spec.ts to ensure Markdown files are processed alongside code files.
- Uses segmentHash for unique ID generation in scanner.ts to handle multiple segments from the same file.
This description was created for 0768c5e.