feat: add markdown support to codebase indexing (#4660) #5378

MuriloFP · 2025-07-03T17:10:19Z

Related GitHub Issue

Closes: #4660

Roo Code Task Context (Optional)

This PR was created using Roo Code's issue-fixer-or orchestration mode, which coordinated analysis, implementation, testing, and review across multiple specialized subtasks.

Description

This PR adds support for indexing Markdown files (.md and .markdown) in the codebase indexing feature. Previously, markdown files were explicitly excluded from indexing, preventing users from searching through documentation content.

Key Implementation Details:

- Extension Filter: Removed the explicit markdown filter in supported-extensions.ts so that .md and .markdown files are included in scannerExtensions.
- Parser Integration: Added markdown file detection and processing in parser.ts, reusing the existing parseMarkdown() function from the tree-sitter service.
- Semantic Chunking: Implemented section chunking based on markdown headers, with header level classification (markdown_header_h1, markdown_header_h2, etc.) to improve semantic search quality.
- Fallback Logic: Added fallback chunking for headerless markdown files so that all content is indexable.
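
To make the chunking idea above concrete, here is a minimal sketch of header-based sectioning with a headerless fallback. It is not the code from parser.ts: the `MarkdownChunk` shape, the `chunkMarkdownByHeaders` name, the `markdown_content` fallback label, and the regex-based header detection are all assumptions for illustration (the PR itself relies on parseMarkdown() from the tree-sitter service). It only recognizes ATX (`#`) headers.

```typescript
// Hypothetical chunk shape; field names are illustrative, not taken from parser.ts.
interface MarkdownChunk {
  type: string; // "markdown_header_h1".."markdown_header_h6", or "markdown_content" (assumed fallback label)
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  content: string;
}

const MIN_CHUNK_CHARS = 100; // assumed minimum section size

function chunkMarkdownByHeaders(source: string, minChars: number = MIN_CHUNK_CHARS): MarkdownChunk[] {
  const lines = source.split("\n");
  const headers: { index: number; level: number }[] = [];
  let inFence = false;

  for (let index = 0; index < lines.length; index++) {
    const line = lines[index] ?? "";
    // Toggle on fence markers so "#" lines inside code blocks are not treated as headers.
    if (/^\s*(`{3}|~{3})/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const match = /^(#{1,6})\s+\S/.exec(line);
    if (match) headers.push({ index, level: (match[1] ?? "").length });
  }

  const chunks: MarkdownChunk[] = [];
  // Emit lines[start, end) as one chunk if it meets the minimum size.
  const pushChunk = (type: string, start: number, end: number): void => {
    const content = lines.slice(start, end).join("\n");
    if (content.trim().length >= minChars) {
      chunks.push({ type, startLine: start + 1, endLine: end, content });
    }
  };

  const first = headers[0];
  if (first === undefined) {
    // Fallback: no headers at all, so index the whole file as one block.
    pushChunk("markdown_content", 0, lines.length);
    return chunks;
  }
  // Content that appears before the first header.
  if (first.index > 0) pushChunk("markdown_content", 0, first.index);

  headers.forEach((header, i) => {
    const next = headers[i + 1];
    const end = next ? next.index : lines.length;
    pushChunk(`markdown_header_h${header.level}`, header.index, end);
  });
  return chunks;
}
```

Each section runs from its header to the line before the next header of any level, so nested subsections become their own chunks rather than being folded into the parent.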

Reviewers should pay attention to:

The seamless integration with the existing tree-sitter pipeline
The comprehensive test coverage including edge cases
The header level extraction logic that enhances search semantics

Test Procedure

Unit Testing:

cd src
npx vitest services/code-index/shared/__tests__/supported-extensions.spec.ts
npx vitest services/code-index/processors/__tests__/parser.spec.ts  
npx vitest services/code-index/processors/__tests__/scanner.spec.ts

Integration Testing:

cd src
npx vitest services/code-index/

Manual Testing Steps:

1. Create a directory with mixed content (.ts, .js, .md files).
2. Run the codebase indexing on this directory.
3. Verify that markdown files are now processed and indexed.
4. Test semantic search on markdown content to ensure headers and sections are properly chunked.

Test Coverage Verification:

✅ Markdown file detection (.md and .markdown extensions)
✅ Header extraction and section chunking
✅ Minimum size requirements enforcement
✅ Fallback chunking for headerless markdown
✅ Unique segment hash generation
✅ Header level classification
✅ Edge cases (empty files, malformed content, mixed content)

Pre-Submission Checklist

Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
Scope: My changes are focused on the linked issue (one major feature/fix per PR).
Self-Review: I have performed a thorough self-review of my code.
Testing: New and/or updated tests have been added to cover my changes (if applicable).
Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

N/A - This is a backend feature enhancement with no UI changes.

Documentation Updates

[x] No documentation updates are required.
The feature enhancement is transparent to users - markdown files will now automatically be included in codebase indexing without requiring any configuration changes.

Additional Notes

Performance Impact: Testing shows no significant performance degradation. Markdown parsing is lightweight and the existing tree-sitter infrastructure handles the additional file types efficiently.

Backward Compatibility: All existing functionality remains unchanged. This is a pure feature addition with no breaking changes.

Future Enhancements: This implementation provides a foundation for potentially adding other documentation formats (e.g., .rst, .adoc) in the future using the same pattern.

Get in Touch

Discord: @MuriloFP

Important

Adds Markdown support to codebase indexing, including parsing and chunking logic, with comprehensive tests.

Behavior:
- Adds support for indexing .md and .markdown files by removing exclusion in supported-extensions.ts.
- Integrates Markdown parsing in parser.ts using parseMarkdown().
- Implements chunking based on headers and fallback logic for headerless files.
Tests:
- Adds tests in parser.spec.ts for Markdown chunking, header extraction, and edge cases.
- Adds tests in scanner.spec.ts to ensure Markdown files are processed alongside code files.
Misc:
- Uses segmentHash for unique ID generation in scanner.ts to handle multiple segments from the same file.

This description was created by the ellipsis-dev bot for 0768c5e. You can customize this summary. It will automatically update as commits are pushed.

daniel-lxs · 2025-07-03T18:02:31Z

The implementation looks solid overall. However, I noticed that it doesn't currently split very large paragraphs in markdown files. As a result, large chunks are saved to Qdrant, which could lead to excessively long results when using the codebase_search tool.

It's worth noting the current chunking behavior for different file types:

Regular code files:
- Minimum chunk size: 100 characters
- Maximum chunk size: 1150 characters (with a 1.15 tolerance on the 1000-character target)
- Large code blocks are automatically chunked

Markdown files:
- Minimum chunk size: 100 characters
- Maximum: No enforced limit; sections are defined by the content between headers
- No chunking is applied to large markdown sections

Given this, we may want to either reuse the chunking logic that is already implemented for code files or update the markdown path so that it splits large paragraphs.
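
As a rough illustration of what applying the same size limits to a markdown section could look like, here is a minimal line-based splitter. The constants mirror the numbers quoted above (1000-character target, 1.15 tolerance); the function name `splitSectionByLines` and the hard split for oversized lines are assumptions for this sketch, not the existing chunking code. It does not filter out chunks below the 100-character minimum.

```typescript
const TARGET_CHARS = 1000; // target chunk size quoted in the review
const TOLERANCE = 1.15; // tolerance factor quoted in the review
const MAX_CHUNK_CHARS = Math.round(TARGET_CHARS * TOLERANCE); // 1150

// Hypothetical helper: split one markdown section into chunks of at most maxChars.
function splitSectionByLines(section: string, maxChars: number = MAX_CHUNK_CHARS): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let currentLength = 0;

  // Emit the accumulated lines as one chunk and reset the accumulator.
  const flush = (): void => {
    if (current.length > 0) {
      chunks.push(current.join("\n"));
      current = [];
      currentLength = 0;
    }
  };

  for (const line of section.split("\n")) {
    // A single line longer than the limit is cut into fixed-size pieces.
    if (line.length > maxChars) {
      flush();
      for (let i = 0; i < line.length; i += maxChars) {
        chunks.push(line.slice(i, i + maxChars));
      }
      continue;
    }
    // Account for the newline that joins this line to the previous one.
    const separator = current.length > 0 ? 1 : 0;
    if (currentLength + separator + line.length > maxChars) flush();
    currentLength += (current.length > 0 ? 1 : 0) + line.length;
    current.push(line);
  }
  flush();
  return chunks;
}
```

Splitting on line boundaries keeps paragraphs and list items intact where possible; the hard split only applies to lines that exceed the limit on their own.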

adamhill · 2025-07-03T18:38:33Z

tree-sitter supports all kinds of nodes for Markdown, from sections to code to lists and list items, thematic breaks, tables, rows and cells. (ref. https://github.com/tree-sitter-grammars/tree-sitter-markdown)

Just breaking up by section, lists and code blocks (inline code examples) might get us a long way to good enough @MuriloFP

daniel-lxs · 2025-07-03T21:03:48Z

src/services/code-index/processors/__tests__/parser.spec.ts

I noticed some test overlap between this file and src/services/tree-sitter/__tests__/markdownIntegration.spec.ts. Both test:

Header parsing (basic and mixed styles)

Files without headers

Minimum section length handling

The overlap isn't necessarily a problem since they're testing different layers (indexing vs tree-sitter), but it might be cleaner to focus each suite on its core purpose:

This file: code indexing features like hash generation, fallback chunking, and MIN_BLOCK_CHARS logic

markdownIntegration.spec.ts: integration-level behavior

This would reduce redundancy and make the intent of each suite clearer. What do you think?

…duplication issue (RooCodeInc#4660)

- Modified parseMarkdownContent to chunk large sections (>1150 chars)
- Added support for chunking header-less markdown files
- Fixed _chunkTextByLines to handle oversized lines properly
- Added defensive check for parseMarkdown returning undefined
- Fixed Qdrant ID generation to use segmentHash instead of file:line
  - This was the root cause: chunks were being deduplicated
  - Each chunk now gets a unique ID even from the same line
- Added comprehensive tests for all edge cases
- Ensures all markdown content is properly indexed in Qdrant

…m/MuriloFP/Roo-Code into feat/issue-4660-markdown-indexing

MuriloFP · 2025-07-04T01:33:21Z

Update: Fixed Qdrant Deduplication Issue

I've pushed a critical fix that addresses the root cause of the markdown chunking problem.

The Issue

While the parser was correctly creating multiple chunks for large markdown sections, these chunks were being deduplicated in Qdrant due to identical point IDs being generated for chunks from the same line.

The Fix

- Modified scanner.ts to use segmentHash instead of filePath:lineNumber for generating unique Qdrant point IDs
- This ensures each chunk gets a unique ID and is properly stored in the vector database
- Added comprehensive tests to prevent regression
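
To illustrate why the location-based ID caused chunks to overwrite each other, here is a small self-contained sketch. It is not the scanner.ts code: the `Segment` shape, the in-memory `Map` standing in for Qdrant upserts, and the FNV-1a hash standing in for whatever hashing scanner.ts actually uses are all assumptions.

```typescript
// Hypothetical segment shape for illustration only.
interface Segment {
  filePath: string;
  startLine: number;
  content: string;
}

// FNV-1a (32-bit) as a stand-in hash; the real segmentHash algorithm may differ.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Old approach: the ID depends only on where the chunk starts.
const idFromLocation = (s: Segment): string => fnv1a(`${s.filePath}:${s.startLine}`);

// New approach: the ID also depends on the chunk content (a segment hash).
const idFromSegment = (s: Segment): string => fnv1a(`${s.filePath}:${s.startLine}:${s.content}`);

// Simulates upsert semantics: writing an existing ID replaces the stored point.
function upsertAll(segments: Segment[], makeId: (s: Segment) => string): Map<string, Segment> {
  const store = new Map<string, Segment>();
  for (const segment of segments) store.set(makeId(segment), segment);
  return store;
}

// Two chunks produced from the same oversized line share filePath and startLine.
const segments: Segment[] = [
  { filePath: "docs/guide.md", startLine: 10, content: "chunk A" },
  { filePath: "docs/guide.md", startLine: 10, content: "chunk B" },
];

const byLocation = upsertAll(segments, idFromLocation);
const bySegment = upsertAll(segments, idFromSegment);
console.log(byLocation.size, bySegment.size);
```

With location-based IDs the second chunk replaces the first, so only one point survives; with content-aware IDs both chunks are stored.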

Changes Made

parser.ts: Enhanced markdown chunking logic with proper handling of headers, header-less files, and oversized lines
scanner.ts: Fixed Qdrant point ID generation to prevent deduplication
parser.spec.ts: Added extensive unit tests covering all edge cases

The fix has been verified to work correctly - large markdown files are now properly chunked and all chunks are searchable in the codebase index.

As identified in PR review, the supported-extensions.spec.ts file only tests the contents of an array, which is already implicitly covered by the functional tests in parser.spec.ts. Removing this reduces maintenance overhead without sacrificing test quality.

- Remove redundant test 'should handle large markdown documentation folders efficiently' that only verified the scanner could iterate over mocked files
- Add test to verify unique point IDs are generated for each block from the same file, ensuring the segmentHash-based ID generation prevents collisions

- Add segmentHash to payload in scanner.ts to fix vector point ID generation
- Split parser.spec.ts tests into focused unit tests (mocked dependencies)
- Move integration tests to new markdownIntegration.spec.ts file
- Each test suite now has clear, distinct responsibilities
- Fixes issue RooCodeInc#4660: vector point ID collisions for large Markdown files

…segment hashing

daniel-lxs

LGTM

feat: add markdown support to codebase indexing (RooCodeInc#4660)

28745d1

MuriloFP requested review from cte, jr and mrubens as code owners July 3, 2025 17:10

github-project-automation bot added this to Roo Code Roadmap and Roo Code Roadmap Jul 3, 2025

github-project-automation bot moved this to Triage in Roo Code Roadmap Jul 3, 2025

github-project-automation bot moved this to New in Roo Code Roadmap Jul 3, 2025

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 3, 2025

hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 3, 2025

mrubens approved these changes Jul 3, 2025

daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jul 3, 2025

dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 3, 2025

hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 3, 2025

daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Jul 3, 2025

hannesrudolph added PR - Changes Requested and removed PR - Needs Preliminary Review labels Jul 3, 2025

Merge branch 'RooCodeInc:main' into feat/issue-4660-markdown-indexing

d3d1d75

daniel-lxs reviewed Jul 3, 2025

MuriloFP and others added 3 commits July 3, 2025 22:26

Merge branch 'RooCodeInc:main' into feat/issue-4660-markdown-indexing

a72f407

Merge branch 'feat/issue-4660-markdown-indexing' of https://github.co…

fae4460

…m/MuriloFP/Roo-Code into feat/issue-4660-markdown-indexing

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 4, 2025

MuriloFP marked this pull request as draft July 4, 2025 01:34

MuriloFP added 3 commits July 3, 2025 22:39

MuriloFP force-pushed the feat/issue-4660-markdown-indexing branch from 117697a to 0768c5e Compare July 4, 2025 02:33

MuriloFP marked this pull request as ready for review July 4, 2025 02:35

daniel-lxs moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 4, 2025

hannesrudolph added PR - Needs Preliminary Review and removed PR - Changes Requested labels Jul 4, 2025

daniel-lxs added 2 commits July 4, 2025 12:50

refactor: move redundant tests

4eaf223

feat: enhance markdown processing with consistent chunking logic and …

5e74d28

…segment hashing

daniel-lxs approved these changes Jul 4, 2025

mrubens approved these changes Jul 4, 2025

mrubens merged commit a83e8c0 into RooCodeInc:main Jul 4, 2025
11 checks passed

github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 4, 2025

github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jul 4, 2025
