Skip to content

Conversation

@MuriloFP
Copy link
Contributor

@MuriloFP MuriloFP commented Jul 3, 2025

Related GitHub Issue

Closes: #4660

Roo Code Task Context (Optional)

This PR was created using Roo Code's issue-fixer-or orchestration mode, which coordinated analysis, implementation, testing, and review across multiple specialized subtasks.

Description

This PR adds support for indexing Markdown files (.md and .markdown) in the codebase indexing feature. Previously, markdown files were explicitly excluded from indexing, preventing users from searching through documentation content.

Key Implementation Details:

  • Extension Filter: Removed the explicit markdown filter in supported-extensions.ts to include .md and .markdown files in scannerExtensions
  • Parser Integration: Added markdown file detection and processing in parser.ts by leveraging the existing parseMarkdown() function from the tree-sitter service
  • Semantic Chunking: Implemented intelligent section chunking based on markdown headers with header level classification (markdown_header_h1, markdown_header_h2, etc.) for improved semantic search quality
  • Fallback Logic: Added robust fallback chunking for headerless markdown files to ensure all content is indexable

Reviewers should pay attention to:

  • The seamless integration with the existing tree-sitter pipeline
  • The comprehensive test coverage including edge cases
  • The header level extraction logic that enhances search semantics

Test Procedure

Unit Testing:

cd src
npx vitest services/code-index/shared/__tests__/supported-extensions.spec.ts
npx vitest services/code-index/processors/__tests__/parser.spec.ts  
npx vitest services/code-index/processors/__tests__/scanner.spec.ts

Integration Testing:

cd src
npx vitest services/code-index/

Manual Testing Steps:

  • Create a directory with mixed content (.ts, .js, .md files)
  • Run the codebase indexing on this directory
  • Verify that markdown files are now processed and indexed
  • Test semantic search on markdown content to ensure headers and sections are properly chunked
  • Test Coverage Verification:

✅ Markdown file detection (.md and .markdown extensions)
✅ Header extraction and section chunking
✅ Minimum size requirements enforcement
✅ Fallback chunking for headerless markdown
✅ Unique segment hash generation
✅ Header level classification
✅ Edge cases (empty files, malformed content, mixed content)

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable).
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

N/A - This is a backend feature enhancement with no UI changes.

Documentation Updates

[x] No documentation updates are required.
The feature enhancement is transparent to users - markdown files will now automatically be included in codebase indexing without requiring any configuration changes.

Additional Notes

Performance Impact: Testing shows no significant performance degradation. Markdown parsing is lightweight and the existing tree-sitter infrastructure handles the additional file types efficiently.

Backward Compatibility: All existing functionality remains unchanged. This is a pure feature addition with no breaking changes.

Future Enhancements: This implementation provides a foundation for potentially adding other documentation formats (e.g., .rst, .adoc) in the future using the same pattern.

Get in Touch

Discord: @MuriloFP


Important

Adds Markdown support to codebase indexing, including parsing and chunking logic, with comprehensive tests.

  • Behavior:
    • Adds support for indexing .md and .markdown files by removing exclusion in supported-extensions.ts.
    • Integrates Markdown parsing in parser.ts using parseMarkdown().
    • Implements chunking based on headers and fallback logic for headerless files.
  • Tests:
    • Adds tests in parser.spec.ts for Markdown chunking, header extraction, and edge cases.
    • Adds tests in scanner.spec.ts to ensure Markdown files are processed alongside code files.
  • Misc:
    • Uses segmentHash for unique ID generation in scanner.ts to handle multiple segments from the same file.

This description was created by Ellipsis for 0768c5e. You can customize this summary. It will automatically update as commits are pushed.

@MuriloFP MuriloFP requested review from cte, jr and mrubens as code owners July 3, 2025 17:10
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 3, 2025
@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 3, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jul 3, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 3, 2025
@hannesrudolph hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 3, 2025
@daniel-lxs
Copy link
Member

The implementation looks solid overall. However, I noticed that it doesn't currently handle splitting very large paragraphs in markdown files. This results in large chunks being saved to Qdrant, which could lead to excessively long results when using the codebase_search tool:

image

It's worth noting the current chunking behavior for different file types:

  • Regular code files

    • Minimum chunk size: 100 characters
    • Maximum chunk size: 1150 characters (with a 1.15 tolerance on the 1000-character target)
    • Large code blocks are automatically chunked
  • Markdown files

    • Minimum chunk size: 100 characters
    • Maximum: No enforced limit; sections are defined by the content between headers
    • No chunking is applied to large markdown sections

Given this, we may want to consider using the already implemented chunking logic or update the markdown one to split markdown paragraphs.

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Changes Requested] in Roo Code Roadmap Jul 3, 2025
@adamhill
Copy link
Contributor

adamhill commented Jul 3, 2025

tree-parser supports all kinds of nodes for Markdown from sections to code to lists and list items, thematic breaks, tabels, rows and cells. (ref. https://github.com/tree-sitter-grammars/tree-sitter-markdown)

Just breaking up by section, lists and code blocks (inline code examples) might get us a long way to good enough @MuriloFP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed some test overlap between this file and src/services/tree-sitter/__tests__/markdownIntegration.spec.ts. Both test:

  • Header parsing (basic and mixed styles)
  • Files without headers
  • Minimum section length handling

The overlap isn't necessarily a problem since they're testing different layers (indexing vs tree-sitter), but it might be cleaner to focus each suite on its core purpose:

  • This file: code indexing features like hash generation, fallback chunking, and MIN_BLOCK_CHARS logic
  • markdownIntegration.spec.ts: integration-level behavior

This would reduce redundancy and make the intent of each suite clearer. What do you think?

MuriloFP and others added 3 commits July 3, 2025 22:26
…duplication issue (RooCodeInc#4660)

- Modified parseMarkdownContent to chunk large sections (>1150 chars)
- Added support for chunking header-less markdown files
- Fixed _chunkTextByLines to handle oversized lines properly
- Added defensive check for parseMarkdown returning undefined
- Fixed Qdrant ID generation to use segmentHash instead of file:line
  - This was the root cause: chunks were being deduplicated
  - Each chunk now gets a unique ID even from the same line
- Added comprehensive tests for all edge cases
- Ensures all markdown content is properly indexed in Qdrant
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 4, 2025
@MuriloFP
Copy link
Contributor Author

MuriloFP commented Jul 4, 2025

Update: Fixed Qdrant Deduplication Issue

I've pushed a critical fix that addresses the root cause of the markdown chunking problem.

The Issue

While the parser was correctly creating multiple chunks for large markdown sections, these chunks were being deduplicated in Qdrant due to identical point IDs being generated for chunks from the same line.

The Fix

  • Modified scanner.ts to use segmentHash instead of ilePath:lineNumber for generating unique Qdrant point IDs
  • This ensures each chunk gets a unique ID and is properly stored in the vector database
  • Added comprehensive tests to prevent regression

Changes Made

  1. parser.ts: Enhanced markdown chunking logic with proper handling of headers, header-less files, and oversized lines
  2. scanner.ts: Fixed Qdrant point ID generation to prevent deduplication
  3. parser.spec.ts: Added extensive unit tests covering all edge cases

The fix has been verified to work correctly - large markdown files are now properly chunked and all chunks are searchable in the codebase index.

@MuriloFP MuriloFP marked this pull request as draft July 4, 2025 01:34
MuriloFP added 3 commits July 3, 2025 22:39
As identified in PR review, the supported-extensions.spec.ts file only tests
the contents of an array, which is already implicitly covered by the functional
tests in parser.spec.ts. Removing this reduces maintenance overhead without
sacrificing test quality.
- Remove redundant test 'should handle large markdown documentation folders efficiently'
  that only verified the scanner could iterate over mocked files
- Add test to verify unique point IDs are generated for each block from the same file,
  ensuring the segmentHash-based ID generation prevents collisions
- Add segmentHash to payload in scanner.ts to fix vector point ID generation
- Split parser.spec.ts tests into focused unit tests (mocked dependencies)
- Move integration tests to new markdownIntegration.spec.ts file
- Each test suite now has clear, distinct responsibilities
- Fixes issue RooCodeInc#4660: vector point ID collisions for large Markdown files
@MuriloFP MuriloFP force-pushed the feat/issue-4660-markdown-indexing branch from 117697a to 0768c5e Compare July 4, 2025 02:33
@MuriloFP MuriloFP marked this pull request as ready for review July 4, 2025 02:35
@daniel-lxs daniel-lxs moved this from PR [Changes Requested] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 4, 2025
Copy link
Member

@daniel-lxs daniel-lxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mrubens mrubens merged commit a83e8c0 into RooCodeInc:main Jul 4, 2025
11 checks passed
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 4, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jul 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request lgtm This PR has been approved by a maintainer PR - Needs Preliminary Review size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Codebase Indexing Fails to Scan and Index Directory with Markdown Documentation Files

5 participants