@roomote
Contributor

@roomote roomote bot commented Sep 12, 2025

This PR attempts to address Issue #7940 by implementing deterministic naming for Qdrant collections.

Problem

RooCode's Codebase Indexing currently creates Qdrant collections based on absolute file paths, causing:

  • Different collections for Git worktrees (each worktree has a different path)
  • Inability to share indexes between developers working on the same repository
  • Duplicated indexing work and storage

Solution

This PR implements a three-tier naming strategy for Qdrant collections:

  1. Custom collection names via .roo/codebase-index.json configuration file
  2. Git repository-based naming using the remote URL for deterministic naming across worktrees and developers
  3. Workspace path fallback for non-git repositories (original behavior)
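
The three tiers above can be read as a simple precedence chain. The following is a minimal sketch of that chain, not the PR's actual code: `NamingInputs`, `hashToName`, `resolveCollectionName`, and the `ws-` prefix with a 16-character SHA-256 digest are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto"

// Hypothetical inputs: each tier is resolved elsewhere and passed in.
interface NamingInputs {
	customName?: string // tier 1: value read from .roo/codebase-index.json, if any
	gitRemoteUrl?: string // tier 2: normalized remote URL, if the workspace is a git repo
	workspacePath: string // tier 3: absolute workspace path, always available
}

// Assumed scheme: "ws-" plus the first 16 hex characters of a SHA-256 digest.
function hashToName(input: string): string {
	const hash = createHash("sha256").update(input).digest("hex")
	return `ws-${hash.substring(0, 16)}`
}

// Walks the tiers in order and returns the first name that applies.
function resolveCollectionName(inputs: NamingInputs): string {
	const custom = inputs.customName?.trim()
	if (custom) return custom
	if (inputs.gitRemoteUrl) return hashToName(inputs.gitRemoteUrl)
	return hashToName(inputs.workspacePath)
}
```

Under this sketch, two worktrees of the same repository resolve to the same name because the workspace path is only consulted when no custom name or remote URL is available.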

Key Changes

  • Added support for custom collection names through .roo/codebase-index.json
  • Modified QdrantVectorStore to use git repository URL for deterministic collection naming
  • Implemented URL normalization to ensure consistent hashing across different git URL formats (SSH, HTTPS, etc.)
  • Added comprehensive test coverage with 11 new test cases
  • Improved error handling and added graceful fallbacks
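
To illustrate the URL normalization mentioned above, here is a rough sketch of one way to map different git URL formats to a single string before hashing. It is not the PR's `normalizeGitUrl` implementation; the rules below (strip scheme, credentials, trailing slash and `.git`, then lowercase everything) are assumptions, and lowercasing the path in particular is a simplification that may not suit hosts with case-sensitive repository paths.

```typescript
// Hypothetical normalizer: maps SSH, HTTPS, and scp-style git URLs to "host/path".
function normalizeGitUrl(url: string): string {
	let s = url.trim()
	// scp-style syntax, e.g. git@github.com:org/repo.git
	const scp = s.match(/^[^@\/\s]+@([^:\/\s]+):(.+)$/)
	if (scp) {
		s = `${scp[1]}/${scp[2]}`
	} else {
		s = s.replace(/^[a-z][a-z0-9+.-]*:\/\//i, "") // strip the scheme (https://, ssh://, ...)
		s = s.replace(/^[^@\/]+@/, "") // strip user[:password]@ so credentials never reach the hash input
	}
	s = s.replace(/\/+$/, "").replace(/\.git$/i, "") // strip trailing slashes and a .git suffix
	return s.toLowerCase()
}
```

With these rules, `git@github.com:Org/Repo.git`, `https://user:token@github.com/org/repo.git`, and `ssh://git@github.com/org/repo` all normalize to `github.com/org/repo`.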

Benefits

  • ✅ Same collection name across Git worktrees
  • ✅ Consistent indexing between developers working on the same repository
  • ✅ Reduced storage usage and indexing time, since the same repository is no longer indexed into multiple collections
  • ✅ Backward compatible with existing workspace-based collections
  • ✅ Flexible configuration options for custom naming

Testing

  • All existing tests pass
  • Added 11 test cases covering the new naming logic
  • Tested with different git URL formats and edge cases

Fixes #7940

Feedback and guidance are welcome!


Important

Introduces deterministic naming for Qdrant collections using custom names, git URLs, and workspace paths, with comprehensive testing and improved error handling.

  • Behavior:
    • Implements deterministic naming for Qdrant collections in qdrant-client.ts using a three-tier strategy: custom names from .roo/codebase-index.json, git repository URLs, and workspace paths.
    • Normalizes git URLs for consistent hashing across formats (SSH, HTTPS).
    • Adds error handling and fallbacks for non-git repositories.
  • Testing:
    • Adds 11 new test cases in qdrant-client.spec.ts to cover new naming logic and edge cases.
    • Tests include scenarios for custom names, git URL normalization, and workspace path fallbacks.
  • Misc:
    • Updates QdrantVectorStore to handle new naming logic and improve error handling.

This description was created by Ellipsis for 426e703. It will automatically update as commits are pushed.

- Add support for custom collection names via .roo/codebase-index.json
- Use git repository URL for deterministic collection naming across worktrees
- Fall back to workspace path hash when no git repo is available
- Normalize git URLs for consistent hashing
- Add comprehensive tests for new naming strategies

Fixes #7940

- Add better error handling in normalizeGitUrl method
- Remove unused import of getGitRepositoryInfo
- Improve URL credential removal logic
@roomote roomote bot requested review from cte, jr and mrubens as code owners September 12, 2025 22:06
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels Sep 12, 2025
Contributor Author

@roomote roomote bot left a comment

Reviewing my own code because apparently I trust no one, not even myself.

private normalizeGitUrl(url: string): string {
    try {
        // Remove credentials from HTTPS URLs
        let normalized = url
Contributor Author

The credential removal logic here is good, but consider adding a comment explaining why this is important for security (preventing credential leakage in collection names). This helps future maintainers understand the security implications.

try {
    const configPath = path.join(workspacePath, ".roo", "codebase-index.json")
    if (fs.existsSync(configPath)) {
        const config = JSON.parse(fs.readFileSync(configPath, "utf8"))
Contributor Author

Currently this silently falls back when JSON parsing fails. Should we consider logging malformed JSON at a higher level than console.warn to help users debug configuration issues? Or perhaps add validation for the expected schema?
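
One possible shape for the validation suggested here is sketched below. It is only an illustration: the `collectionName` field, the allowed character set, and the `ConfigResult` type are assumptions, since the config schema is not shown in this thread. The point is that the caller gets a reason string it can surface to the user instead of a silent fallback.

```typescript
// Hypothetical result type: either a valid config or a reason the caller can report.
type ConfigResult = { ok: true; collectionName?: string } | { ok: false; reason: string }

// Parses the text of .roo/codebase-index.json and checks an assumed schema.
function parseIndexConfig(raw: string): ConfigResult {
	let parsed: unknown
	try {
		parsed = JSON.parse(raw)
	} catch (error) {
		return { ok: false, reason: `invalid JSON: ${(error as Error).message}` }
	}
	if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
		return { ok: false, reason: "expected a JSON object at the top level" }
	}
	// Assumed field name; the real config key may differ.
	const name = (parsed as Record<string, unknown>).collectionName
	if (name === undefined) return { ok: true }
	// Assumed constraint: letters, digits, "_" and "-" only.
	if (typeof name !== "string" || !/^[A-Za-z0-9_-]+$/.test(name)) {
		return { ok: false, reason: "collectionName must be a non-empty string of letters, digits, '_' or '-'" }
	}
	return { ok: true, collectionName: name }
}
```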

const configContent = fs.readFileSync(configPath, "utf8")

// Extract remote URL
const urlMatch = configContent.match(/url\s*=\s*(.+?)(?:\r?\n|$)/m)
Contributor Author

This regex pattern might not handle all edge cases in git config files (e.g., URLs with spaces or special characters). Have you considered using a more robust git config parser library, or should we add more comprehensive error handling for edge cases?
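
As a middle ground between the single regex and a full parser library, a small section-aware scan could look roughly like the sketch below. `getRemoteUrl` is a hypothetical helper, not code from this PR. It only returns a `url` key found inside the requested `[remote "..."]` section, so it skips keys such as `pushurl` and sections such as `[url "..."]`. It does not handle quoted values, line continuations, or `include` directives.

```typescript
// Returns the url of the named remote from the text of a .git/config file, if present.
function getRemoteUrl(configText: string, remote = "origin"): string | undefined {
	let inTarget = false
	for (const rawLine of configText.split(/\r?\n/)) {
		const line = rawLine.trim()
		if (line === "" || line.startsWith("#") || line.startsWith(";")) continue
		// Section header, e.g. [core] or [remote "origin"]
		const section = line.match(/^\[\s*([^\s"\]]+)(?:\s+"((?:[^"\\]|\\.)*)")?\s*\]$/)
		if (section) {
			inTarget = section[1]?.toLowerCase() === "remote" && section[2] === remote
			continue
		}
		if (!inTarget) continue
		// Anchored at the start so that "pushurl = ..." does not match.
		const kv = line.match(/^url\s*=\s*(.*)$/i)
		if (kv) return (kv[1] ?? "").trim()
	}
	return undefined
}
```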

* @param workspacePath Path to the workspace
* @returns Collection name
*/
private generateCollectionName(workspacePath: string): string {
Contributor Author

Consider adding JSDoc comments to these new private methods explaining the fallback hierarchy (custom name → git URL → workspace path). This would make the code more self-documenting for future contributors.

expect((vectorStore as any).vectorSize).toBe(mockVectorSize)
})

it("should use custom collection name from .roo/codebase-index.json if available", () => {
Contributor Author

Great test coverage! Consider adding a few more edge case tests:

  • Malformed JSON in .roo/codebase-index.json
  • Git repositories with multiple remotes
  • Collection name length edge cases (approaching the 255 character limit)

These would help ensure robustness across different user configurations.
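
For the length edge case, one option is to clamp over-long names while keeping them distinct. The sketch below is hypothetical (`clampCollectionName` is not part of the PR) and takes the 255-character limit from the comment above as given rather than as a verified Qdrant constraint. It assumes `maxLength` is comfortably larger than the 9 characters used by the separator and hash suffix.

```typescript
import { createHash } from "node:crypto"

// Limit quoted in the review comment; treated here as an assumption.
const MAX_NAME_LENGTH = 255

// Truncates an over-long name and appends a short hash of the full name,
// so that two long names sharing a prefix are unlikely to collide.
function clampCollectionName(name: string, maxLength: number = MAX_NAME_LENGTH): string {
	if (name.length <= maxLength) return name
	const suffix = createHash("sha256").update(name).digest("hex").substring(0, 8)
	return `${name.substring(0, maxLength - suffix.length - 1)}-${suffix}`
}
```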

private loadCustomCollectionName(workspacePath: string): string | undefined {
    try {
        const configPath = path.join(workspacePath, ".roo", "codebase-index.json")
        if (fs.existsSync(configPath)) {
Contributor Author

Minor performance note: These synchronous file operations could potentially block the event loop. Since this only happens during initialization it's probably fine, but worth noting for future optimization if initialization performance becomes a concern.
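
If that optimization is ever needed, a non-blocking variant could look roughly like this. It is a sketch only: it assumes the config stores the name under a `collectionName` key, and it would require the caller to await the result, which the current synchronous constructor path may not allow.

```typescript
import { readFile } from "node:fs/promises"
import * as path from "node:path"

// Hypothetical async counterpart to loadCustomCollectionName.
// Resolves to undefined when the file is missing, unreadable, or malformed.
async function loadCustomCollectionNameAsync(workspacePath: string): Promise<string | undefined> {
	const configPath = path.join(workspacePath, ".roo", "codebase-index.json")
	try {
		const config = JSON.parse(await readFile(configPath, "utf8"))
		// Assumed field name; only accept a string value.
		return typeof config?.collectionName === "string" ? config.collectionName : undefined
	} catch {
		return undefined
	}
}
```

Reading the file directly and handling the failure also avoids the separate `existsSync` check.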

@hannesrudolph hannesrudolph added the Issue/PR - Triage label (New issue. Needs quick review to confirm validity and assign labels.) Sep 12, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Sep 15, 2025
@hannesrudolph hannesrudolph added the PR - Needs Preliminary Review label and removed the Issue/PR - Triage label (New issue. Needs quick review to confirm validity and assign labels.) Sep 15, 2025
@daniel-lxs
Member

A possible solution is to let the user create a .rooid file, either in the workspace root or inside the .roo folder. The file would just contain one line: the collection ID.

We’d need to confirm a few things first, like how to handle simultaneous writes to the index file, so it doesn’t break.
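
A rough sketch of how such a file might be read, under stated assumptions: the file is named `.rooid` in both locations, the workspace root is checked before the `.roo` folder, and only the first line is used. `parseRooId` and `readRooId` are hypothetical names, and the sketch does not address the concurrent-write question.

```typescript
import { existsSync, readFileSync } from "node:fs"
import * as path from "node:path"

// Extracts the collection ID from the file content: the first line, trimmed.
function parseRooId(content: string): string | undefined {
	const firstLine = (content.split(/\r?\n/)[0] ?? "").trim()
	return firstLine.length > 0 ? firstLine : undefined
}

// Looks for the ID file in the assumed locations and returns the first usable ID.
function readRooId(workspacePath: string): string | undefined {
	const candidates = [path.join(workspacePath, ".rooid"), path.join(workspacePath, ".roo", ".rooid")]
	for (const candidate of candidates) {
		if (!existsSync(candidate)) continue
		try {
			const id = parseRooId(readFileSync(candidate, "utf8"))
			if (id) return id
		} catch {
			// Unreadable file: try the next candidate.
		}
	}
	return undefined
}
```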

@daniel-lxs daniel-lxs closed this Sep 17, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Sep 17, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Sep 17, 2025

Labels

enhancement (New feature or request), PR - Needs Preliminary Review, size:L (This PR changes 100-499 lines, ignoring generated files.)

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Deterministic naming for Codebase Indexing Qdrant collections

4 participants