@hannesrudolph hannesrudolph commented Jun 23, 2025

Description

Fixes #5027

This PR adds proper support for the nomic-embed-code model in the semantic code indexing feature by implementing model-specific configurations for score thresholds and query prefixes.

Changes Made

Testing

  • All existing tests pass
  • Linting passes with no warnings
  • TypeScript compilation succeeds
  • Manual testing completed:
    • Verified nomic-embed-code model uses 0.15 score threshold instead of hardcoded 0.4
    • Confirmed query prefix is automatically applied for nomic-embed-code embeddings
    • Tested backward compatibility with existing models (no prefix applied when not specified)

Verification of Acceptance Criteria

  • Criterion 1: nomic-embed-code model now uses appropriate 0.15 score threshold instead of failing with 0.4
  • Criterion 2: Required query prefix "Represent this query for searching relevant code: " is automatically applied for nomic-embed-code
  • Criterion 3: Semantic search functionality works correctly with nomic-embed-code model
  • Criterion 4: Backward compatibility maintained for all existing embedding models

Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • No breaking changes introduced
  • Maintains backward compatibility
  • All lint and type checks pass

Technical Details

The solution implements a per-model configuration system that allows each embedding model to specify:

  • Custom score threshold: Optimized for the model's embedding space characteristics
  • Query prefix requirement: Essential for instruction-tuned models like nomic-embed-code

This approach ensures optimal search performance while maintaining full backward compatibility with existing models that don't require these configurations.
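
A minimal sketch of what such a per-model profile table could look like. The field names `scoreThreshold` and `queryPrefix` and the nomic-embed-code values come from this PR; the `profiles` constant and the exact type shapes are assumptions for illustration and may not match `src/shared/embeddingModels.ts`.

```typescript
// Hypothetical sketch; the real definitions in the repository may differ.
type EmbedderProvider = "openai" | "ollama" | "openai-compatible"

interface EmbeddingModelProfile {
	dimension: number
	// Minimum similarity score for a search hit. Optional so existing entries stay valid.
	scoreThreshold?: number
	// Text prepended to inputs for instruction-tuned models. Omitted when not needed.
	queryPrefix?: string
}

// `profiles` is an illustrative name, not the identifier used in the PR.
const profiles: Record<EmbedderProvider, Record<string, EmbeddingModelProfile>> = {
	openai: {
		"text-embedding-3-small": { dimension: 1536, scoreThreshold: 0.4 },
	},
	ollama: {
		"nomic-embed-text": { dimension: 768, scoreThreshold: 0.4 },
		"nomic-embed-code": {
			dimension: 3584,
			scoreThreshold: 0.15,
			queryPrefix: "Represent this query for searching relevant code: ",
		},
	},
	"openai-compatible": {},
}
```

Because both new fields are optional, a model entry that only declares `dimension` keeps working unchanged.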


Important

Adds model-specific configurations for nomic-embed-code with score thresholds and query prefixes, updating embedders and configuration management.

  • Behavior:
    • Adds support for nomic-embed-code model with dimension 3584, score threshold 0.15, and query prefix in embeddingModels.ts.
    • Updates config-manager.ts to use getModelScoreThreshold() for dynamic score threshold retrieval.
    • Modifies embedders in openai.ts, ollama.ts, and openai-compatible.ts to apply query prefixes.
  • Interfaces:
    • Extends EmbeddingModelProfile with scoreThreshold and queryPrefix.
    • Makes searchMinScore required in CodeIndexConfig.
  • Utilities:
    • Adds getModelScoreThreshold() and getModelQueryPrefix() in embeddingModels.ts.
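
The two utilities can be sketched as plain lookups that return `undefined` for unknown providers or models. The function names and the `getModelScoreThreshold` signature follow this PR; the table name `EMBEDDING_MODEL_PROFILES` and its contents here are assumptions for illustration.

```typescript
// Hypothetical sketch; only the function names and return types are taken from the PR.
type EmbedderProvider = "openai" | "ollama" | "openai-compatible"

interface EmbeddingModelProfile {
	dimension: number
	scoreThreshold?: number
	queryPrefix?: string
}

// Assumed table name and a reduced set of entries.
const EMBEDDING_MODEL_PROFILES: Partial<Record<EmbedderProvider, Record<string, EmbeddingModelProfile>>> = {
	ollama: {
		"nomic-embed-text": { dimension: 768, scoreThreshold: 0.4 },
		"nomic-embed-code": {
			dimension: 3584,
			scoreThreshold: 0.15,
			queryPrefix: "Represent this query for searching relevant code: ",
		},
	},
}

function getModelScoreThreshold(provider: EmbedderProvider, modelId: string): number | undefined {
	// Optional chaining keeps unknown providers and models from throwing.
	return EMBEDDING_MODEL_PROFILES[provider]?.[modelId]?.scoreThreshold
}

function getModelQueryPrefix(provider: EmbedderProvider, modelId: string): string | undefined {
	return EMBEDDING_MODEL_PROFILES[provider]?.[modelId]?.queryPrefix
}
```

Returning `undefined` rather than a default leaves the fallback decision to the caller.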

This description was created by Ellipsis for e03c03e. You can customize this summary. It will automatically update as commits are pushed.

…s and query prefixes (#5027)

- Add scoreThreshold and queryPrefix properties to embedding model profiles
- Implement nomic-embed-code model with 0.15 threshold and required query prefix
- Update config manager to use model-specific score thresholds dynamically
- Modify all embedders to apply query prefixes when required
- Maintain backward compatibility for existing models
- Fix search functionality for nomic-embed-code embeddings
Copilot AI review requested due to automatic review settings June 23, 2025 11:21
@hannesrudolph hannesrudolph requested review from cte, jr and mrubens as code owners June 23, 2025 11:21
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 23, 2025
Copilot AI left a comment

Pull Request Overview

This PR introduces model-specific configurations for semantic search, enabling per-model score thresholds and query prefixes—primarily adding support for the nomic-embed-code model.

  • Extended EmbeddingModelProfile with scoreThreshold and queryPrefix, added nomic-embed-code entries
  • Added getModelScoreThreshold/getModelQueryPrefix utilities and updated embedders to apply prefixes
  • Updated CodeIndexConfigManager to use dynamic search minimum score via currentSearchMinScore
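
One way the dynamic `currentSearchMinScore` lookup might work is sketched below. `ConfigManagerSketch`, the `thresholds` table, and `FALLBACK_SEARCH_MIN_SCORE` are hypothetical names; the 0.4 fallback is the previously hardcoded value mentioned in the PR description, and the actual `CodeIndexConfigManager` may resolve the value differently.

```typescript
// Hypothetical sketch, not the actual CodeIndexConfigManager implementation.
const FALLBACK_SEARCH_MIN_SCORE = 0.4 // previously hardcoded value, per the PR description

// Assumed stand-in for the per-model profile lookup.
const thresholds: Record<string, number | undefined> = {
	"nomic-embed-code": 0.15,
	"nomic-embed-text": 0.4,
}

function getModelScoreThreshold(_provider: string, modelId: string): number | undefined {
	return thresholds[modelId]
}

class ConfigManagerSketch {
	private readonly provider: string
	private readonly modelId: string

	constructor(provider: string, modelId: string) {
		this.provider = provider
		this.modelId = modelId
	}

	// Use the model-specific threshold when one is defined, otherwise the old default.
	get currentSearchMinScore(): number {
		return getModelScoreThreshold(this.provider, this.modelId) ?? FALLBACK_SEARCH_MIN_SCORE
	}
}
```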

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/shared/embeddingModels.ts | Added scoreThreshold/queryPrefix, configured nomic-embed-code, and utility functions |
| src/services/code-index/interfaces/config.ts | Made searchMinScore required |
| src/services/code-index/embedders/openai.ts | Imported and applied getModelQueryPrefix |
| src/services/code-index/embedders/openai-compatible.ts | Imported and applied getModelQueryPrefix |
| src/services/code-index/embedders/ollama.ts | Imported and applied getModelQueryPrefix |
| src/services/code-index/config-manager.ts | Switched to dynamic currentSearchMinScore via getModelScoreThreshold |
Comments suppressed due to low confidence (2)

src/shared/embeddingModels.ts:80

  • Add unit tests for getModelScoreThreshold and getModelQueryPrefix to validate behavior across providers and models, including edge cases and the new nomic-embed-code configuration.
export function getModelScoreThreshold(provider: EmbedderProvider, modelId: string): number | undefined {

src/services/code-index/embedders/openai.ts:42

  • Add tests for createEmbeddings to ensure that the queryPrefix is correctly prepended when returned by getModelQueryPrefix, and that texts remain unmodified when no prefix is provided.
		const queryPrefix = getModelQueryPrefix("openai", modelToUse)
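
A rough sketch of the behavior the suggested tests would cover: prepend the prefix when the model defines one, and leave the texts unmodified otherwise. `applyQueryPrefix` is a hypothetical helper name; the embedders in this PR may inline this logic inside `createEmbeddings` instead.

```typescript
// Hypothetical helper; not a function from the PR.
function applyQueryPrefix(texts: string[], queryPrefix: string | undefined): string[] {
	// No prefix configured: return the inputs unchanged for backward compatibility.
	if (!queryPrefix) {
		return texts
	}
	return texts.map((text) => `${queryPrefix}${text}`)
}

const prefixed = applyQueryPrefix(
	["find the config loader"],
	"Represent this query for searching relevant code: ",
)
const untouched = applyQueryPrefix(["find the config loader"], undefined)
```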

Comment on lines +23 to +42
+ "text-embedding-3-small": { dimension: 1536, scoreThreshold: 0.4 },
+ "text-embedding-3-large": { dimension: 3072, scoreThreshold: 0.4 },
+ "text-embedding-ada-002": { dimension: 1536, scoreThreshold: 0.4 },
  },
  ollama: {
- "nomic-embed-text": { dimension: 768 },
- "mxbai-embed-large": { dimension: 1024 },
- "all-minilm": { dimension: 384 },
+ "nomic-embed-text": { dimension: 768, scoreThreshold: 0.4 },
+ "nomic-embed-code": {
+ dimension: 3584,
+ scoreThreshold: 0.15,
+ queryPrefix: "Represent this query for searching relevant code: ",
+ },
+ "mxbai-embed-large": { dimension: 1024, scoreThreshold: 0.4 },
+ "all-minilm": { dimension: 384, scoreThreshold: 0.4 },
  // Add default Ollama model if applicable, e.g.:
  // 'default': { dimension: 768 } // Assuming a default dimension
  },
  "openai-compatible": {
- "text-embedding-3-small": { dimension: 1536 },
- "text-embedding-3-large": { dimension: 3072 },
- "text-embedding-ada-002": { dimension: 1536 },
+ "text-embedding-3-small": { dimension: 1536, scoreThreshold: 0.4 },
+ "text-embedding-3-large": { dimension: 3072, scoreThreshold: 0.4 },
+ "text-embedding-ada-002": { dimension: 1536, scoreThreshold: 0.4 },
Copilot AI Jun 23, 2025

Consider extracting the repeated default scoreThreshold value (0.4) into a named constant (e.g., DEFAULT_SCORE_THRESHOLD) to avoid duplication and ease future updates.
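
A sketch of what that extraction could look like. `DEFAULT_SCORE_THRESHOLD` is the name proposed in the comment; the table name `ollamaProfiles` and the reduced set of entries are assumptions for illustration.

```typescript
// Hypothetical sketch of the suggested refactor.
const DEFAULT_SCORE_THRESHOLD = 0.4

interface EmbeddingModelProfile {
	dimension: number
	scoreThreshold?: number
	queryPrefix?: string
}

// Illustrative subset of the Ollama entries from the diff.
const ollamaProfiles: Record<string, EmbeddingModelProfile> = {
	"nomic-embed-text": { dimension: 768, scoreThreshold: DEFAULT_SCORE_THRESHOLD },
	"mxbai-embed-large": { dimension: 1024, scoreThreshold: DEFAULT_SCORE_THRESHOLD },
	"all-minilm": { dimension: 384, scoreThreshold: DEFAULT_SCORE_THRESHOLD },
	// Only models that need a different value spell it out.
	"nomic-embed-code": { dimension: 3584, scoreThreshold: 0.15 },
}
```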

"nomic-embed-code": {
dimension: 3584,
scoreThreshold: 0.15,
queryPrefix: "Represent this query for searching relevant code: ",
Copilot AI Jun 23, 2025

Extract this query prefix literal into a constant (e.g., NOMIC_EMBED_CODE_PREFIX) to reduce duplication and improve readability.

Suggested change
- queryPrefix: "Represent this query for searching relevant code: ",
+ queryPrefix: NOMIC_EMBED_CODE_PREFIX,

daniel-lxs commented Jun 23, 2025

I think a better solution for the issue would be to expose the score threshold so users can modify it, rather than adding a default score to each model.

That would require some UI changes, but I think it is worth it and easier to maintain in the long run.

The issue also mentions adding text for the nomic-embed-code model for the OpenAI-compatible provider, but this PR also adds the model for the Ollama provider.

I think we can split the issue into 2 PRs:

  1. Add the score threshold as a setting.
  2. Add support for nomic-embed-code for all supported providers and include the required text for it.
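
For item 1, a rough sketch of how a user-facing threshold setting might be resolved. Everything here is an assumption for illustration: `resolveSearchMinScore` is a hypothetical name, the 0.4 fallback is the previously hardcoded value, and clamping to the range 0 to 1 is one possible validation choice, not something this thread specifies.

```typescript
// Hypothetical sketch of a user-configurable score threshold.
const DEFAULT_SEARCH_MIN_SCORE = 0.4 // previously hardcoded value

function resolveSearchMinScore(userSetting: number | undefined): number {
	// Ignore missing or non-numeric values and fall back to the default.
	if (userSetting === undefined || !Number.isFinite(userSetting)) {
		return DEFAULT_SEARCH_MIN_SCORE
	}
	// Assumed validation: keep the value within [0, 1].
	return Math.min(1, Math.max(0, userSetting))
}
```

With this shape, a user running nomic-embed-code could set 0.15 themselves, and no per-model default would need to be maintained.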

I'll be closing this PR but it can be used as a base for the 2 required PRs to solve the issue.

@daniel-lxs daniel-lxs closed this Jun 23, 2025
@github-project-automation github-project-automation bot moved this from Triage to Done in Roo Code Roadmap Jun 23, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jun 23, 2025
Development

Successfully merging this pull request may close these issues.

nomic-embed-code for code indexing feature.

3 participants