@roomote roomote bot commented Jul 31, 2025

Summary

This PR improves the error handling for Ollama embeddings to address the issue where embedding generation fails and stops the indexing process.

Changes

  • Enhanced error logging: Added detailed logging to capture the actual response from the Ollama API, making it easier to diagnose issues
  • Retry logic with exponential backoff: Implemented automatic retry (up to 3 attempts) for transient failures like network issues or temporary service unavailability
  • Better error detection: Added specific handling for model not found errors and invalid response formats
  • Improved error messages: Made error messages more descriptive to help users understand what went wrong
  • Response validation: Added validation to ensure the embeddings response is in the expected format (array of arrays)
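The response validation bullet could look roughly like the type guard below. This is a minimal sketch that assumes the `/api/embed` endpoint's JSON shape: an object whose `embeddings` field holds one numeric vector per input text. `isEmbeddingMatrix`, `describeShape` and `extractEmbeddings` are hypothetical names. Logging a shape summary rather than the raw body is a suggestion for keeping logs small, not something the PR is known to do.

```typescript
// Hypothetical helpers illustrating response validation; not the PR's code.

// True when every entry is a non-empty array of finite numbers.
function isEmbeddingMatrix(value: unknown): value is number[][] {
	return (
		Array.isArray(value) &&
		value.every(
			(row) =>
				Array.isArray(row) &&
				row.length > 0 &&
				row.every((x) => typeof x === "number" && Number.isFinite(x)),
		)
	)
}

// Compact description of a payload's structure, e.g. "array(2) of array(768) of number".
function describeShape(value: unknown): string {
	if (value === null) return "null"
	if (Array.isArray(value)) {
		return value.length > 0 ? `array(${value.length}) of ${describeShape(value[0])}` : "array(0)"
	}
	if (typeof value === "object") return `object{${Object.keys(value).join(",")}}`
	return typeof value
}

// Assumes a body like { embeddings: number[][] } with one vector per input text.
function extractEmbeddings(body: unknown, expectedCount: number): number[][] {
	const embeddings = (body as { embeddings?: unknown } | null | undefined)?.embeddings
	if (!isEmbeddingMatrix(embeddings) || embeddings.length !== expectedCount) {
		throw new Error(
			`Unexpected Ollama embeddings response (expected ${expectedCount} vector(s)): ${describeShape(body)}`,
		)
	}
	return embeddings
}
```

Logging `describeShape(body)` keeps vectors with hundreds of numbers out of the log while still showing, for example, that a server answered with an `embedding` field instead of `embeddings`.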

Testing

  • All existing tests pass ✅
  • The retry logic is intended to recover from transient failures (no dedicated tests yet)
  • The more descriptive error messages should help users diagnose configuration issues

Related Issue

Fixes #6526

Notes for Reviewers

The main improvements are:

  1. When Ollama returns an unexpected response, we now log the actual response structure to help diagnose the issue
  2. Transient network failures will be automatically retried with exponential backoff
  3. Model-specific errors (like model not found) are now properly detected and reported

This should help users who are experiencing embedding generation failures with local Ollama setups.
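For reviewers, the behavior in points 2 and 3 boils down to the pattern below. It is a minimal, generic sketch assuming three total attempts and a delay that starts at 1 s and doubles up to a 10 s cap (the values quoted later in the review); whether the PR doubles exactly is an assumption. `withRetry`, `backoffDelay` and the injectable `sleep` are hypothetical, since the PR inlines its loop in `createEmbeddings()`.

```typescript
// Hypothetical retry helper; not the PR's actual code.
interface RetryOptions {
	maxAttempts: number // total attempts, including the first one
	initialDelayMs: number
	maxDelayMs: number
	shouldRetry: (error: unknown) => boolean // false for permanent errors such as "model not found"
	sleep?: (ms: number) => Promise<void> // injectable so tests need not wait
}

// Delay after failed attempt `attempt` (1-based): initial * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelay(attempt: number, initialDelayMs: number, maxDelayMs: number): number {
	return Math.min(initialDelayMs * 2 ** (attempt - 1), maxDelayMs)
}

async function withRetry<T>(operation: (attempt: number) => Promise<T>, opts: RetryOptions): Promise<T> {
	const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)))
	let lastError: unknown
	for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
		try {
			return await operation(attempt)
		} catch (error) {
			lastError = error
			// Stop after the final attempt or on errors classified as permanent.
			if (attempt === opts.maxAttempts || !opts.shouldRetry(error)) break
			await sleep(backoffDelay(attempt, opts.initialDelayMs, opts.maxDelayMs))
		}
	}
	throw lastError
}
```

With these values, a request that keeps failing with retryable errors waits 1 s and then 2 s before giving up after the third attempt; the 10 s cap only comes into play if the attempt count is raised.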


Important

Improves error handling and adds retry logic with exponential backoff for Ollama embeddings in ollama.ts.

  • Error Handling:
    • Enhanced error logging in createEmbeddings() to capture the actual response from the Ollama API.
    • Added specific handling for model not found errors and invalid response formats.
    • Improved error messages for better user understanding.
  • Retry Logic:
    • Implemented retry logic with exponential backoff in createEmbeddings() for transient failures (up to 3 attempts).
    • Handles network issues, service unavailability, and other transient errors.
  • Response Validation:
    • Added validation in createEmbeddings() to ensure response is an array of arrays.
  • Telemetry:
    • Captures telemetry events for errors in createEmbeddings() and validateConfiguration().
  • Misc:
    • Adjusted timeout settings for embedding and validation requests.

This description was created by Ellipsis for b9c9d21.
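On the timeout adjustments: when a timeout is combined with retries, each attempt needs its own `AbortController`. Once a shared controller has aborted, its signal stays aborted, so every later `fetch` using it rejects immediately. A minimal sketch, assuming Node 18+ (global `fetch` and `Response`); `fetchWithTimeout` is a hypothetical helper, and the timeout value is left to the caller rather than guessing what `OLLAMA_EMBEDDING_TIMEOUT_MS` is set to.

```typescript
// Hypothetical helper: a fresh AbortController per call, so a timeout that
// fires during one attempt cannot cancel the next attempt.
async function fetchWithTimeout(url: string, init: RequestInit, timeoutMs: number): Promise<Response> {
	const controller = new AbortController()
	const timeoutId = setTimeout(() => controller.abort(), timeoutMs)
	try {
		return await fetch(url, { ...init, signal: controller.signal })
	} finally {
		// Clear the timer whether the request succeeded, failed, or was aborted.
		clearTimeout(timeoutId)
	}
}
```

Because `fetch` resolves once the response headers arrive, this timer does not cover reading the body; if slow bodies are a concern, parse the body inside the `try`. On Node 17.3+ `AbortSignal.timeout(ms)` is a shorter alternative that also yields a fresh signal per call.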

- Add detailed error logging to capture actual Ollama API responses
- Implement retry logic with exponential backoff for transient failures
- Add better error detection for model not found errors
- Improve error messages to be more descriptive
- Add validation for embeddings response format

Fixes #6526
@roomote roomote bot requested review from cte, jr and mrubens as code owners July 31, 2025 23:37
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jul 31, 2025

// Don't retry for certain errors
if (
error.message?.includes(t("embeddings:ollama.modelNotFound", { modelId: "" }).split(":")[0]) ||

Using the translated error message (via t()) and splitting it with split(":") to check for a model-not-found condition is brittle. Consider using custom error types or dedicated error codes for more robust error detection.
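One way to follow this suggestion is to attach a stable, locale-independent code to the error and use the `t()` output purely for display. `OllamaEmbedderError`, its codes and `isPermanentFailure` are hypothetical names; in the embedder the message would still come from `t("embeddings:ollama.modelNotFound", { modelId })`.

```typescript
// Hypothetical error codes: the message may be translated freely because
// control flow only ever inspects `code`.
type OllamaErrorCode = "MODEL_NOT_FOUND" | "INVALID_RESPONSE" | "CONNECTION_FAILED"

class OllamaEmbedderError extends Error {
	readonly code: OllamaErrorCode

	constructor(code: OllamaErrorCode, message: string) {
		super(message)
		this.name = "OllamaEmbedderError"
		this.code = code
		// Keeps `instanceof` working when compiling to ES5 targets.
		Object.setPrototypeOf(this, new.target.prototype)
	}
}

// Failures that a retry cannot fix.
const PERMANENT_CODES: ReadonlySet<OllamaErrorCode> = new Set<OllamaErrorCode>(["MODEL_NOT_FOUND", "INVALID_RESPONSE"])

function isPermanentFailure(error: unknown): boolean {
	return error instanceof OllamaEmbedderError && PERMANENT_CODES.has(error.code)
}
```

A German "Modell nicht gefunden" message classifies exactly like the English one, so translations can change without touching the retry check.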

@daniel-lxs

Closing, issue is not scoped

@daniel-lxs daniel-lxs closed this Jul 31, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 31, 2025
@github-project-automation github-project-automation bot moved this from Triage to Done in Roo Code Roadmap Jul 31, 2025

@roomote roomote bot left a comment

@roomote reviewed @roomote's code and found it needs more tests. Classic @roomote move. I've reviewed the changes and left some suggestions inline to improve the implementation.

// Add timeout to prevent indefinite hanging
const controller = new AbortController()
const timeoutId = setTimeout(() => controller.abort(), OLLAMA_EMBEDDING_TIMEOUT_MS)
for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {

This PR adds significant retry logic and error handling improvements but doesn't include any unit tests. Could we add tests to verify:

  • Retry behavior with exponential backoff
  • Proper handling of different error types
  • Maximum retry limit enforcement
  • Error message validation

This is especially important for ensuring the retry logic works as expected.
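The error-type and message checks can be written against an injected `fetch`, so no Ollama server is needed. The framework-agnostic sketch below is deliberately simplified: `TestableEmbedder`, `fakeFetch`, `expectRejection` and the plain English messages are hypothetical stand-ins for the real `OllamaEmbedder` and its `t()` strings, and it assumes Node 18+ for the global `Response`. In the real suite, the same kind of fake could fail a set number of times before succeeding, and the recorded URL count would then check the retry limit.

```typescript
// Hypothetical, simplified embedder that takes `fetch` as a dependency so
// tests can script responses without a running Ollama server.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>

class TestableEmbedder {
	constructor(
		private readonly baseUrl: string,
		private readonly modelId: string,
		private readonly fetchImpl: FetchLike,
	) {}

	async embed(texts: string[]): Promise<number[][]> {
		const res = await this.fetchImpl(`${this.baseUrl}/api/embed`, {
			method: "POST",
			headers: { "Content-Type": "application/json" },
			body: JSON.stringify({ model: this.modelId, input: texts }),
		})
		if (res.status === 404) throw new Error(`Model not found: ${this.modelId}`)
		if (!res.ok) throw new Error(`Ollama request failed with status ${res.status}`)
		const body = (await res.json()) as { embeddings?: unknown } | null
		const embeddings = body?.embeddings
		if (!Array.isArray(embeddings) || !embeddings.every((row) => Array.isArray(row))) {
			throw new Error(`Invalid embeddings response, keys: ${Object.keys(body ?? {}).join(",")}`)
		}
		return embeddings as number[][]
	}
}

// Fake fetch that always returns `status` with a JSON body and records each URL.
function fakeFetch(status: number, json: unknown): { fetchImpl: FetchLike; urls: string[] } {
	const urls: string[] = []
	const fetchImpl: FetchLike = async (url) => {
		urls.push(url)
		return new Response(JSON.stringify(json), { status, headers: { "Content-Type": "application/json" } })
	}
	return { fetchImpl, urls }
}

// Resolves if `promise` rejects with a message containing `fragment`; throws otherwise.
async function expectRejection(promise: Promise<unknown>, fragment: string): Promise<void> {
	try {
		await promise
	} catch (error) {
		if (error instanceof Error && error.message.includes(fragment)) return
		throw new Error(`Unexpected error: ${String(error)}`)
	}
	throw new Error(`Expected a rejection containing "${fragment}"`)
}

async function runEmbedderChecks(): Promise<void> {
	const baseUrl = "http://localhost:11434"
	const model = "nomic-embed-text"
	// A 404 is reported as a missing model, naming the model.
	const missing = fakeFetch(404, { error: "model not found" })
	await expectRejection(new TestableEmbedder(baseUrl, model, missing.fetchImpl).embed(["a"]), `Model not found: ${model}`)
	// A wrongly shaped body is rejected, and the error lists the keys actually received.
	const malformed = fakeFetch(200, { embedding: [0.1, 0.2] })
	await expectRejection(new TestableEmbedder(baseUrl, model, malformed.fetchImpl).embed(["a"]), "keys: embedding")
	// A well-formed body passes through after a single request.
	const ok = fakeFetch(200, { embeddings: [[0.1, 0.2]] })
	const vectors = await new TestableEmbedder(baseUrl, model, ok.fetchImpl).embed(["a"])
	if (vectors.length !== 1 || ok.urls.length !== 1 || ok.urls[0] !== `${baseUrl}/api/embed`) {
		throw new Error("Expected one vector from one request to /api/embed")
	}
}
```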

// Check if we should retry
if (attempt < MAX_RETRIES) {
// Check if it's a transient error that we should retry
if (

The retry logic currently retries for many error types. Is it intentional to retry for all of these?

  • ENOTFOUND might indicate a configuration issue (wrong host) rather than a transient failure
  • AbortError happens on timeout - retrying might just hit the same timeout again

Consider limiting retries to truly transient errors like ECONNRESET and ETIMEDOUT?
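A sketch of the narrower policy suggested here: retry only `ECONNRESET` and `ETIMEDOUT`, and fail fast on `ENOTFOUND` and `AbortError` with a reason the user can act on. `classifyForRetry` is a hypothetical name, and it assumes the error carries `code` directly, as the PR's own checks do.

```typescript
// Hypothetical retry policy following the review suggestion.
interface RetryDecision {
	retry: boolean
	reason: string
}

const TRANSIENT_CODES: ReadonlySet<string> = new Set(["ECONNRESET", "ETIMEDOUT"])

function classifyForRetry(error: unknown): RetryDecision {
	const err = error as { name?: unknown; code?: unknown } | null | undefined
	const name = err?.name
	const code = err?.code
	if (name === "AbortError") {
		// Our own timeout fired; an identical request would likely time out again.
		return { retry: false, reason: "request timed out" }
	}
	if (code === "ENOTFOUND") {
		// DNS lookup failed, which usually means a wrong base URL rather than a blip.
		return { retry: false, reason: "host not found, check the Ollama base URL" }
	}
	if (typeof code === "string" && TRANSIENT_CODES.has(code)) {
		return { retry: true, reason: `transient network error (${code})` }
	}
	return { retry: false, reason: "not a known transient error" }
}
```

Codes that fall through, such as `ECONNREFUSED` (usually Ollama not running), also fail fast, which surfaces setup problems on the first attempt instead of after the full backoff.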

const OLLAMA_VALIDATION_TIMEOUT_MS = 30000 // 30 seconds for validation requests

// Retry configuration
const MAX_RETRIES = 3

Consider making these retry parameters configurable through the ApiHandlerOptions:

Suggested change
- const MAX_RETRIES = 3
+ // Retry configuration
+ const MAX_RETRIES = options.ollamaMaxRetries ?? 3
+ const INITIAL_RETRY_DELAY_MS = options.ollamaInitialRetryDelay ?? 1000 // 1 second
+ const MAX_RETRY_DELAY_MS = options.ollamaMaxRetryDelay ?? 10000 // 10 seconds

This would allow users to adjust retry behavior based on their specific setup.
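One caveat with the suggestion as written: if these constants stay at module scope, as the neighboring timeout constants appear to, `options` is not in scope there, so the values would need to be resolved per instance, for example in the constructor. A sketch, where `ollamaMaxRetries`, `ollamaInitialRetryDelay` and `ollamaMaxRetryDelay` are the field names proposed above (not existing `ApiHandlerOptions` fields) and `RetryTuningOptions`, `resolveRetryConfig` and `ConfigurableOllamaEmbedder` are hypothetical stand-ins:

```typescript
// Local stand-in for the proposed option fields; not the real ApiHandlerOptions.
interface RetryTuningOptions {
	ollamaMaxRetries?: number
	ollamaInitialRetryDelay?: number
	ollamaMaxRetryDelay?: number
}

interface RetryConfig {
	maxRetries: number
	initialDelayMs: number
	maxDelayMs: number
}

// Use `value` if it is a finite number (raised to at least `min`), else `fallback`.
function finiteAtLeast(value: number | undefined, fallback: number, min: number): number {
	return value !== undefined && Number.isFinite(value) ? Math.max(min, value) : fallback
}

function resolveRetryConfig(options: RetryTuningOptions): RetryConfig {
	const maxRetries = Math.floor(finiteAtLeast(options.ollamaMaxRetries, 3, 1))
	const initialDelayMs = finiteAtLeast(options.ollamaInitialRetryDelay, 1000, 0)
	// The cap can never be below the first delay.
	const maxDelayMs = Math.max(initialDelayMs, finiteAtLeast(options.ollamaMaxRetryDelay, 10000, 0))
	return { maxRetries, initialDelayMs, maxDelayMs }
}

class ConfigurableOllamaEmbedder {
	readonly retryConfig: RetryConfig

	constructor(options: RetryTuningOptions) {
		// `options` is only in scope here, so defaults are applied per instance.
		this.retryConfig = resolveRetryConfig(options)
	}
}
```

The lower bound of 1 matters: with `MAX_RETRIES = 0`, a loop of the form `for (let attempt = 1; attempt <= MAX_RETRIES; attempt++)` would never send a request at all.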


// Don't retry for certain errors
if (
error.message?.includes(t("embeddings:ollama.modelNotFound", { modelId: "" }).split(":")[0]) ||

Using string matching on localized error messages could be fragile. If the translation changes, this check might break. Consider using a more robust approach, perhaps by checking the error type or adding a specific error code property?

// Handle specific error types with better messages
if (lastError.name === "AbortError") {
throw new Error(t("embeddings:validation.connectionFailed"))
} else if (lastError.message?.includes("fetch failed") || (lastError as any).code === "ECONNREFUSED") {

There's an inconsistency in error checking: sometimes the code uses error.code (line 174) and sometimes error.message?.includes() (line 208). Consider standardizing on one approach for better maintainability?
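Part of the reason both styles appear is that Node's built-in `fetch` rejects with a generic `TypeError` whose message is "fetch failed", while the underlying socket error, with its `code` such as `ECONNREFUSED`, is attached as `error.cause`. A small helper that walks the `cause` chain would let every check compare codes; `getErrorCode` is a hypothetical name, and the depth limit is an arbitrary guard against cyclic causes.

```typescript
// Hypothetical helper: return the first string `code` found on the error or
// along its `cause` chain (Node's fetch wraps socket errors this way).
function getErrorCode(error: unknown, maxDepth = 5): string | undefined {
	let current: unknown = error
	for (let depth = 0; depth < maxDepth && current !== null && typeof current === "object"; depth++) {
		const code = (current as { code?: unknown }).code
		if (typeof code === "string") return code
		current = (current as { cause?: unknown }).cause
	}
	return undefined
}
```

With it, a check like `lastError.message?.includes("fetch failed") || (lastError as any).code === "ECONNREFUSED"` could become `getErrorCode(lastError) === "ECONNREFUSED"`, and the retry condition could use the same helper.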

error.code === "ETIMEDOUT" ||
error.code === "ECONNRESET"
) {
console.log(`Ollama embedding attempt ${attempt} failed, retrying in ${retryDelay}ms...`)

Consider adding telemetry for retry attempts to help monitor retry patterns:

Suggested change
- console.log(`Ollama embedding attempt ${attempt} failed, retrying in ${retryDelay}ms...`)
+ console.log(`Ollama embedding attempt ${attempt} failed, retrying in ${retryDelay}ms...`)
+ TelemetryService.instance.captureEvent(TelemetryEventName.CODE_INDEX_RETRY, {
+   attempt,
+   retryDelay,
+   errorCode: error.code,
+   location: "OllamaEmbedder:createEmbeddings"
+ })


Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Failing to create code index using local setup

3 participants