
Conversation


@wsxzei wsxzei commented Jan 7, 2026

Summary

Fix misleading error messages in the evaluation table by capturing and displaying the actual error messages from LLM providers instead of generic HTTP error codes.

Problem

When LLM providers return errors (e.g., rate limits, quota exceeded), the evaluation table showed only generic messages like "HTTP 429: Too Many Requests". This confused users because they couldn't tell:

  • Whether the issue was with Agenta's infrastructure
  • Whether it was their API key configuration
  • Whether it was their provider's quota limits
  • What they needed to do to fix it

Solution

Enhanced error handling in invoke_app (api/oss/src/services/llm_apps_service.py) to parse and display the actual provider error messages (a sketch of the prioritization follows the list):

  1. Parse detail.message from the LLM provider's error response as the primary error message
  2. Fallback to detail.error when message is not available
  3. Use generic HTTP error info (status code + message) as last resort
  4. Extract stacktrace from multiple response formats and store both user-friendly message and technical details in PostgreSQL
  5. Improve error message prioritization: provider message > provider error > generic HTTP info
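
For illustration, here is a minimal sketch of that prioritization, assuming the provider's error body has already been decoded into a dict with the detail.message / detail.error shape described above. The helper name extract_error_details and the exact field layout are illustrative, not the literal implementation in llm_apps_service.py:

from typing import Any, Dict, Optional, Tuple

def extract_error_details(
    body: Dict[str, Any], status: int, reason: str
) -> Tuple[str, Optional[str]]:
    """Pick the most useful error message and, if present, a stacktrace.

    Priority: provider message > provider error > generic HTTP info.
    """
    detail = body.get("detail") or {}
    if not isinstance(detail, dict):
        detail = {"message": str(detail)}

    message = (
        detail.get("message")            # 1. provider's own message
        or detail.get("error")           # 2. provider's error field
        or f"HTTP {status}: {reason}"    # 3. generic HTTP fallback
    )

    # The stacktrace can appear under different keys depending on the response format.
    stacktrace = (
        detail.get("stacktrace") or detail.get("traceback") or body.get("stacktrace")
    )
    return message, stacktrace

# Example: a 429 whose body carries the provider's explanation.
msg, trace = extract_error_details(
    {"detail": {"message": "OpenAI rate limit exceeded"}},
    status=429,
    reason="Too Many Requests",
)
assert msg == "OpenAI rate limit exceeded"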

Impact

Before

Before Evaluation Table
Before Evaluation Result Detail

After

After Evaluation Table
After Evaluation Result Detail

Changes Made

  • Users now see actionable error messages like "OpenAI rate limit exceeded" instead of "HTTP 429: Too Many Requests"
  • Technical details (status code, full response, stacktrace) are still available for debugging

Related Issues

Closes #3324

Enhance error handling in invoke_app to display actual provider error
messages instead of generic HTTP error codes. This improves user
experience by showing actionable error information (e.g., "OpenAI
rate limit exceeded" instead of "HTTP 429: Too Many Requests").

Changes:
- Parse response detail.message/stacktrace from LLM provider errors
- Preserve HTTP status code and full error response for debugging
- Add detailed stacktrace extraction from multiple response formats
- Improve error message prioritization (provider message > generic error)

Previously, the evaluation table showed ambiguous error messages that
didn't explain what went wrong or how to fix it. Users can now see
the actual provider error and hover for technical details.

Related issue: Agenta-AI#3324 - [UX bug] Misleading error message in evaluation table

vercel bot commented Jan 7, 2026

Someone is attempting to deploy a commit to the agenta projects Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files) label Jan 7, 2026

CLAassistant commented Jan 7, 2026

CLA assistant check
All committers have signed the CLA.

@dosubot dosubot bot added the Backend and dev experience (Improvement of the experience using the software, for instance better error messaging) labels Jan 7, 2026

wsxzei commented Jan 9, 2026

Hi @ashrafchowdury, I've completed the fix for #3324 and verified it with testing. I see the Vercel preview check is currently failing, so I'm not entirely sure about the correct next steps. Please let me know what specific changes or additions are needed from my side, and I'm happy to update accordingly.

@ashrafchowdury
Contributor

Thanks for the contribution, @wsxzei. I believe this issue should be addressed on the frontend, as it appears to be an error in how the message is parsed. We just need to ensure that the UI displays the actual message correctly.


wsxzei commented Jan 12, 2026

Quoting @ashrafchowdury: "Thanks for the contribution, @wsxzei. I believe this issue should be addressed on the frontend, as it appears to be an error in how the message is parsed. We just need to ensure that the UI displays the actual message correctly."

Thanks for your review, @ashrafchowdury. I appreciate your perspective on this issue. I also initially thought this was a frontend issue, but after tracing the request flow during debugging, I found the problem is actually in the backend error handling. The system has two evaluation modes with different architectures:

Human Evals (frontend handles LLM invocation results)

  • Frontend's runInvocation calls the Completion service /test endpoint directly (in web/oss/src/services/evaluations/invocations/api.ts)
  • Parses the response and stores results in the evaluation_results table via the /preview/evaluations/results API
  • This mode works correctly since the frontend controls the full error parsing

Auto Evals (backend handles LLM invocation results)

  • Frontend triggers the evaluation task via /evaluations/preview/start (called by createEvaluation in web/oss/src/services/evaluations/api/index.ts); the task is then queued to the worker service
  • Worker service handles the entire flow: invoking models → parsing responses → storing results in evaluation_results table
  • Frontend just displays what's stored in the database

The issue occurs in Auto Evals mode. In api/oss/src/services/llm_apps_service.py, the invoke_app function catches HTTP errors but generates generic messages:

except aiohttp.ClientResponseError as e:
    # Falls back from detail.error straight to generic HTTP info;
    # the provider's detail.message is never read.
    error_message = app_response.get("detail", {}).get(
        "error", f"HTTP error {e.status}: {e.message}"
    )

When LLM providers return quota/rate limit errors, the actual error details are in the response body. The current code doesn't extract this provider-specific message, so users see generic text like "HTTP error 429: Too Many Requests" instead of the real error.
Since the error message is written to the evaluation_results table by the backend worker before the frontend sees it, and the raw response isn't persisted, the frontend can't retrieve the original provider error. This needs to be fixed in the backend error handling.
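
To make the proposed backend fix concrete, here is a self-contained sketch of the idea, not the literal invoke_app implementation: read the body before raising, then prefer the provider's detail.message over detail.error and the generic HTTP text. Function and field names are illustrative and may differ from the actual code:

import aiohttp

async def invoke_with_provider_errors(url: str, payload: dict) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            # Read the body first so the provider's message survives the exception.
            body = await resp.json(content_type=None)
            try:
                resp.raise_for_status()
            except aiohttp.ClientResponseError as e:
                detail = body.get("detail", {}) if isinstance(body, dict) else {}
                if not isinstance(detail, dict):
                    detail = {"message": str(detail)}
                message = (
                    detail.get("message")                      # provider message
                    or detail.get("error")                     # provider error field
                    or f"HTTP error {e.status}: {e.message}"   # generic fallback
                )
                # Surface the actionable message; keep status and raw body for debugging.
                return {"error": message, "status": e.status, "raw": body}
            return body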

