feat(evals): add LLM-powered failure analysis to eval CI #2454

Open
Maahir Sachdev (maahir30) wants to merge 2 commits into main from llm-eval-analyzer

@maahir30
When evals fail in CI, it's often unclear why from the raw pass/fail counts alone. This adds a lightweight post-eval step that uses an LLM to analyze each failure's trajectory and produce a human-readable explanation, surfaced directly in the GitHub Actions summary.

What changed:

  • Reporter extension (libs/evals/tests/evals/pytest_reporter.py): The pytest reporter now captures per-test failure details (test name, category, and failure message, including the agent trajectory) into a failures list in evals_report.json. The list is populated only for failures, so passing runs incur no extra overhead.

  • Analysis script (.github/scripts/analyze_eval_failures.py): A new script that reads the failures from the report, sends each one to an LLM (Claude Haiku by default) for analysis, and writes the results to $GITHUB_STEP_SUMMARY and a failure_analysis.json artifact. All failures are analyzed in parallel via asyncio.gather. When there are no failures, the script exits immediately at zero cost.

  • CI workflow (.github/workflows/evals.yml): Two new steps in the eval job: one to run the analysis script and one to upload the analysis artifact. The steps use the existing ANTHROPIC_API_KEY secret, so no new secrets are needed.
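
As a rough illustration of the reporter change, here is a minimal sketch of collecting failure details into a `failures` list and serializing the report. The names `EvalReport`, `FailureRecord`, and `record`, and the JSON keys `test_name`, `category`, and `failure_message`, are assumptions made for this example; the actual reporter's structure and field names may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class FailureRecord:
    # Hypothetical field names; the real report schema may differ.
    test_name: str
    category: str
    failure_message: str  # expected to include the agent trajectory


@dataclass
class EvalReport:
    passed: int = 0
    failed: int = 0
    failures: list[FailureRecord] = field(default_factory=list)

    def record(self, test_name: str, category: str, passed: bool,
               failure_message: str = "") -> None:
        """Count the outcome; store details only when the test failed."""
        if passed:
            self.passed += 1
            return
        self.failed += 1
        self.failures.append(FailureRecord(test_name, category, failure_message))

    def write(self, path: str = "evals_report.json") -> None:
        """Serialize the whole report, including nested failures, to JSON."""
        Path(path).write_text(json.dumps(asdict(self), indent=2))
```

Because `record` returns early for passing tests, a fully green run produces an empty `failures` list and does no extra work beyond incrementing a counter.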

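The analysis flow described above (early exit, parallel analysis with `asyncio.gather`, then writing the summary and artifact) could look roughly like the sketch below. The LLM call is replaced by a stub, `fake_llm`, so the example runs offline; that stub, the function names, the result keys, and the summary heading are all assumptions for illustration, not the script's actual contents.

```python
import asyncio
import json
import os
from pathlib import Path
from typing import Awaitable, Callable

# Any async function that maps one failure dict to an explanation string.
Analyzer = Callable[[dict], Awaitable[str]]


async def analyze_failures(failures: list[dict], analyze_one: Analyzer) -> list[dict]:
    """Analyze all failures concurrently; gather returns results in input order."""
    if not failures:
        return []
    explanations = await asyncio.gather(*(analyze_one(f) for f in failures))
    return [
        {"test_name": f["test_name"], "explanation": e}
        for f, e in zip(failures, explanations)
    ]


def render_summary(results: list[dict]) -> str:
    """Build a small Markdown section for the GitHub Actions job summary."""
    lines = ["## Eval failure analysis", ""]
    for r in results:
        lines.append(f"- **{r['test_name']}**: {r['explanation']}")
    return "\n".join(lines) + "\n"


async def fake_llm(failure: dict) -> str:
    # Stand-in for the real model call (assumption: the actual script calls an LLM here).
    await asyncio.sleep(0)
    return f"stub analysis of {failure['category']} failure"


def main(report_path: str = "evals_report.json") -> int:
    report = json.loads(Path(report_path).read_text())
    failures = report.get("failures", [])
    if not failures:
        return 0  # nothing to analyze, so no model calls are made
    results = asyncio.run(analyze_failures(failures, fake_llm))
    Path("failure_analysis.json").write_text(json.dumps(results, indent=2))
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        # Append rather than overwrite, since other steps may also write to the summary.
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(render_summary(results))
    return 0
```

Passing the analyzer in as a parameter keeps the concurrency logic testable without network access; a real analyzer would take the place of `fake_llm`.
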
github-actions bot added the evals, feature, github_actions, internal, and size: M labels on Apr 4, 2026