feat(evals): add LLM-powered failure analysis to eval CI#2454
Open
Maahir Sachdev (maahir30) wants to merge 2 commits into main from
When evals fail in CI, it's often unclear why from the raw pass/fail counts alone. This adds a lightweight post-eval step that uses an LLM to analyze each failure's trajectory and produce a human-readable explanation, surfaced directly in the GitHub Actions summary.
**What changed:**

- **Reporter extension** (`libs/evals/tests/evals/pytest_reporter.py`): The pytest reporter now captures per-test failure details (test name, category, failure message including the agent trajectory) into a `failures` list in `evals_report.json`. Only populated for failures, so passing runs incur zero overhead. A sketch of the hook shape follows after this list.
- **Analysis script** (`.github/scripts/analyze_eval_failures.py`): A new script that reads the failures from the report, sends each one to an LLM (Claude Haiku by default) for analysis via `asyncio.gather` (all failures analyzed in parallel), and writes the results to `$GITHUB_STEP_SUMMARY` and a `failure_analysis.json` artifact. Exits immediately, at zero cost, when there are no failures. See the second sketch below for the overall flow.
- **CI workflow** (`.github/workflows/evals.yml`): Two new steps in the eval job: one to run the analysis script and one to upload the analysis artifact. Uses the existing `ANTHROPIC_API_KEY` secret; no new secrets are needed.
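For reviewers who want the shape of the reporter change without opening the file, here is a minimal sketch of how a pytest hook can collect failure details into the report. The hook names are standard pytest hooks; the exact field names, the `category` lookup, and the output handling are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of the reporter extension: collect per-test failure
# details during the run and emit them as a `failures` list in
# evals_report.json. Field names and the category lookup are assumptions.
import json

failures = []

def pytest_runtest_logreport(report):
    # Only record the call phase of tests that actually failed, so
    # passing runs add nothing to the report.
    if report.when == "call" and report.failed:
        failures.append({
            "test_name": report.nodeid,
            # Assumed here to come from a test marker/keyword.
            "category": dict(report.keywords).get("category", "unknown"),
            # longreprtext carries the failure message; for evals this is
            # where the agent trajectory surfaces.
            "failure_message": report.longreprtext,
        })

def pytest_sessionfinish(session, exitstatus):
    # In the real reporter this would merge into the existing report;
    # shown standalone here for brevity.
    with open("evals_report.json", "w") as f:
        json.dump({"failures": failures}, f, indent=2)
```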
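And a rough sketch of the analysis flow described above, assuming the `anthropic` Python SDK. The prompt wording, model id, and file paths are placeholders rather than the script's actual contents.

```python
# Hypothetical sketch: read failures from the report, analyze each with
# Claude Haiku in parallel via asyncio.gather, then write a Markdown
# summary to $GITHUB_STEP_SUMMARY and a failure_analysis.json artifact.
import asyncio
import json
import os

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # picks up ANTHROPIC_API_KEY from the environment

async def analyze_failure(failure: dict) -> dict:
    response = await client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed default ("Claude Haiku by default")
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Explain in a few sentences why this eval failed:\n\n"
                f"{failure['failure_message']}"
            ),
        }],
    )
    return {**failure, "analysis": response.content[0].text}

async def main() -> None:
    with open("evals_report.json") as f:
        failures = json.load(f).get("failures", [])
    if not failures:
        return  # no failures: exit immediately, no API calls made

    analyses = await asyncio.gather(*(analyze_failure(f) for f in failures))

    with open("failure_analysis.json", "w") as f:
        json.dump(analyses, f, indent=2)

    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a") as f:
            f.write("## Eval failure analysis\n\n")
            for item in analyses:
                f.write(f"### {item['test_name']}\n\n{item['analysis']}\n\n")

if __name__ == "__main__":
    asyncio.run(main())
```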